Stuck on Spark Again

Last time I tried building Spark from source and got thoroughly stuck...
Stuck on Installing Spark [1] - Why Do So Many Mathematicians Love Wine?
Stuck on Installing Spark [2] - Why Do So Many Mathematicians Love Wine?

This time I figured I'd cut a corner and use the package prebuilt for CDH.

> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.0-bin-cdh4.tgz
> su
# tar xf spark-1.0.0-bin-cdh4.tgz -C /usr/local
# cd /usr/local/
# ln -s spark-1.0.0-bin-cdh4/ spark
# exit

※ First, local mode.

> /usr/local/spark/bin/run-example SparkPi 3


14/07/02 16:31:35 INFO SparkContext: Starting job: reduce at SparkPi.scala:35
14/07/02 16:31:35 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35) with 3 output partitions (allowLocal=false)
14/07/02 16:31:35 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkPi.scala:35)
14/07/02 16:31:35 INFO DAGScheduler: Parents of final stage: List()
14/07/02 16:31:35 INFO DAGScheduler: Missing parents: List()
14/07/02 16:31:35 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkPi.scala:31), which has no missing parents
14/07/02 16:31:35 INFO DAGScheduler: Submitting 3 missing tasks from Stage 0 (MappedRDD[1] at map at SparkPi.scala:31)
14/07/02 16:31:35 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
14/07/02 16:31:35 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/07/02 16:31:35 INFO TaskSetManager: Serialized task 0.0:0 as 1424 bytes in 2 ms
14/07/02 16:31:35 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/07/02 16:31:35 INFO TaskSetManager: Serialized task 0.0:1 as 1424 bytes in 0 ms
14/07/02 16:31:35 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/07/02 16:31:35 INFO TaskSetManager: Serialized task 0.0:2 as 1424 bytes in 1 ms
14/07/02 16:31:35 INFO Executor: Running task ID 1
14/07/02 16:31:35 INFO Executor: Running task ID 0
14/07/02 16:31:35 INFO Executor: Running task ID 2
14/07/02 16:31:35 INFO Executor: Fetching http://192.168.0.10:48205/jars/spark-examples-1.0.0-hadoop2.0.0-mr1-cdh4.2.0.jar with timestamp 1404286295464
14/07/02 16:31:35 INFO Utils: Fetching http://192.168.0.10:48205/jars/spark-examples-1.0.0-hadoop2.0.0-mr1-cdh4.2.0.jar to /tmp/fetchFileTemp6531578495995963794.tmp
14/07/02 16:31:36 INFO Executor: Adding file:/tmp/spark-3044ce77-f9a9-4370-b6a4-2fd11a69f14a/spark-examples-1.0.0-hadoop2.0.0-mr1-cdh4.2.0.jar to class loader
14/07/02 16:31:36 INFO Executor: Serialized size of result for 2 is 675
14/07/02 16:31:36 INFO Executor: Serialized size of result for 0 is 675
14/07/02 16:31:36 INFO Executor: Serialized size of result for 1 is 675
14/07/02 16:31:36 INFO Executor: Sending result for 2 directly to driver
14/07/02 16:31:36 INFO Executor: Sending result for 0 directly to driver
14/07/02 16:31:36 INFO Executor: Sending result for 1 directly to driver
14/07/02 16:31:36 INFO Executor: Finished task ID 2
14/07/02 16:31:36 INFO Executor: Finished task ID 0
14/07/02 16:31:36 INFO Executor: Finished task ID 1
14/07/02 16:31:36 INFO TaskSetManager: Finished TID 2 in 611 ms on localhost (progress: 1/3)
14/07/02 16:31:36 INFO DAGScheduler: Completed ResultTask(0, 2)
14/07/02 16:31:36 INFO DAGScheduler: Completed ResultTask(0, 0)
14/07/02 16:31:36 INFO TaskSetManager: Finished TID 0 in 627 ms on localhost (progress: 2/3)
14/07/02 16:31:36 INFO DAGScheduler: Completed ResultTask(0, 1)
14/07/02 16:31:36 INFO TaskSetManager: Finished TID 1 in 619 ms on localhost (progress: 3/3)
14/07/02 16:31:36 INFO DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 0.642 s
14/07/02 16:31:36 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
14/07/02 16:31:36 INFO SparkContext: Job finished: reduce at SparkPi.scala:35, took 0.774402315 s
Pi is roughly 3.1403066666666666

No problems here.

Next, YARN client mode.

> MASTER=yarn-client HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop /usr/local/spark/bin/run-example SparkPi 3
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
Run with --help for usage help or --verbose for debug output

Huh?
Is this because I cut that corner?

Hunting for where this message is emitted, I found it around here:

if (!Utils.classIsLoadable("org.apache.spark.deploy.yarn.Client") && !Utils.isTesting) {
  val msg = "Could not load YARN classes. This copy of Spark may not have been compiled " +
    "with YARN support."
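For reference, this is roughly how I'd track the message down in an unpacked source tree — a minimal sketch, and the assumption that the check lives somewhere under `core/src/main/scala` is mine:

```shell
# Search the unpacked Spark 1.0.0 sources for the error string.
# Assumption: the source tarball has been extracted to ./spark-1.0.0.
grep -rn "Could not load YARN classes" spark-1.0.0/core/src/main/scala
```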

So let's check whether org.apache.spark.deploy.yarn.Client is actually in the jar.

> jar tvf lib/spark-assembly-1.0.0-hadoop2.0.0-mr1-cdh4.2.0.jar | grep org/apache/spark/deploy/yarn/Client.class
>

It's not there!

Let's compare against the build from last time, the one that sort of worked after much suffering:
[Resolved for Now] Stuck on Installing Spark [5] - Why Do So Many Mathematicians Love Wine?

> jar tvf spark-assembly_2.10-0.9.1-hadoop2.2.0.jar | grep org/apache/spark/deploy/yarn/Client.class
 31306 Thu Mar 27 05:50:52 JST 2014 org/apache/spark/deploy/yarn/Client.class

It's there!

Lesson learned: no cutting corners. Rather than leaning on the easy prebuilt binaries, I'll buckle down and do a source install after all.
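As a note to self, the build will probably look something like this — a sketch based on the Spark 1.0.0 build documentation, not yet verified on this machine; the `yarn-alpha` profile and the CDH4 Hadoop version string are my assumptions for this setup:

```shell
# Build Spark 1.0.0 with YARN support against CDH4 (sketch, unverified).
# CDH 4.2.x is based on Hadoop 2.0.0-alpha, which is why the yarn-alpha
# profile (rather than yarn, used for Hadoop 2.2+) should be the right one.
cd spark-1.0.0
mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package

# Or the sbt route:
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
```

Afterwards, the same `jar tvf ... | grep org/apache/spark/deploy/yarn/Client.class` check from above should confirm whether the YARN classes actually made it into the assembly this time.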