That time the Spark build was too easy


Cutting corners is no good, so...
Stuck on Spark again - なぜか数学者にはワイン好きが多い

This time I'll do a proper install from source.

I want to keep using the existing Hadoop environment as much as possible, so instead of Spark's local, standalone, or Mesos modes, I'll build for YARN mode. The cluster is running Hadoop CDH4.4.
The documentation explains how to specify the Hadoop/CDH version, so I'll copy that to the letter.

> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.0.tgz
> tar xf spark-1.0.0.tgz
> cd spark-1.0.0
> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
> mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.4.0 -DskipTests clean package

Downloading: http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-yarn-common/2.0.0-cdh4.4.0/hadoop-yarn-common-2.0.0-cdh4.4.0.pom
(中略)

※↓ I didn't really understand this warning, but the build kept going, so I ignored it:
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
(中略)

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project spark-core_2.10: An Ant BuildException has occured: Please set the SCALA_HOME
 (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry.
[ERROR] around Ant part ...<fail message="Please set the SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry.">... @ 6:126 in /home/foo/spark-1.0.0/core/target/antrun/build-main.xml

Ah, so Scala has to be installed first.

> wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
> su
# tar xf scala-2.10.4.tgz -C /usr/local
# ln -sv /usr/local/scala-2.10.4 /usr/local/scala
`/usr/local/scala' -> `/usr/local/scala-2.10.4'
# exit

Take two.

> export SCALA_HOME=/usr/local/scala
> mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.4.0 -DskipTests package
(中略)
[ERROR] /home/foo/spark-1.0.0/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:36: object AMResponse is not a member of package org.apache.hadoop.yarn.api.records
[ERROR] import org.apache.hadoop.yarn.api.records.{AMResponse, ApplicationAttemptId}
[ERROR]        ^
[ERROR] /home/foo/spark-1.0.0/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:110: value getAMResponse is not a member of org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse
[ERROR]     val amResp = allocateExecutorResources(executorsToRequest).getAMResponse
[ERROR]                                                                ^
[ERROR] two errors found

Error.

But this time I have the know-how from the previous round!
[Resolved for now] Stuck installing Spark [5] - なぜか数学者にはワイン好きが多い

The method call that's failing in the first place,

[error] /usr/local/spark-0.9.1-bin-hadoop2/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:106: value getAMResponse is not a member of org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse
[error] val amResp = allocateWorkerResources(workersToRequest).getAMResponse

is apparently unnecessary. Now that they mention it, that does seem right.

So I edited
spark-1.0.0/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
as follows:

//import org.apache.hadoop.yarn.api.records.{AMResponse, ApplicationAttemptId}
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId

    // Keep polling the Resource Manager for containers
//    val amResp = allocateExecutorResources(executorsToRequest).getAMResponse
    val amResp = allocateExecutorResources(executorsToRequest)
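The same two edits can be scripted; here is a minimal sketch with sed, run from the spark-1.0.0 source root (the in-place `-i` flag assumes GNU sed):

```shell
# Apply the two-line workaround above with sed (GNU sed, run inside spark-1.0.0/).
F=yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
sed -i \
  -e 's/{AMResponse, ApplicationAttemptId}/ApplicationAttemptId/' \
  -e 's/\.getAMResponse$//' \
  "$F"
```

The first expression narrows the import, the second drops the trailing `.getAMResponse` call.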


Trying again.

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .......................... SUCCESS [3.493s]
[INFO] Spark Project Core ................................ SUCCESS [18.886s]
[INFO] Spark Project Bagel ............................... SUCCESS [2.343s]
[INFO] Spark Project GraphX .............................. SUCCESS [3.148s]
[INFO] Spark Project ML Library .......................... SUCCESS [2.899s]
[INFO] Spark Project Streaming ........................... SUCCESS [3.553s]
[INFO] Spark Project Tools ............................... SUCCESS [0.828s]
[INFO] Spark Project Catalyst ............................ SUCCESS [3.429s]
[INFO] Spark Project SQL ................................. SUCCESS [1.317s]
[INFO] Spark Project Hive ................................ SUCCESS [4.915s]
[INFO] Spark Project REPL ................................ SUCCESS [1.466s]
[INFO] Spark Project YARN Parent POM ..................... SUCCESS [1.480s]
[INFO] Spark Project YARN Alpha API ...................... SUCCESS [36.159s]
[INFO] Spark Project Assembly ............................ SUCCESS [39.842s]
[INFO] Spark Project External Twitter .................... SUCCESS [23.469s]
[INFO] Spark Project External Kafka ...................... SUCCESS [33.848s]
[INFO] Spark Project External Flume ...................... SUCCESS [27.625s]
[INFO] Spark Project External ZeroMQ ..................... SUCCESS [26.915s]
[INFO] Spark Project External MQTT ....................... SUCCESS [27.329s]
[INFO] Spark Project Examples ............................ SUCCESS [1:45.440s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 6:09.339s
[INFO] Finished at: Thu Jul 03 13:27:09 JST 2014
[INFO] Final Memory: 55M/968M
[INFO] ------------------------------------------------------------------------

OK.

For a start, I'd like to run the Pi example.

First up, YARN client mode.

HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client examples/target/spark-examples_2.10-1.0.0.jar 2

※ Since this is client mode, the driver sets up communication from the client machine
14/07/03 13:36:00 INFO Remoting: Starting remoting
14/07/03 13:36:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@myclientnode:58567]
14/07/03 13:36:00 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@myclientnode:58567]


14/07/03 13:36:01 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/07/03 13:36:01 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.

※ Gets the number of worker nodes — strictly speaking, YARN NodeManagers, not DataNodes
14/07/03 13:36:02 INFO yarn.Client: Got Cluster metric info from ASM, numNodeManagers = 11


14/07/03 13:36:02 INFO yarn.Client: Preparing Local resources

※ Uploads the Spark assembly to HDFS so it can be shared with the worker nodes
14/07/03 13:36:03 INFO yarn.Client: Uploading file:/home/foo/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar to hdfs://mycluster/user/hadoop/.sparkStaging/application_1401264313901_22542/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar

※ Leaves the job to the ASM (the ResourceManager's ApplicationsManager)
14/07/03 13:36:05 INFO yarn.Client: Submitting application to ASM
14/07/03 13:36:05 INFO client.YarnClientImpl: Submitted application application_1401264313901_22542 to ResourceManager at myresourcemanager/192.168.26.29:8040
14/07/03 13:36:05 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: 
         appMasterRpcPort: 0
         appStartTime: 1404362165270
         yarnAppState: ACCEPTED


14/07/03 13:36:12 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@mydatanode01:40692/user/Executor#81931987] with ID 2
14/07/03 13:36:12 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor 2: mydatanode01 (PROCESS_LOCAL)
14/07/03 13:36:12 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1404 bytes in 2 ms
14/07/03 13:36:13 INFO storage.BlockManagerInfo: Registering block manager mydatanode01:52317 with 589.2 MB RAM
14/07/03 13:36:13 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@mydatanode02:60460/user/Executor#-1060070018] with ID 1
14/07/03 13:36:13 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor 1: mydatanode02 (PROCESS_LOCAL)
14/07/03 13:36:13 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1404 bytes in 0 ms
14/07/03 13:36:13 INFO storage.BlockManagerInfo: Registering block manager mydatanode02:38750 with 589.2 MB RAM


14/07/03 13:36:14 INFO scheduler.TaskSetManager: Finished TID 1 in 922 ms on mydatanode02 (progress: 2/2)
14/07/03 13:36:14 INFO scheduler.DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 2.516 s
14/07/03 13:36:14 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
14/07/03 13:36:14 INFO spark.SparkContext: Job finished: reduce at SparkPi.scala:35, took 2.653324513 s

※ The approximate value of pi
Pi is roughly 3.14024
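For the record, what SparkPi is computing: it samples random points in the square [-1,1]² and counts the fraction that land inside the unit circle, which converges to π/4. A single-machine sketch of the same estimate (plain awk, nothing Spark-specific):

```shell
# Monte Carlo estimate of pi -- the same idea SparkPi spreads across executors.
awk 'BEGIN {
  srand(42)                                   # fixed seed, for repeatability
  n = 1000000; hits = 0
  for (i = 0; i < n; i++) {
    x = 2 * rand() - 1; y = 2 * rand() - 1    # random point in [-1,1]^2
    if (x * x + y * y <= 1) hits++            # inside the unit circle?
  }
  printf "Pi is roughly %f\n", 4 * hits / n
}'
```

With a million samples the estimate typically lands within a few thousandths of π; the `2` argument to SparkPi above just splits the sampling into two tasks.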

Next, YARN cluster mode.

HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster examples/target/spark-examples_2.10-1.0.0.jar 2

※ Again gets the number of NodeManagers in the cluster
14/07/03 13:48:12 INFO yarn.Client: Got Cluster metric info from ASM, numNodeManagers = 11
14/07/03 13:48:12 INFO yarn.Client: Queue info ... queueName = default, queueCurrentCapacity = 0.0, queueMaxCapacity = 1.0,
      queueApplicationCount = 10000, queueChildQueueCount = 0


14/07/03 13:48:12 INFO yarn.Client: Preparing Local resources

※ Uploads the application jar itself to HDFS so the worker nodes can share it
14/07/03 13:48:13 INFO yarn.Client: Uploading file:/home/foo/spark-1.0.0/examples/target/spark-examples_2.10-1.0.0.jar to hdfs://mycluster/user/hadoop/.sparkStaging/application_1401264313901_22543/spark-examples_2.10-1.0.0.jar

※ Uploads the Spark assembly to HDFS as well
14/07/03 13:48:13 INFO yarn.Client: Uploading file:/home/foo/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar to hdfs://mycluster/user/hadoop/.sparkStaging/application_1401264313901_22543/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar

※ Hands execution off to the cluster and detaches — in cluster mode the driver runs inside the ApplicationMaster
14/07/03 13:48:15 INFO yarn.Client: Submitting application to ASM
14/07/03 13:48:15 INFO client.YarnClientImpl: Submitted application application_1401264313901_22543 to ResourceManager at myresourcemanager/192.168.26.29:8040
14/07/03 13:48:16 INFO yarn.Client: Application report from ASM: 
         application identifier: application_1401264313901_22543
         appId: 22543
         clientToken: null
         appDiagnostics: 
         appMasterHost: N/A
         appQueue: default
         appMasterRpcPort: 0
         appStartTime: 1404362895315
         yarnAppState: ACCEPTED
         distributedFinalState: UNDEFINED
         appTrackingUrl: myresourcemanager:8088/proxy/application_1401264313901_22543/
         appUser: hadoop

※ Periodically polls the ResourceManager for the application's status
14/07/03 13:48:20 INFO yarn.Client: Application report from ASM: 
         application identifier: application_1401264313901_22543
         appId: 22543
         clientToken: null
         appDiagnostics: 
         appMasterHost: mydatanode03
         appQueue: default
         appMasterRpcPort: 0
         appStartTime: 1404362895315
         yarnAppState: RUNNING
         distributedFinalState: UNDEFINED
         appTrackingUrl: myresourcemanager:8088/proxy/application_1401264313901_22543/
         appUser: hadoop


14/07/03 13:48:25 INFO yarn.Client: Application report from ASM: 
         application identifier: application_1401264313901_22543
         appId: 22543
         clientToken: null
         appDiagnostics: 
         appMasterHost: mydatanode03
         appQueue: default
         appMasterRpcPort: 0
         appStartTime: 1404362895315
         yarnAppState: FINISHED
         distributedFinalState: SUCCEEDED
         appTrackingUrl: 
         appUser: hadoop

Moving over to mydatanode03, where the application ran...

cat userlogs/application_1401264313901_22543/*/stdout
Pi is roughly 3.14018

No problems.
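Incidentally, the two spark-submit invocations above differ only in the value of --master, so a tiny wrapper function (my own helper, not part of Spark; paths are the ones used in this post) makes the comparison explicit:

```shell
# submit_pi MASTER [SLICES] -- run the SparkPi example on YARN.
# Run from the spark-1.0.0 build directory.
submit_pi() {
  master="$1"; slices="${2:-2}"
  HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop \
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master "$master" \
      examples/target/spark-examples_2.10-1.0.0.jar "$slices"
}

# submit_pi yarn-client
# submit_pi yarn-cluster 2
```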
Next time I'll try running an actual program of my own rather than the bundled examples.