About that too-easy Spark build
Cutting corners is no good, so...
Stuck on Spark again - なぜか数学者にはワイン好きが多い
...this time I'll do a proper install from source.
I want to make the most of the existing Hadoop environment, so rather than Spark's local, standalone, or Mesos modes, I'll set up YARN mode. The cluster is running Hadoop CDH4.4.
The documentation explains how to specify the Hadoop/CDH version, so I'll follow it to the letter.
> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.0.tgz
> tar xf spark-1.0.0.tgz
> cd spark-1.0.0
> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
> mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.4.0 -DskipTests clean package
Downloading: http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-yarn-common/2.0.0-cdh4.4.0/hadoop-yarn-common-2.0.0-cdh4.4.0.pom
(snip)
※ ↓ I didn't really understand this warning, but the build kept going, so I ignored it
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
(snip)
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project spark-core_2.10: An Ant BuildException has occured: Please set the SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry.
[ERROR] around Ant part ...<fail message="Please set the SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry.">... @ 6:126 in /home/foo/spark-1.0.0/core/target/antrun/build-main.xml
Ah, so Scala needs to be installed first.
> wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
> su
# tar xf scala-2.10.4.tgz -C /usr/local
# ln -sv /usr/local/scala-2.10.4 /usr/local/scala
`/usr/local/scala' -> `/usr/local/scala-2.10.4'
# exit
Trying again.
> export SCALA_HOME=/usr/local/scala
> mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.4.0 -DskipTests package
(snip)
[ERROR] /home/foo/spark-1.0.0/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:36: object AMResponse is not a member of package org.apache.hadoop.yarn.api.records
[ERROR] import org.apache.hadoop.yarn.api.records.{AMResponse, ApplicationAttemptId}
[ERROR]                                            ^
[ERROR] /home/foo/spark-1.0.0/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:110: value getAMResponse is not a member of org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse
[ERROR]     val amResp = allocateExecutorResources(executorsToRequest).getAMResponse
[ERROR]                                                                ^
[ERROR] two errors found
An error.
But this time I have the know-how from last time!
[Resolved, for now] Stuck installing Spark [5] - なぜか数学者にはワイン好きが多い
The method call that the error is complaining about:

[error] /usr/local/spark-0.9.1-bin-hadoop2/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:106: value getAMResponse is not a member of org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse
[error] val amResp = allocateWorkerResources(workersToRequest).getAMResponse

turns out to be unnecessary. Now that I look at it again, that does seem right.
So I edited

spark-1.0.0/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala

as follows:
//import org.apache.hadoop.yarn.api.records.{AMResponse, ApplicationAttemptId}
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId

// Keep polling the Resource Manager for containers
// val amResp = allocateExecutorResources(executorsToRequest).getAMResponse
val amResp = allocateExecutorResources(executorsToRequest)
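For context on why dropping `.getAMResponse` is safe here: in the YARN build that ships with CDH4.4, the accessors that the old `AMResponse` wrapper used to hold live directly on `AllocateResponse`, which is exactly what the two compile errors were saying. A miniature sketch of the intended call shape (this assumes `hadoop-yarn-api` 2.0.0-cdh4.4.0 on the classpath and is not runnable standalone; the object and method names here are mine, for illustration only):

```scala
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse

object AllocateSketch {
  // AllocateResponse itself now carries the allocation data, so the
  // extra .getAMResponse hop that older Spark/YARN code used is gone.
  def containersGranted(resp: AllocateResponse): Int =
    resp.getAllocatedContainers().size()
}
```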
Trying the build once more.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [3.493s]
[INFO] Spark Project Core ................................ SUCCESS [18.886s]
[INFO] Spark Project Bagel ............................... SUCCESS [2.343s]
[INFO] Spark Project GraphX .............................. SUCCESS [3.148s]
[INFO] Spark Project ML Library .......................... SUCCESS [2.899s]
[INFO] Spark Project Streaming ........................... SUCCESS [3.553s]
[INFO] Spark Project Tools ............................... SUCCESS [0.828s]
[INFO] Spark Project Catalyst ............................ SUCCESS [3.429s]
[INFO] Spark Project SQL ................................. SUCCESS [1.317s]
[INFO] Spark Project Hive ................................ SUCCESS [4.915s]
[INFO] Spark Project REPL ................................ SUCCESS [1.466s]
[INFO] Spark Project YARN Parent POM ..................... SUCCESS [1.480s]
[INFO] Spark Project YARN Alpha API ...................... SUCCESS [36.159s]
[INFO] Spark Project Assembly ............................ SUCCESS [39.842s]
[INFO] Spark Project External Twitter .................... SUCCESS [23.469s]
[INFO] Spark Project External Kafka ...................... SUCCESS [33.848s]
[INFO] Spark Project External Flume ...................... SUCCESS [27.625s]
[INFO] Spark Project External ZeroMQ ..................... SUCCESS [26.915s]
[INFO] Spark Project External MQTT ....................... SUCCESS [27.329s]
[INFO] Spark Project Examples ............................ SUCCESS [1:45.440s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 6:09.339s
[INFO] Finished at: Thu Jul 03 13:27:09 JST 2014
[INFO] Final Memory: 55M/968M
[INFO] ------------------------------------------------------------------------
OK.
To start with, I want to get the Pi example running.
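As a refresher, SparkPi approximates π by Monte Carlo: throw random points at a square and count what fraction lands inside the inscribed circle. Setting the distributed part aside, the core computation is just this (a plain-Scala sketch of the idea, not the actual example source):

```scala
import scala.util.Random

object PiSketch {
  // Sample points in the unit square; the fraction falling inside the
  // quarter circle x^2 + y^2 <= 1 approaches pi/4 as samples grow.
  def estimatePi(samples: Int, seed: Long = 42L): Double = {
    val rng = new Random(seed)
    val inside = (1 to samples).count { _ =>
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      x * x + y * y <= 1.0
    }
    4.0 * inside / samples
  }

  def main(args: Array[String]): Unit =
    println(f"Pi is roughly ${estimatePi(100000)}%.5f")
}
```

SparkPi does the same thing, except the sampling loop is split into partitions and the counts are combined with a `reduce` across executors, which is the `reduce at SparkPi.scala:35` that shows up in the logs below.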
First, YARN client mode.
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client examples/target/spark-examples_2.10-1.0.0.jar 2

※ Client mode, so the driver sets up communication from the client machine
14/07/03 13:36:00 INFO Remoting: Starting remoting
14/07/03 13:36:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@myclientnode:58567]
14/07/03 13:36:00 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@myclientnode:58567]
14/07/03 13:36:01 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/07/03 13:36:01 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
※ Gets the number of Hadoop data nodes... or rather, NodeManagers
14/07/03 13:36:02 INFO yarn.Client: Got Cluster metric info from ASM, numNodeManagers = 11
14/07/03 13:36:02 INFO yarn.Client: Preparing Local resources
※ Uploads the Spark assembly to HDFS so the data nodes can share it
14/07/03 13:36:03 INFO yarn.Client: Uploading file:/home/foo/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar to hdfs://mycluster/user/hadoop/.sparkStaging/application_1401264313901_22542/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar
※ Leaves the job to the ApplicationMaster
14/07/03 13:36:05 INFO yarn.Client: Submitting application to ASM
14/07/03 13:36:05 INFO client.YarnClientImpl: Submitted application application_1401264313901_22542 to ResourceManager at myresourcemanager/192.168.26.29:8040
14/07/03 13:36:05 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: 0 appStartTime: 1404362165270 yarnAppState: ACCEPTED
14/07/03 13:36:12 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@mydatanode01:40692/user/Executor#81931987] with ID 2
14/07/03 13:36:12 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor 2: mydatanode01 (PROCESS_LOCAL)
14/07/03 13:36:12 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1404 bytes in 2 ms
14/07/03 13:36:13 INFO storage.BlockManagerInfo: Registering block manager mydatanode01:52317 with 589.2 MB RAM
14/07/03 13:36:13 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@mydatanode02:60460/user/Executor#-1060070018] with ID 1
14/07/03 13:36:13 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor 1: mydatanode02 (PROCESS_LOCAL)
14/07/03 13:36:13 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1404 bytes in 0 ms
14/07/03 13:36:13 INFO storage.BlockManagerInfo: Registering block manager mydatanode02:38750 with 589.2 MB RAM
14/07/03 13:36:14 INFO scheduler.TaskSetManager: Finished TID 1 in 922 ms on mydatanode02 (progress: 2/2)
14/07/03 13:36:14 INFO scheduler.DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 2.516 s
14/07/03 13:36:14 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/07/03 13:36:14 INFO spark.SparkContext: Job finished: reduce at SparkPi.scala:35, took 2.653324513 s
※ The approximate value of pi
Pi is roughly 3.14024
Next, YARN cluster mode.
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster examples/target/spark-examples_2.10-1.0.0.jar 2

※ Gets the number of data nodes running NodeManagers
14/07/03 13:48:12 INFO yarn.Client: Got Cluster metric info from ASM, numNodeManagers = 11
14/07/03 13:48:12 INFO yarn.Client: Queue info ... queueName = default, queueCurrentCapacity = 0.0, queueMaxCapacity = 1.0, queueApplicationCount = 10000, queueChildQueueCount = 0
14/07/03 13:48:12 INFO yarn.Client: Preparing Local resources
※ Uploads the application jar to HDFS so the data nodes can share it
14/07/03 13:48:13 INFO yarn.Client: Uploading file:/home/foo/spark-1.0.0/examples/target/spark-examples_2.10-1.0.0.jar to hdfs://mycluster/user/hadoop/.sparkStaging/application_1401264313901_22543/spark-examples_2.10-1.0.0.jar
※ Uploads the Spark assembly to HDFS so the data nodes can share it
14/07/03 13:48:13 INFO yarn.Client: Uploading file:/home/foo/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar to hdfs://mycluster/user/hadoop/.sparkStaging/application_1401264313901_22543/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar
※ Hands processing off to the data nodes and detaches
14/07/03 13:48:15 INFO yarn.Client: Submitting application to ASM
14/07/03 13:48:15 INFO client.YarnClientImpl: Submitted application application_1401264313901_22543 to ResourceManager at myresourcemanager/192.168.26.29:8040
14/07/03 13:48:16 INFO yarn.Client: Application report from ASM:
	application identifier: application_1401264313901_22543
	appId: 22543
	clientToken: null
	appDiagnostics:
	appMasterHost: N/A
	appQueue: default
	appMasterRpcPort: 0
	appStartTime: 1404362895315
	yarnAppState: ACCEPTED
	distributedFinalState: UNDEFINED
	appTrackingUrl: myresourcemanager:8088/proxy/application_1401264313901_22543/
	appUser: hadoop
※ Periodically checks in to see how the job is doing
14/07/03 13:48:20 INFO yarn.Client: Application report from ASM:
	application identifier: application_1401264313901_22543
	appId: 22543
	clientToken: null
	appDiagnostics:
	appMasterHost: mydatanode03
	appQueue: default
	appMasterRpcPort: 0
	appStartTime: 1404362895315
	yarnAppState: RUNNING
	distributedFinalState: UNDEFINED
	appTrackingUrl: myresourcemanager:8088/proxy/application_1401264313901_22543/
	appUser: hadoop
14/07/03 13:48:25 INFO yarn.Client: Application report from ASM:
	application identifier: application_1401264313901_22543
	appId: 22543
	clientToken: null
	appDiagnostics:
	appMasterHost: mydatanode03
	appQueue: default
	appMasterRpcPort: 0
	appStartTime: 1404362895315
	yarnAppState: FINISHED
	distributedFinalState: SUCCEEDED
	appTrackingUrl:
	appUser: hadoop
Moving over to mydatanode03, where it actually ran...
cat userlogs/application_1401264313901_22543/*/stdout
Pi is roughly 3.14018
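Ssh'ing into the node and cat'ing the container logs works, but if log aggregation is enabled on the cluster, the same stdout can usually be pulled from the client machine with the yarn CLI. A sketch (this assumes yarn.log-aggregation-enable is set to true and that this Hadoop version's `yarn logs` subcommand behaves as in later 2.x releases; check `yarn logs` help output on your cluster):

```shell
# Fetch the aggregated container logs for the finished application
# and pick out the result line.
yarn logs -applicationId application_1401264313901_22543 | grep "Pi is roughly"
```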
No problems.
Next up, instead of the bundled examples, I'll try running a sample program of my own.