Getting stuck installing Spark

Mahout drops Hadoop MapReduce and moves to Spark

Apache Mahout

25 April 2014 - Goodbye MapReduce

The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them.

We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark. 

So, prompted by this announcement, I tried to install Spark, and promptly got stuck...

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1-bin-hadoop2.tgz
# tar xf spark-0.9.1-bin-hadoop2.tgz -C /usr/local
# adduser spark
# groupadd spark
# usermod spark -G spark
# chown -R spark:spark /usr/local/spark-0.9.1-bin-hadoop2/
# ln -s /usr/local/spark-0.9.1-bin-hadoop2/ /usr/local/spark
$ cd /usr/local/spark/conf/
$ cp -pv log4j.properties.template log4j.properties
$ cd ..
$ SPARK_HADOOP_VERSION=2.0.0-cdh4.4.0 SPARK_YARN=true sbt/sbt assembly
Launching sbt from sbt/sbt-launch-0.12.4.jar
Invalid or corrupt jarfile sbt/sbt-launch-0.12.4.jar

Huh?

$ ls -l sbt/
total 4
-rwxrwxr-x 1 spark spark 2145 Mar 27 14:44 sbt
-rw-r--r-- 1 spark users    0 May  9 20:41 sbt-launch-0.12.4.jar
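
The `ls` output explains the "Invalid or corrupt jarfile" error: the launcher jar is zero bytes, because the download died behind the proxy but still left an empty file behind. A small sketch (run from the Spark directory; the `sbt/` path is taken from the transcript above) for clearing out such empty jars before retrying:

```shell
# A failed fetch often leaves a zero-byte jar behind, which
# "java -jar" then rejects as invalid or corrupt.
# Remove any empty jars under sbt/ so the launcher re-downloads them.
if [ -d sbt ]; then
  find sbt -name '*.jar' -size 0 -print -delete
fi
```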

Ah, I had simply forgotten to set up the proxy myself. Let's retry.

$ rm sbt/sbt-launch-0.12.4.jar
$ export http_proxy=http://myproxy:8080
$ export https_proxy=https://myproxy:8080
$ time SPARK_HADOOP_VERSION=2.0.0-cdh4.4.0 SPARK_YARN=true sbt/sbt assembly
Attempting to fetch sbt
######################################################################## 100.0%
Launching sbt from sbt/sbt-launch-0.12.4.jar
[info] Loading project definition from /usr/local/spark-0.9.1-bin-hadoop2/project/project
[info] Updating {file:/usr/local/spark-0.9.1-bin-hadoop2/project/project/}default-7e9fb9...
[info] Resolving org.scala-sbt#precompiled-2_10_1;0.12.4 ...
[info] Done updating.
[info] Compiling 1 Scala source to /usr/local/spark-0.9.1-bin-hadoop2/project/project/target/scala-2.9.2/sbt-0.12/classes...
[info] Loading project definition from /usr/local/spark-0.9.1-bin-hadoop2/project
[info] Updating {file:/usr/local/spark-0.9.1-bin-hadoop2/project/}plugins...
[info] Resolving org.scala-sbt#precompiled-2_10_1;0.12.4 ...
[info] Done updating.
[info] Compiling 1 Scala source to /usr/local/spark-0.9.1-bin-hadoop2/project/target/scala-2.9.2/sbt-0.12/classes...
[info] Set current project to root (in build file:/usr/local/spark-0.9.1-bin-hadoop2/)
[info] Updating {file:/usr/local/spark-0.9.1-bin-hadoop2/}core...
[info] Resolving org.apache.hadoop#hadoop-client;2.0.0-cdh4.4.0 ...
[error] Server access Error: Connection timed out url=https://oss.sonatype.org/content/repositories/snapshots/org/apache/hadoop/hadoop-client/2.0.0-cdh4.4.0/hadoop-client-2.0.0-cdh4.4.0.pom
[error] Server access Error: Connection timed out url=https://oss.sonatype.org/service/local/staging/deploy/maven2/org/apache/hadoop/hadoop-client/2.0.0-cdh4.4.0/hadoop-client-2.0.0-cdh4.4.0.pom
[error] Server access Error: Connection timed out url=https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-client/2.0.0-cdh4.4.0/hadoop-client-2.0.0-cdh4.4.0.pom
[warn]        module not found: org.apache.hadoop#hadoop-client;2.0.0-cdh4.4.0
[warn] ==== local: tried
[warn]   /home/spark/.ivy2/local/org.apache.hadoop/hadoop-client/2.0.0-cdh4.4.0/ivys/ivy.xml
[warn] ==== Local Maven Repo: tried
[warn] ==== sonatype-snapshots: tried
[warn]   https://oss.sonatype.org/content/repositories/snapshots/org/apache/hadoop/hadoop-client/2.0.0-cdh4.4.0/hadoop-client-2.0.0-cdh4.4.0.pom
[warn] ==== sonatype-staging: tried
[warn]   https://oss.sonatype.org/service/local/staging/deploy/maven2/org/apache/hadoop/hadoop-client/2.0.0-cdh4.4.0/hadoop-client-2.0.0-cdh4.4.0.pom
[warn] ==== JBoss Repository: tried
[warn]   http://repository.jboss.org/nexus/content/repositories/releases/org/apache/hadoop/hadoop-client/2.0.0-cdh4.4.0/hadoop-client-2.0.0-cdh4.4.0.pom
[warn] ==== Cloudera Repository: tried
[warn]   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-client/2.0.0-cdh4.4.0/hadoop-client-2.0.0-cdh4.4.0.pom
[warn] ==== public: tried
[warn]   http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.0.0-cdh4.4.0/hadoop-client-2.0.0-cdh4.4.0.pom
[info] Resolving org.apache.derby#derby;10.4.2.0 ...
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: org.apache.hadoop#hadoop-client;2.0.0-cdh4.4.0: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
sbt.ResolveException: unresolved dependency: org.apache.hadoop#hadoop-client;2.0.0-cdh4.4.0: not found
        at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:214)
        at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:122)
        at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:121)
        at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:117)

What? It can't find the Hadoop client, the whole point of this build?
The error says Connection timed out url=https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-client/2.0.0-cdh4.4.0/hadoop-client-2.0.0-cdh4.4.0.pom, so let's use wget to check whether the file is really missing.

$ wget https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-client/2.0.0-cdh4.4.0/hadoop-client-2.0.0-cdh4.4.0.pom -O /dev/null
--2014-05-10 16:05:48--  https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-client/2.0.0-cdh4.4.0/hadoop-client-2.0.0-cdh4.4.0.pom
Connecting to 10.129.80.73:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 9831 (9.6K) [application/x-maven-pom+xml]
Saving to: “/dev/null”

I can download it just fine!!!!
I wondered whether sbt simply wasn't picking up the proxy settings, but the sbt documentation says:

HTTP/HTTPS/FTP Proxy

On Unix, sbt will pick up any HTTP, HTTPS, or FTP proxy settings from the standard http_proxy, https_proxy, and ftp_proxy environment variables. If you are behind a proxy requiring authentication, your sbt script must also pass flags to set the http.proxyUser and http.proxyPassword properties for HTTP, ftp.proxyUser and ftp.proxyPassword properties for FTP, or https.proxyUser and https.proxyPassword properties for HTTPS.
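
As an aside on the quote above: many sbt wrapper scripts forward `JAVA_OPTS` (or `SBT_OPTS`) straight to the JVM, so the proxy system properties can often be supplied without editing the script at all. This is a hedged sketch only; whether Spark's bundled `sbt/sbt` honors these variables is an assumption, and the host and port are the ones from this post:

```shell
# Assumption: the sbt wrapper script passes JAVA_OPTS through to the JVM.
# For an authenticating proxy, -Dhttp.proxyUser / -Dhttp.proxyPassword
# (and the https.* equivalents) would be added the same way.
export JAVA_OPTS="-Dhttp.proxyHost=myproxy -Dhttp.proxyPort=8080 \
-Dhttps.proxyHost=myproxy -Dhttps.proxyPort=8080"
```

After exporting, you would rerun `sbt/sbt assembly` as before.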

So the documentation says sbt reads the environment variables by default. Just in case, I also added -Dhttp.proxyHost=myproxy and friends to the java invocation at the bottom of sbt/sbt, but that did not help either.

$ emacs sbt/sbt

printf "Launching sbt from ${JAR}\n"
java \
  -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
  -jar ${JAR} -Dhttp.proxyHost=10.129.80.73 -Dhttp.proxyPort=8080 -Dhttps.proxyHost=10.129.80.73 -Dhttps.proxyPort=8080\
  "$@"

#java \
#  -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
#  -jar ${JAR} \
#  "$@"
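
One thing worth noting about the edit above: `java` treats everything after `-jar ${JAR}` as arguments to the application, not to the JVM, so `-D` system properties placed after `-jar` never reach the JVM at all. That may well be why the flags had no effect. A sketch of the same invocation with the properties moved before `-jar` (a fragment of the sbt/sbt script, not a standalone command):

```shell
java \
  -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
  -Dhttp.proxyHost=10.129.80.73 -Dhttp.proxyPort=8080 \
  -Dhttps.proxyHost=10.129.80.73 -Dhttps.proxyPort=8080 \
  -jar ${JAR} \
  "$@"
```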

I have a feeling this one is going to be a tough fight.
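
One more thing I would double-check: by convention, `https_proxy` also points at an `http://` URL. The proxy itself is usually reached over plain HTTP, and handles `https://` targets via a CONNECT tunnel; some clients silently tolerate an `https://` proxy URL while others fail. A sketch of the safer setup (same `myproxy` host as above):

```shell
# Both variables point at the proxy's plain-HTTP endpoint;
# the scheme of the proxy URL itself is http even for https targets.
export http_proxy=http://myproxy:8080
export https_proxy=http://myproxy:8080
```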