Getting stuck on Spark SQL with Hadoop Hive

Building Spark succeeded so easily that I got carried away.
On how building Spark was too easy - なぜか数学者にはワイン好きが多い

Then I got stuck trying to use spark-shell to access Hive, which already has a large amount of data accumulated on HDFS.

scala> val hiveContext=new org.apache.spark.sql.hive.HiveContext(sc)
<console>:12: error: object hive is not a member of package org.apache.spark.sql
       val hiveContext=new org.apache.spark.sql.hive.HiveContext(sc)

org.apache.spark.sql.hive doesn't exist...? Sure enough, when I looked inside spark-assembly-1.0.0-hadoop2.0.0-mr1-cdh4.2.0.jar, it wasn't there.
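(For reference, one way to check is to list the jar's contents and grep for the hive package; this is just a sketch, so adjust the jar path to wherever your build put it.)

> jar tf spark-assembly-1.0.0-hadoop2.0.0-mr1-cdh4.2.0.jar | grep 'org/apache/spark/sql/hive'

No output, meaning the package really is missing from the default assembly.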
Come to think of it, I went back to the Spark SQL Programming Guide in the Spark documentation, which says:

Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, it is not included in the default Spark assembly. In order to use Hive you must first run ‘SPARK_HIVE=true sbt/sbt assembly/assembly’ (or use -Phive for maven).

So I took what I had built last time with

> mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.4.0 -DskipTests clean package

and rebuilt it, this time with

> time mvn -Phive -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.4.0 -DskipTests clean package

instead.
There was no particular trouble:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .......................... SUCCESS [3.259s]
[INFO] Spark Project Core ................................ SUCCESS [3:32.873s]
[INFO] Spark Project Bagel ............................... SUCCESS [26.504s]
[INFO] Spark Project GraphX .............................. SUCCESS [1:18.451s]
[INFO] Spark Project ML Library .......................... SUCCESS [1:31.125s]
[INFO] Spark Project Streaming ........................... SUCCESS [1:41.268s]
[INFO] Spark Project Tools ............................... SUCCESS [15.847s]
[INFO] Spark Project Catalyst ............................ SUCCESS [1:36.652s]
[INFO] Spark Project SQL ................................. SUCCESS [1:23.676s]
[INFO] Spark Project Hive ................................ SUCCESS [1:40.685s]
[INFO] Spark Project REPL ................................ SUCCESS [47.597s]
[INFO] Spark Project YARN Parent POM ..................... SUCCESS [1.770s]
[INFO] Spark Project YARN Alpha API ...................... SUCCESS [39.965s]
[INFO] Spark Project Assembly ............................ SUCCESS [34.985s]
[INFO] Spark Project External Twitter .................... SUCCESS [22.567s]
[INFO] Spark Project External Kafka ...................... SUCCESS [24.920s]
[INFO] Spark Project External Flume ...................... SUCCESS [26.224s]
[INFO] Spark Project External ZeroMQ ..................... SUCCESS [24.192s]
[INFO] Spark Project External MQTT ....................... SUCCESS [21.597s]
[INFO] Spark Project Examples ............................ SUCCESS [1:03.254s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18:38.366s
[INFO] Finished at: Mon Jul 07 12:56:23 JST 2014
[INFO] Final Memory: 57M/903M
[INFO] ------------------------------------------------------------------------

real    18m40.737s
user    44m34.005s
sys     0m24.722s

and the build finished fine in about that much time.
Next I tried running it using Hadoop's YARN.

> ./bin/spark-shell --master yarn-client
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/07/07 13:00:37 INFO spark.SecurityManager: Changing view acls to: hadoop
14/07/07 13:00:37 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop)
14/07/07 13:00:37 INFO spark.HttpServer: Starting HTTP Server
14/07/07 13:00:37 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/07 13:00:37 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:43672
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4a984a6f

scala> import hiveContext._
import hiveContext._

scala> hql("SHOW TABLES")

※ (a large amount of error log output)

14/07/07 13:03:38 ERROR hive.HiveContext: 
======================
HIVE FAILURE OUTPUT
======================
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

======================
END HIVE FAILURE OUTPUT
======================
          
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

I didn't really understand what was going on, but figuring some configuration was missing, I went back and read the documentation again.

Configuration of Hive is done by placing your hive-site.xml file in conf/.

So hive-site.xml needs to be under Spark's conf/ directory as well, separate from the Hive installation directory...?
Let's try it.

> cp -pv /usr/local/hive/conf/hive-site.xml conf/
`/usr/local/hive/conf/hive-site.xml' -> `conf/hive-site.xml'

The result: no change...
※ (That said, this configuration did turn out to be necessary.)
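For reference, the part of hive-site.xml that matters here is the metastore connection settings, something along the following lines. (This is only a sketch; the hostname and user are the placeholders that appear in the error messages below, not my real values.)

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mymetastoreserver/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>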

scala> hql("SHOW TABLES")
14/07/07 13:28:00 INFO parse.ParseDriver: Parsing command: SHOW TABLES
14/07/07 13:28:00 INFO parse.ParseDriver: Parse Completed

(snip)

14/07/07 13:28:01 ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
(snip)
Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BoneCP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
(snip)
14/07/07 13:28:01 ERROR hive.HiveContext: 
======================
HIVE FAILURE OUTPUT
======================
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

======================
END HIVE FAILURE OUTPUT
======================

So it doesn't know where the mysql-connector jar is... With that in mind, I tried the option that seemed right for it.

> ./bin/spark-shell --master yarn-client --jars /usr/local/hive/lib/mysql-connector-java.jar 
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/07/07 16:24:09 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:42755
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
(snip)
14/07/07 16:24:16 INFO spark.SparkContext: Added JAR file:/usr/local/hive/lib/mysql-connector-java.jar at http://192.168.26.25:55431/jars/mysql-connector-java.jar with timestamp 1404717856329

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
(snip)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@ea51a00

scala> import hiveContext._
scala> hql("SHOW TABLES")

Caused by: java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://mymetastoreserver/metastore?createDatabaseIfNotExist=true, username = hive. Terminating connection pool. Original Exception: ------
java.sql.SQLException: No suitable driver found for jdbc:mysql://mynamenode/metastore?createDatabaseIfNotExist=true

This one I couldn't make sense of at all. Tweaking the JDBC URL in various ways didn't help either.

The final answer:

> ./bin/spark-shell --master yarn-client --jars /usr/local/hive/lib/mysql-connector-java.jar 

doesn't work; instead,

> ./bin/spark-shell --master yarn-client --driver-class-path /usr/local/hive/lib/mysql-connector-java.jar

is the right one.
In yarn-client mode the driver runs on the machine where you launched the command, so it seems you have to tell that JVM where the jar is directly via --driver-class-path. In yarn-cluster mode, you would distribute the jar with --jars instead.
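After launching spark-shell with --driver-class-path as above, the HiveContext flow from the beginning works. A minimal sketch (the table name here is just a placeholder, not one of my actual Hive tables):

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> import hiveContext._
scala> hql("SHOW TABLES").collect().foreach(println)
scala> hql("SELECT count(*) FROM some_table").collect().foreach(println)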