Getting Stuck on Spark SQL with Hadoop Hive
Building Spark went so smoothly that I got a little carried away.
Previous post: On how building Spark was too easy - なぜか数学者にはワイン好きが多い
So I tried using spark-shell to access our Hive, which already had a large amount of data accumulated on HDFS, and promptly got stuck.
scala> val hiveContext=new org.apache.spark.sql.hive.HiveContext(sc)
<console>:12: error: object hive is not a member of package org.apache.spark.sql
       val hiveContext=new org.apache.spark.sql.hive.HiveContext(sc)
org.apache.spark.sql.hive doesn't exist...? I looked inside spark-assembly-1.0.0-hadoop2.0.0-mr1-cdh4.2.0.jar and, sure enough, it isn't there.
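For the record, simply listing the jar contents is an easy way to check (the path to the assembly jar depends on where your build put it; here it prints nothing because the hive classes really are absent):

> jar tf spark-assembly-1.0.0-hadoop2.0.0-mr1-cdh4.2.0.jar | grep 'org/apache/spark/sql/hive'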
Then I remembered the documentation and checked the Spark SQL page (Spark SQL and DataFrames), which says:
Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, it is not included in the default Spark assembly. In order to use Hive you must first run ‘SPARK_HIVE=true sbt/sbt assembly/assembly’ (or use -Phive for maven).
So, whereas last time I had built it with
> mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.4.0 -DskipTests clean package
this time I rebuilt it with
> time mvn -Phive -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.4.0 -DskipTests clean package
adding -Phive to the options.
With no particular trouble,
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [3.259s]
[INFO] Spark Project Core ................................ SUCCESS [3:32.873s]
[INFO] Spark Project Bagel ............................... SUCCESS [26.504s]
[INFO] Spark Project GraphX .............................. SUCCESS [1:18.451s]
[INFO] Spark Project ML Library .......................... SUCCESS [1:31.125s]
[INFO] Spark Project Streaming ........................... SUCCESS [1:41.268s]
[INFO] Spark Project Tools ............................... SUCCESS [15.847s]
[INFO] Spark Project Catalyst ............................ SUCCESS [1:36.652s]
[INFO] Spark Project SQL ................................. SUCCESS [1:23.676s]
[INFO] Spark Project Hive ................................ SUCCESS [1:40.685s]
[INFO] Spark Project REPL ................................ SUCCESS [47.597s]
[INFO] Spark Project YARN Parent POM ..................... SUCCESS [1.770s]
[INFO] Spark Project YARN Alpha API ...................... SUCCESS [39.965s]
[INFO] Spark Project Assembly ............................ SUCCESS [34.985s]
[INFO] Spark Project External Twitter .................... SUCCESS [22.567s]
[INFO] Spark Project External Kafka ...................... SUCCESS [24.920s]
[INFO] Spark Project External Flume ...................... SUCCESS [26.224s]
[INFO] Spark Project External ZeroMQ ..................... SUCCESS [24.192s]
[INFO] Spark Project External MQTT ....................... SUCCESS [21.597s]
[INFO] Spark Project Examples ............................ SUCCESS [1:03.254s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18:38.366s
[INFO] Finished at: Mon Jul 07 12:56:23 JST 2014
[INFO] Final Memory: 57M/903M
[INFO] ------------------------------------------------------------------------

real    18m40.737s
user    44m34.005s
sys     0m24.722s
the build finished without incident in roughly 18 and a half minutes.
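As a quick sanity check, listing the new assembly jar the same way should now show the hive classes (the jar name below is a placeholder; use whatever the -Phive build actually produced in your environment):

> jar tf <the new assembly jar> | grep 'org/apache/spark/sql/hive/HiveContext'
org/apache/spark/sql/hive/HiveContext.class
...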
I then tried running it on Hadoop via YARN.
> ./bin/spark-shell --master yarn-client
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/07/07 13:00:37 INFO spark.SecurityManager: Changing view acls to: hadoop
14/07/07 13:00:37 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop)
14/07/07 13:00:37 INFO spark.HttpServer: Starting HTTP Server
14/07/07 13:00:37 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/07 13:00:37 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:43672
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4a984a6f

scala> import hiveContext._
import hiveContext._

scala> hql("SHOW TABLES")
(a flood of error logs)
14/07/07 13:03:38 ERROR hive.HiveContext: ====================== HIVE FAILURE OUTPUT ======================
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
====================== END HIVE FAILURE OUTPUT ======================
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
I couldn't make sense of this, but figuring some configuration must be missing, I went back to the documentation.
Configuration of Hive is done by placing your hive-site.xml file in conf/.
So, separate from the Hive installation directory, a hive-site.xml is also needed under conf/ in the Spark installation directory...?
Let's try it.
> cp -pv /usr/local/hive/conf/hive-site.xml conf/
`/usr/local/hive/conf/hive-site.xml' -> `conf/hive-site.xml'
The result: no change...
(Note: this setting did turn out to be necessary, though.)
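For context, the part of hive-site.xml that matters here is the JDBC connection to the metastore. Roughly, it looks like the sketch below; the property names are standard Hive ones, the URL and user are the values that show up in the error logs further down, and your values will of course differ:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mymetastoreserver/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>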
scala> hql("SHOW TABLES") 14/07/07 13:28:00 INFO parse.ParseDriver: Parsing command: SHOW TABLES 14/07/07 13:28:00 INFO parse.ParseDriver: Parse Completed (中略) 14/07/07 13:28:01 ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient (中略) Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BoneCP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. (中略) 14/07/07 13:28:01 ERROR hive.HiveContext: ====================== HIVE FAILURE OUTPUT ====================== FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient ====================== END HIVE FAILURE OUTPUT ======================
So it doesn't know where the mysql-connector jar is... I tried the option that looked right for the job.
> ./bin/spark-shell --master yarn-client --jars /usr/local/hive/lib/mysql-connector-java.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/07/07 16:24:09 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:42755
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
(snip)
14/07/07 16:24:16 INFO spark.SparkContext: Added JAR file:/usr/local/hive/lib/mysql-connector-java.jar at http://192.168.26.25:55431/jars/mysql-connector-java.jar with timestamp 1404717856329

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
(snip)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@ea51a00

scala> import hiveContext._

scala> hql("SHOW TABLES")
Caused by: java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://mymetastoreserver/metastore?createDatabaseIfNotExist=true, username = hive. Terminating connection pool. Original Exception: ------
java.sql.SQLException: No suitable driver found for jdbc:mysql://mynamenode/metastore?createDatabaseIfNotExist=true
This one I couldn't figure out at all. Tweaking the JDBC URL in various ways didn't help either.
The final answer.
> ./bin/spark-shell --master yarn-client --jars /usr/local/hive/lib/mysql-connector-java.jar
is wrong;
> ./bin/spark-shell --master yarn-client --driver-class-path /usr/local/hive/lib/mysql-connector-java.jar
is correct.
In yarn-client mode the driver runs on the machine where you launched the command, so it seems you have to point the JVM at the jar directly with --driver-class-path. In yarn-cluster mode, you would distribute it with --jars instead.
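Incidentally, if you don't want to type --driver-class-path every time, the same thing can presumably be set once in conf/spark-defaults.conf via the spark.driver.extraClassPath property. I haven't verified this on this exact setup, so treat it as a sketch:

spark.driver.extraClassPath  /usr/local/hive/lib/mysql-connector-java.jar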