A review of installing Sqoop and getting data into Hive from a TSV file and from MySQL

First, put Sqoop in place. In the prompts below, # is root and $ is a regular user.

# tar xf sqoop-1.4.3-cdh4.5.0.tar.gz -C /usr/local/
# chown -R hadoop:hadoop /usr/local/sqoop-1.4.3-cdh4.5.0/
# ln -s /usr/local/sqoop-1.4.3-cdh4.5.0 /usr/local/sqoop
# emacs /etc/bashrc

export PATH=$PATH:/usr/local/sqoop/bin
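 #Reload the shell config so the new PATH takes effect (or just open a new shell):
$ source /etc/bashrc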

 #Oops. It complains.
$ sqoop help
Error: /usr/lib/hadoop does not exist!
Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.

# emacs /etc/bashrc
export HADOOP_COMMON_HOME=/usr/local/hadoop

$ sudo su hadoop
$ sqoop help
14/04/30 16:06:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.3-cdh4.5.0
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

It works.

Next, put Hive in place. A few things have changed from before; I'll write them up later if I don't forget.

# wget http://archive.cloudera.com/cdh5/cdh/5/hive-0.12.0-cdh5.0.0.tar.gz
# tar xf hive-0.12.0-cdh5.0.0.tar.gz -C /usr/local/
# chown -R hadoop:hadoop /usr/local/hive-0.12.0-cdh5.0.0/
# rm /usr/local/hive
# ln -s /usr/local/hive-0.12.0-cdh5.0.0/ /usr/local/hive

# emacs /etc/bashrc

export PATH=$PATH:/usr/local/sqoop/bin:/usr/local/hive/bin


# ls -l /usr/local/hive/bin/schematool 
-rwxr-xr-x. 1 hadoop hadoop 884 Mar 28 15:14 /usr/local/hive/bin/schematool

$ cp -pv /usr/local/hive/conf/hive-default.xml.template /usr/local/hive/conf/hive-site.xml
`/usr/local/hive/conf/hive-default.xml.template' -> `/usr/local/hive/conf/hive-site.xml'

 #These are a matter of taste.
<property>
  <name>hive.cli.print.header</name>
<!--
  <value>false</value>
-->
  <value>true</value>
  <description>Whether to print the names of the columns in query output.</description>
</property>

<property>
  <name>hive.cli.print.current.db</name>
<!--
  <value>false</value>
-->
  <value>true</value>
  <description>Whether to include the current database in the hive prompt.</description>
</property>

 #We're running a local metastore, so leave this empty.
<property>
  <name>hive.metastore.uris</name>
  <value></value>
  <description>Thrift uri for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
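 #For reference, a remote metastore would point this at a Thrift URI instead (the host below is this post's placeholder; 9083 is the usual metastore port):
<value>thrift://metastore-server:9083</value>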

 #Change this one: stand up a MySQL server for the metastore and put its address here.
<property>
  <name>javax.jdo.option.ConnectionURL</name>
<!--
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
-->
  <value>jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

 #Change this too; it just needs to match the metastore's database type.
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
<!--
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
-->
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

 #These should match the user you set up in MySQL.
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
<!--
  <value>APP</value>
-->
  <value>hive_user</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
<!--
  <value>mine</value>
-->
  <value>hive_password</value>
  <description>password to use against metastore database</description>
</property>

<property>
  <name>hive.exec.compress.output</name>
<!--
  <value>false</value>
-->
  <value>true</value>
  <description>This controls whether the final outputs of a query (to a local/hdfs file or a hive table) is compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress*</description>
</property>

<property>
  <name>hive.exec.compress.intermediate</name>
<!--
  <value>false</value>
-->
  <value>true</value>
  <description>This controls whether intermediate files produced by hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress*</description>
</property>

<property>
  <name>hive.exec.parallel</name>
<!--
  <value>false</value>
-->
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>

 #Allow dynamic partitions to be created freely.
<property>
  <name>hive.exec.dynamic.partition.mode</name>
<!--
  <value>strict</value>
-->
  <value>nonstrict</value>
  <description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
</property>
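 #With nonstrict mode, every partition key can be resolved at INSERT time (assuming hive.exec.dynamic.partition is enabled); a sketch with hypothetical tables logs and staging_logs:
$ hive --database=test -e "INSERT OVERWRITE TABLE logs PARTITION (dt) SELECT id, name, dt FROM staging_logs"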

Next, set up the MySQL side.

# wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.30.tar.gz
# tar xvf mysql-connector-java-5.1.30.tar.gz mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar
# cp -pv mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar /usr/local/hive/lib/
`mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar' -> `/usr/local/hive/lib/mysql-connector-java-5.1.30-bin.jar'
# rm -fR mysql-connector-java-5.1.30
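 #schematool can only connect if the metastore database and user already exist on the MySQL side; a rough sketch, using the placeholder names from hive-site.xml above:
$ mysql -u root --password=root_password -e "CREATE DATABASE IF NOT EXISTS metastore"
$ mysql -u root --password=root_password -e "GRANT ALL PRIVILEGES ON metastore.* TO 'hive_user'@'%' IDENTIFIED BY 'hive_password'"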

 #Recent releases (CDH 5.0.0, for example) recommend initializing the metastore with schematool rather than piping an SQL script into MySQL like before, so let's follow that.
$ schematool -dbType mysql -info
Metastore connection URL:        jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       hive_user
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
*** schemaTool failed ***

 #We get an error...

$ mysql -u root --password=root_password -e "SELECT host, user FROM mysql.user WHERE user='hive_user'"
+-----------+-----------+
| host      | user      |
+-----------+-----------+
| %         | hive_user |
| localhost | hive_user |
+-----------+-----------+

$ schematool -verbose -dbType mysql -initSchema -dryRun
Metastore connection URL:        jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       hive_user
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
        at org.apache.hive.beeline.HiveSchemaTool.getConnectionToMetastore(HiveSchemaTool.java:187)
        at org.apache.hive.beeline.HiveSchemaTool.testConnectionToMetastore(HiveSchemaTool.java:150)
        at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:279)
        at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:265)
        at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:512)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.sql.SQLException: Access denied for user 'hive_user'@'hive_server.example.com' (using password: YES)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1084)

$ mysql -h metastore-server -u hive_user --password=hive_password
ERROR 1045 (28000): Access denied for user 'hive_user'@'hive_server' (using password: YES)
$ mysql -h localhost -u hive_user --password=hive_password
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 30
Server version: 5.5.5-10.0.10-MariaDB Source distribution

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> quit

 #Ah. Login works from localhost but not remotely: that's the anonymous-user behavior described in the MySQL documentation. The anonymous entry ''@'hive_server.example.com' has a more specific host than 'hive_user'@'%', so it matches first and the remote login is denied.

$ mysql -u root --password=MariaDB -e "SELECT host, user FROM mysql.user WHERE user=''"
+---------------------------+------+
| host                      | user |
+---------------------------+------+
| localhost                 |      |
| hive_server.example.com   |      |
+---------------------------+------+

 #Drop the anonymous passwordless user that's getting in the way.
$ mysql -u root --password=root_password -e "DROP USER ''@'hive_server.example.com'"
$ mysql -u root --password=root_password -e "SELECT host, user FROM mysql.user WHERE user=''"
+-----------+------+
| host      | user |
+-----------+------+
| localhost |      |
+-----------+------+

 #It works now.
$ schematool -verbose -dbType mysql -initSchema -dryRun
Metastore connection URL:        jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       hive_user
Starting metastore schema initialization to 0.12.0
Initialization script hive-schema-0.12.0.mysql.sql
schemaTool completeted

 #Metastore initialization works too.
$ schematool -verbose -dbType mysql -initSchema
Starting metastore schema initialization to 0.12.0
Initialization script hive-schema-0.12.0.mysql.sql
Connecting to jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true
Connected to: MySQL (version 5.5.5-10.0.10-MariaDB)
Driver: MySQL Connector Java (version mysql-connector-java-5.1.30 ( Revision: alexander.soklakov@oracle.com-20140310090514-8xt1yoht5ksg2e7c ))
Transaction isolation: TRANSACTION_READ_COMMITTED
Beeline version 0.12.0-cdh5.0.0 by Apache Hive
0: jdbc:mysql://metastore-server/metastore> !autocommit on
Autocommit status: true
0: jdbc:mysql://metastore-server/metastore> /*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */
No rows affected (0.002 seconds)
0: jdbc:mysql://metastore-server/metastore> !closeall
Closing: com.mysql.jdbc.JDBC4Connection
beeline> Initialization script completed
schemaTool completeted

That seems to be working, so let's create a database.

 #Confirm there's no hive_user directory in HDFS yet.
$ hdfs dfs -ls /user
drwxr-xr-x   - hadoop supergroup          0 2014-04-30 15:57 /user/hadoop

 #Create a database in Hive, and...
$ hive -e "CREATE DATABASE test"
OK
Time taken: 3.74 seconds

 #...the directory is created automatically.
$ hdfs dfs -ls /user/hive_user/warehouse
drwxr-xr-x   - hadoop supergroup          0 2014-04-30 18:19 /user/hive_user/warehouse/test.db

 #Create a table too, defined so TSV data can be loaded straight in.
$ hive --database=test -e "CREATE TABLE users (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'"
OK
Time taken: 2.933 seconds
OK
Time taken: 0.939 seconds
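 #The column layout can be double-checked with DESCRIBE:
$ hive --database=test -e "DESCRIBE users"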

 #Prepare a TSV file.
$ echo -e "2\talice" > /tmp/users.tsv
$ echo -e "5\tbob" >> /tmp/users.tsv
$ cat /tmp/users.tsv 
2       alice
5       bob

 #Load it in from the TSV file.
$ hive --database=test -e "LOAD DATA LOCAL INPATH '/tmp/users.tsv' INTO TABLE users"
OK
Time taken: 2.842 seconds
Copying data from file:/tmp/users.tsv
Copying file: file:/tmp/users.tsv
Loading data to table test.users
Table test.users stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 14, raw_data_size: 0]
OK
Time taken: 1.354 seconds
 #Verified OK (the id/name line is the header enabled by hive.cli.print.header).
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.113 seconds
OK
id      name
2       alice
5       bob
Time taken: 1.175 seconds, Fetched: 2 row(s)

 #Next, have Sqoop create the Hive table automatically from MySQL.
$ mysql -h metastore-server -u hive_user --password=hive_password test -e "CREATE TABLE users (id INT, name CHAR(10))"

$ mysql -u hive_user --password=hive_password test -e "SELECT COUNT(*) FROM users"
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+

 #Prepare some dummy data.
$ mysql -u hive_user --password=hive_password test -e "INSERT INTO users SET id=2, name='alice'"                                                                  
$ mysql -u hive_user --password=hive_password test -e "INSERT INTO users SET id=5, name='bob'"                                                                    
$ mysql -u hive_user --password=hive_password test -e "SELECT COUNT(*) FROM users"                                                                                
+----------+
| COUNT(*) |
+----------+
|        2 |
+----------+
 #Check.
$ mysql -u hive_user --password=hive_password test -e "SELECT * FROM users"
+------+-------+
| id   | name  |
+------+-------+
|    2 | alice |
|    5 | bob   |
+------+-------+

The rows we loaded with LOAD DATA are still in Hive.
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 2.974 seconds
OK
id      name
2       alice
5       bob
Time taken: 1.479 seconds, Fetched: 2 row(s)

$ hdfs dfs -ls /user/hive_user/warehouse/test.db/users
Found 1 items
-rwxr-xr-x   3 hadoop supergroup         14 2014-04-30 18:27 /user/hive_user/warehouse/test.db/users/users.tsv


$ hdfs dfs -cat /user/hive_user/warehouse/test.db/users/users.tsv                     
2       alice
5       bob

 #Delete the file itself.
$ hdfs dfs -rm /user/hive_user/warehouse/test.db/users/users.tsv
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/users.tsv' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
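 #To bypass the trash and delete immediately, -rm also takes -skipTrash:
$ hdfs dfs -rm -skipTrash /user/hive_user/warehouse/test.db/users/users.tsv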


$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.181 seconds
OK
id      name
Time taken: 1.06 seconds

 #Gone.


 #Copy the JDBC driver into Sqoop's lib as well.
$ cp -pv /usr/local/hive/lib/mysql-connector-java-5.1.30-bin.jar /usr/local/sqoop/lib/

 #Sqoop!
$ sqoop import --connect "jdbc:mysql://mysql-server:3306/test?zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false" --username hive_user --password hive_password --query "SELECT * FROM users WHERE \$CONDITIONS" --split-by id --target-dir users --hive-table users --hive-database test --hive-import --hive-drop-import-delims --fields-terminated-by "\001"
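 #For reference, the moving parts of that command:
 # --query ... WHERE \$CONDITIONS : Sqoop substitutes a per-mapper range predicate for $CONDITIONS
 # --split-by id                  : the column used to split the query across mappers
 # --hive-import                  : load the result into Hive once the HDFS import finishes
 # --fields-terminated-by "\001"  : write Hive's default field delimiter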

14/05/02 12:47:44 INFO hive.HiveImport: OK
14/05/02 12:47:44 INFO hive.HiveImport: Time taken: 3.677 seconds
14/05/02 12:47:45 INFO hive.HiveImport: Loading data to table test.users
14/05/02 12:47:45 INFO hive.HiveImport: Table test.users stats: [num_partitions: 0, num_files: 5, num_rows: 0, total_size: 46, raw_data_size: 0]
14/05/02 12:47:45 INFO hive.HiveImport: OK
14/05/02 12:47:45 INFO hive.HiveImport: Time taken: 1.06 seconds
14/05/02 12:47:45 INFO hive.HiveImport: Hive import complete.
14/05/02 12:47:45 INFO hive.HiveImport: Export directory is empty, removing it.

$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.005 seconds
OK
id      name
NULL    NULL
NULL    NULL
Time taken: 1.176 seconds, Fetched: 2 row(s)

 #Huh?

$ hdfs dfs -text /user/hive_user/warehouse/test.db/users/part-m-00000.deflate
2alice

 #The values are in there.

 #Ah, it's because we created the table like this. The delimiter.
$ hive --database=test -e "CREATE TABLE users (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'"


 #Since Sqoop is using the default \001 anyway, change the table to match.
$ hive --database=test -e "ALTER TABLE users SET SERDEPROPERTIES ('field.delim' = '\001')"
OK
Time taken: 3.158 seconds
OK
Time taken: 0.391 seconds
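 #Alternatively, you could just DROP the table here and let --hive-import recreate it with its own delimiters:
$ hive --database=test -e "DROP TABLE users"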

Delete the files again...

$ hdfs dfs -rm /user/hive_user/warehouse/test.db/users/*
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/_SUCCESS' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00000.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00001.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00002.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00003.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current


 #Confirm they're gone...

$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.097 seconds
OK
id      name
Time taken: 1.103 seconds


 #Sqoop again!
$ sqoop import --connect "jdbc:mysql://mysql-server:3306/test?zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false" --username hive_user --password hive_password --query "SELECT * FROM users WHERE \$CONDITIONS" --split-by id --target-dir users --hive-table users --hive-database test --hive-import --hive-drop-import-delims --fields-terminated-by "\001"


14/05/02 13:08:37 INFO hive.HiveImport: OK
14/05/02 13:08:37 INFO hive.HiveImport: Time taken: 3.598 seconds
14/05/02 13:08:37 INFO hive.HiveImport: Loading data to table test.users
14/05/02 13:08:38 INFO hive.HiveImport: Table test.users stats: [num_partitions: 0, num_files: 5, num_rows: 0, total_size: 46, raw_data_size: 0]
14/05/02 13:08:38 INFO hive.HiveImport: OK
14/05/02 13:08:38 INFO hive.HiveImport: Time taken: 1.032 seconds
14/05/02 13:08:38 INFO hive.HiveImport: Hive import complete.
14/05/02 13:08:38 INFO hive.HiveImport: Export directory is empty, removing it.

 #Check it. Phew. The table was created automatically and the data is there.
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.245 seconds
OK
id      name
2       alice
5       bob
Time taken: 1.242 seconds, Fetched: 2 row(s)

And the files are here.

$ hdfs dfs -ls /user/hive_user/warehouse/test.db/users/
Found 5 items
-rwxr-xr-x   3 hadoop supergroup          0 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/_SUCCESS
-rwxr-xr-x   3 hadoop supergroup         16 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00000.deflate
-rwxr-xr-x   3 hadoop supergroup          8 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00001.deflate
-rwxr-xr-x   3 hadoop supergroup          8 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00002.deflate
-rwxr-xr-x   3 hadoop supergroup         14 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00003.deflate


 #If you see an error like this, make sure HIVE_HOME is set properly.
14/05/02 13:46:22 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Cannot run program "hive": error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
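
 #For example, alongside the other exports in /etc/bashrc:
export HIVE_HOME=/usr/local/hive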