A review: installing Sqoop and loading Hive from TSV files and from MySQL
First, let's put Sqoop in place. In the prompts below, # is root and $ is a regular user.
# tar xf sqoop-1.4.3-cdh4.5.0.tar.gz -C /usr/local/
# chown -R hadoop:hadoop /usr/local/sqoop-1.4.3-cdh4.5.0/
# ln -s /usr/local/sqoop-1.4.3-cdh4.5.0 /usr/local/sqoop
# emacs /etc/bashrc
export PATH=$PATH:/usr/local/sqoop/bin

# Oops -- it complains.
$ sqoop help
Error: /usr/lib/hadoop does not exist!
Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.

# emacs /etc/bashrc
export HADOOP_COMMON_HOME=/usr/local/hadoop

$ sudo su hadoop
$ sqoop help
14/04/30 16:06:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.3-cdh4.5.0
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

That worked.
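A quick sanity check after reloading the shell config (a sketch; sqoop version is one of the subcommands listed in the help output above):

$ source /etc/bashrc
$ echo $HADOOP_COMMON_HOME     # should print /usr/local/hadoop
$ sqoop version                # should print the 1.4.3-cdh4.5.0 banner with no errors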
Next, put Hive in place. A few things have changed from earlier versions; I'll write about that later if I remember.
# wget http://archive.cloudera.com/cdh5/cdh/5/hive-0.12.0-cdh5.0.0.tar.gz
# tar xf hive-0.12.0-cdh5.0.0.tar.gz -C /usr/local/
# chown -R hadoop:hadoop /usr/local/hive-0.12.0-cdh5.0.0/
# rm /usr/local/hive
# ln -s /usr/local/hive-0.12.0-cdh5.0.0/ /usr/local/hive
# emacs /etc/bashrc
export PATH=$PATH:/usr/local/sqoop/bin:/usr/local/hive/bin
# ls -l /usr/local/hive/bin/schematool
-rwxr-xr-x. 1 hadoop hadoop 884 Mar 28 15:14 /usr/local/hive/bin/schematool

$ cp -pv /usr/local/hive/conf/hive-default.xml.template /usr/local/hive/conf/hive-site.xml
`/usr/local/hive/conf/hive-default.xml.template' -> `/usr/local/hive/conf/hive-site.xml'

# These two are a matter of taste.
<property>
  <name>hive.cli.print.header</name>
  <!-- <value>false</value> -->
  <value>true</value>
  <description>Whether to print the names of the columns in query output.</description>
</property>

<property>
  <name>hive.cli.print.current.db</name>
  <!-- <value>false</value> -->
  <value>true</value>
  <description>Whether to include the current database in the hive prompt.</description>
</property>

# We run in local mode, so leave this one alone.
<property>
  <name>hive.metastore.uris</name>
  <value></value>
  <description>Thrift uri for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>

# Change this one: stand up a MySQL server for the metastore and put its address here.
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- <value>jdbc:derby:;databaseName=metastore_db;create=true</value> -->
  <value>jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

# Change this one too -- it just has to match the type of database backing the metastore.
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <!-- <value>org.apache.derby.jdbc.EmbeddedDriver</value> -->
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

# These must match the credentials configured on the MySQL side.
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <!-- <value>APP</value> -->
  <value>hive_user</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <!-- <value>mine</value> -->
  <value>hive_password</value>
  <description>password to use against metastore database</description>
</property>

<property>
  <name>hive.exec.compress.output</name>
  <!-- <value>false</value> -->
  <value>true</value>
  <description> This controls whether the final outputs of a query (to a local/hdfs file or a hive table) is compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress* </description>
</property>

<property>
  <name>hive.exec.compress.intermediate</name>
  <!-- <value>false</value> -->
  <value>true</value>
  <description> This controls whether intermediate files produced by hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress* </description>
</property>

<property>
  <name>hive.exec.parallel</name>
  <!-- <value>false</value> -->
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>

# Allow dynamic partitions to be created freely.
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <!-- <value>strict</value> -->
  <value>nonstrict</value>
  <description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
</property>
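Note that hive-site.xml only tells Hive how to reach the metastore; the database account itself still has to exist on the MySQL side. A minimal sketch of that setup, assuming the metastore / hive_user / hive_password names from the config above (the GRANT scope is up to you):

$ mysql -h metastore-server -u root -p <<'SQL'
-- The database itself could also be left to createDatabaseIfNotExist=true.
CREATE DATABASE IF NOT EXISTS metastore;
CREATE USER 'hive_user'@'%' IDENTIFIED BY 'hive_password';
GRANT ALL PRIVILEGES ON metastore.* TO 'hive_user'@'%';
FLUSH PRIVILEGES;
SQL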
Next, configure MySQL.
# wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.30.tar.gz
# tar xvf mysql-connector-java-5.1.30.tar.gz mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar
mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar
# cp -pv mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar /usr/local/hive/lib/
`mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar' -> `/usr/local/hive/lib/mysql-connector-java-5.1.30-bin.jar'
# rm -fR mysql-connector-java-5.1.30

# Recent releases -- CDH 5.0.0, for example -- recommend initializing the metastore with schematool
# instead of piping the schema SQL into MySQL as before, so let's follow that.
$ schematool -dbType mysql -info
Metastore connection URL:        jdbc:mysql://hive_server/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       hive_user
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
*** schemaTool failed ***

# An error comes up... but the user does exist:
$ mysql -u root --password=root_password -e "SELECT host, user FROM mysql.user WHERE user='hive_user'"
+-----------+-----------+
| host      | user      |
+-----------+-----------+
| %         | hive_user |
| localhost | hive_user |
+-----------+-----------+

$ schematool -verbose -dbType mysql -initSchema -dryRun
Metastore connection URL:        jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       hive_user
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
        at org.apache.hive.beeline.HiveSchemaTool.getConnectionToMetastore(HiveSchemaTool.java:187)
        at org.apache.hive.beeline.HiveSchemaTool.testConnectionToMetastore(HiveSchemaTool.java:150)
        at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:279)
        at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:265)
        at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:512)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.sql.SQLException: Access denied for user 'savaway'@'savaway01.nhn-playart.com' (using password: YES)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1084)

$ mysql -h metastore-server -u hive_user --password=hive_password
ERROR 1045 (28000): Access denied for user 'hive_user'@'hive_server' (using password: YES)

$ mysql -h localhost -u hive --password=hive_password
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 30
Server version: 5.5.5-10.0.10-MariaDB Source distribution

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> quit

# Ah. Being able to log in only via localhost -- this is the anonymous-user
# situation described in the MySQL documentation.
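The mechanics, for the record: MySQL sorts account rows by host specificity, and a blank user name matches any user, so the anonymous row ''@'hive_server.example.com' (literal host) wins over 'hive_user'@'%' (wildcard host) for connections from that machine -- and the anonymous account's empty password then rejects ours. A hedged way to confirm the shadowing from the affected client (hostnames as anonymized above):

$ mysql -h metastore-server -u hive_user -e "SELECT USER(), CURRENT_USER()"
# If this connects without any password and CURRENT_USER() reports an
# anonymous account (empty user name), the shadowing is confirmed:
# USER() is who you claimed to be, CURRENT_USER() is the account MySQL
# actually matched.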
$ mysql -u root --password=MariaDB -e "SELECT host, user FROM mysql.user WHERE user=''"
+-------------------------+------+
| host                    | user |
+-------------------------+------+
| localhost               |      |
| hive_server.example.com |      |
+-------------------------+------+

# Delete the password-less anonymous users.
$ mysql -u root --password=MariaDB -e "DROP USER ''@'hive_server.example.com'"
$ mysql -u root --password=MariaDB -e "SELECT host, user FROM mysql.user WHERE user=''"
+-----------+------+
| host      | user |
+-----------+------+
| localhost |      |
+-----------+------+

# That worked.
$ schematool -verbose -dbType mysql -initSchema -dryRun
Metastore connection URL:        jdbc:mysql://hive_server/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       hive_user
Starting metastore schema initialization to 0.12.0
Initialization script hive-schema-0.12.0.mysql.sql
schemaTool completeted

# Initializing the metastore worked too.
$ schematool -verbose -dbType mysql -initSchema
Starting metastore schema initialization to 0.12.0
Initialization script hive-schema-0.12.0.mysql.sql
Connecting to jdbc:mysql://hive_server/metastore?createDatabaseIfNotExist=true
Connected to: MySQL (version 5.5.5-10.0.10-MariaDB)
Driver: MySQL Connector Java (version mysql-connector-java-5.1.30 ( Revision: alexander.soklakov@oracle.com-20140310090514-8xt1yoht5ksg2e7c ))
Transaction isolation: TRANSACTION_READ_COMMITTED
Beeline version 0.12.0-cdh5.0.0 by Apache Hive
0: jdbc:mysql://hive_server/metastore> !autocommit on
Autocommit status: true
0: jdbc:mysql://hive_server/metastore> /*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */
No rows affected (0.002 seconds)
0: jdbc:mysql://hive_server/metastore> !closeall
Closing: com.mysql.jdbc.JDBC4Connection
beeline> Initialization script completed
schemaTool completeted
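As a quick sanity check that -initSchema really wrote the schema, listing the tables in the metastore database should show Hive's bookkeeping tables (a sketch; exact names vary a little between Hive versions, and VERSION is the table schematool's "schema version" check reads):

$ mysql -h metastore-server -u hive_user --password=hive_password metastore -e "SHOW TABLES"
# Expect rows like DBS, TBLS, COLUMNS_V2, PARTITIONS, SDS, VERSION, ...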
It more or less works now, so let's create a database.
# Confirm that hive_user's directory does not exist in HDFS yet.
$ hdfs dfs -ls /user
drwxr-xr-x   - hadoop supergroup          0 2014-04-30 15:57 /user/hadoop

# Create a database in Hive...
$ hive -e "CREATE DATABASE test"
OK
Time taken: 3.74 seconds

# ...and the directory gets created automatically.
$ hdfs dfs -ls /user/hive_user/warehouse
drwxr-xr-x   - hadoop supergroup          0 2014-04-30 18:19 /user/hive_user/warehouse/test.db

# Create a table too, shaped so a TSV file can be loaded straight into it.
$ hive --database=test -e "CREATE TABLE users (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'"
OK
Time taken: 2.933 seconds
OK
Time taken: 0.939 seconds

# Prepare the TSV file.
$ echo -e "2\talice" > /tmp/users.tsv
$ echo -e "5\tbob" >> /tmp/users.tsv
$ cat /tmp/users.tsv
2	alice
5	bob

# Load it in from the TSV file.
$ hive --database=test -e "LOAD DATA LOCAL INPATH '/tmp/users.tsv' INTO TABLE users"
OK
Time taken: 2.842 seconds
Copying data from file:/tmp/users.tsv
Copying file: file:/tmp/users.tsv
Loading data to table test.users
Table test.users stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 14, raw_data_size: 0]
OK
Time taken: 1.354 seconds

# Looks good.
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.113 seconds
OK
id	name
2	alice
5	bob
Time taken: 1.175 seconds, Fetched: 2 row(s)

# Next, have Sqoop create the table in Hive automatically from MySQL.
$ mysql -h metastore-server -u hive_user --password=hive_password test -e "CREATE TABLE users (id INT, name CHAR(10))"
$ mysql -u hive_user --password=hive_password test -e "SELECT COUNT(*) FROM users"
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+

# Prepare some dummy data.
$ mysql -u hive_user --password=hive_password test -e "INSERT INTO users SET id=2, name='alice'"
$ mysql -u hive_user --password=hive_password test -e "INSERT INTO users SET id=5, name='bob'"
$ mysql -u hive_user --password=hive_password test -e "SELECT COUNT(*) FROM users"
+----------+
| COUNT(*) |
+----------+
|        2 |
+----------+

# Check.
$ mysql -u hive_user --password=hive_password test -e "SELECT * FROM users"
+------+-------+
| id   | name  |
+------+-------+
|    2 | alice |
|    5 | bob   |
+------+-------+

The rows we loaded with LOAD DATA earlier are still in Hive as well:

$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 2.974 seconds
OK
id	name
2	alice
5	bob
Time taken: 1.479 seconds, Fetched: 2 row(s)

$ hdfs dfs -ls /user/hive_user/warehouse/test.db/users
Found 1 items
-rwxr-xr-x   3 hadoop supergroup         14 2014-04-30 18:27 /user/hive_user/warehouse/test.db/users/users.tsv
$ hdfs dfs -cat /user/hive_user/warehouse/test.db/users/users.tsv
2	alice
5	bob

# Delete the backing file itself.
$ hdfs dfs -rm /user/hive_user/warehouse/test.db/users/users.tsv
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/users.tsv' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current

$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.181 seconds
OK
id	name
Time taken: 1.06 seconds
# Gone.

$ cp -pv /usr/local/hive/lib/mysql-connector-java-5.1.30-bin.jar /usr/local/sqoop/lib/

# Scoop! (Sqoop, that is.)
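A side note on the rm above: as the Moved: line shows, HDFS only moved users.tsv into the trash directory. When you want the file gone immediately, the standard -skipTrash flag deletes it outright:

$ hdfs dfs -rm -skipTrash /user/hive_user/warehouse/test.db/users/users.tsv
# Deleted immediately, with no copy kept under /user/hadoop/.Trash/Current.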
$ sqoop import --connect "jdbc:mysql://mysql-server:3306/test?zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false" --username hive_user --password hive_password --query "SELECT * FROM users WHERE \$CONDITIONS" --split-by id --target-dir users --hive-table users --hive-database test --hive-import --hive-drop-import-delims --fields-terminated-by "\001"
14/05/02 12:47:44 INFO hive.HiveImport: OK
14/05/02 12:47:44 INFO hive.HiveImport: Time taken: 3.677 seconds
14/05/02 12:47:45 INFO hive.HiveImport: Loading data to table test.users
14/05/02 12:47:45 INFO hive.HiveImport: Table test.users stats: [num_partitions: 0, num_files: 5, num_rows: 0, total_size: 46, raw_data_size: 0]
14/05/02 12:47:45 INFO hive.HiveImport: OK
14/05/02 12:47:45 INFO hive.HiveImport: Time taken: 1.06 seconds
14/05/02 12:47:45 INFO hive.HiveImport: Hive import complete.
14/05/02 12:47:45 INFO hive.HiveImport: Export directory is empty, removing it.

$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.005 seconds
OK
id	name
NULL	NULL
NULL	NULL
Time taken: 1.176 seconds, Fetched: 2 row(s)
# Huh?

$ hdfs dfs -text /user/hive_user/warehouse/test.db/users/part-m-00000.deflate
2alice
# The values are in there (the \001 separator between 2 and alice is simply invisible).

# Ah, it's because the table was created like this -- with a tab delimiter, so Hive
# can't find the \001-separated columns and returns NULLs:
$ hive --database=test -e "CREATE TABLE users (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'"

# Since Sqoop is already writing the default \001, change the table to match.
$ hive --database=test -e "ALTER TABLE users SET SERDEPROPERTIES ('field.delim' = '\001')"
OK
Time taken: 3.158 seconds
OK
Time taken: 0.391 seconds

Delete the files again...

$ hdfs dfs -rm /user/hive_user/warehouse/test.db/users/*
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/_SUCCESS' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00000.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00001.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00002.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00003.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current

# Confirm they are gone...
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.097 seconds
OK
id	name
Time taken: 1.103 seconds

# Scoop!
$ sqoop import --connect "jdbc:mysql://mysql-server:3306/test?zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false" --username hive_user --password hive_password --query "SELECT * FROM users WHERE \$CONDITIONS" --split-by id --target-dir users --hive-table users --hive-database test --hive-import --hive-drop-import-delims --fields-terminated-by "\001"
14/05/02 13:08:37 INFO hive.HiveImport: OK
14/05/02 13:08:37 INFO hive.HiveImport: Time taken: 3.598 seconds
14/05/02 13:08:37 INFO hive.HiveImport: Loading data to table test.users
14/05/02 13:08:38 INFO hive.HiveImport: Table test.users stats: [num_partitions: 0, num_files: 5, num_rows: 0, total_size: 46, raw_data_size: 0]
14/05/02 13:08:38 INFO hive.HiveImport: OK
14/05/02 13:08:38 INFO hive.HiveImport: Time taken: 1.032 seconds
14/05/02 13:08:38 INFO hive.HiveImport: Hive import complete.
14/05/02 13:08:38 INFO hive.HiveImport: Export directory is empty, removing it.

# Check. Phew -- the table was created automatically and the data is in.
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.245 seconds
OK
id	name
2	alice
5	bob
Time taken: 1.242 seconds, Fetched: 2 row(s)

The files are here:
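An alternative to the ALTER TABLE dance above: if you pre-create the Hive table yourself, declare the delimiter to match what Sqoop writes in the first place (a sketch; \001 matches both Sqoop's default for --hive-import and the explicit --fields-terminated-by "\001" used above):

$ hive --database=test -e "CREATE TABLE users (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'"
# With the delimiters in agreement, the import works on the first try.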
$ hdfs dfs -ls /user/hive_user/warehouse/test.db/users/
Found 5 items
-rwxr-xr-x   3 hadoop supergroup          0 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/_SUCCESS
-rwxr-xr-x   3 hadoop supergroup         16 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00000.deflate
-rwxr-xr-x   3 hadoop supergroup          8 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00001.deflate
-rwxr-xr-x   3 hadoop supergroup          8 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00002.deflate
-rwxr-xr-x   3 hadoop supergroup         14 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00003.deflate

# If you see an error like this, set HIVE_HOME properly:
14/05/02 13:46:22 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Cannot run program "hive": error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
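A minimal fix for that error, assuming the install layout used throughout this post (Sqoop's Hive import shells out to the hive command, looking under $HIVE_HOME/bin when HIVE_HOME is set and falling back to plain hive on the PATH otherwise):

# emacs /etc/bashrc
export HIVE_HOME=/usr/local/hive
# Re-login (or source /etc/bashrc) so the sqoop import can find hive.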