Hiveのパーティションとカラム追加にハマる
まずは普通にTSV形式のファイルから,HiveにLOAD DATAしようとしました.
#テスト用のデータベースを作る $ hive -e 'CREATE DATABASE test' OK Time taken: 3.668 seconds #テーブルが空であることの確認 $ hive -database=test -e 'SHOW TABLES' Time taken: 2.944 seconds OK tab_name Time taken: 0.557 seconds # 3つのカラムからなるテーブルを作成 $ hive -database=test -e "CREATE TABLE tmp_members (id INT, name STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'" OK Time taken: 3.021 seconds OK Time taken: 1.162 seconds #テーブルに流し込むTSVファイルを準備 $ cat /tmp/members.txt 1 bob US 2 carry EU 3 taro JP #流し込む $ hive -database=test -e "LOAD DATA LOCAL INPATH '/tmp/members.txt' INTO TABLE tmp_members" opying data from file:/tmp/members.txt Copying file: file:/tmp/members.txt Loading data to table test.tmp_members Table test.tmp_members stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 30, raw_data_size: 0] OK Time taken: 1.619 seconds #問題なく格納されている $ hive -database=test -e "SELECT id, name, country FROM tmp_members" Job 0: Map: 1 Cumulative CPU: 1.17 sec HDFS Read: 259 HDFS Write: 30 SUCCESS Total MapReduce CPU Time Spent: 1 seconds 170 msec OK id name country 1 bob US 2 carry EU 3 taro JP Time taken: 14.071 seconds #タブ区切りじゃないDBを作る $ hive -database=test -e "CREATE TABLE members (id INT, name STRING) PARTITIONED BY(country STRING) OK #空であることの確認 $ hive -database=test -e "SELECT id, name, country FROM members" MapReduce Jobs Launched: Job 0: Map: 1 Cumulative CPU: 0.88 sec HDFS Read: 278 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 880 msec OK id name country Time taken: 14.242 seconds #TSV区切りのテーブルからバイナリ区切り・パーティション付きのテーブルに移動 $ hive -database=test -e "INSERT OVERWRITE TABLE members PARTITION(country) SELECT id, name, country FROM tmp_members" 3 Rows loaded to members MapReduce Jobs Launched: Job 0: Map: 1 Cumulative CPU: 1.27 sec HDFS Read: 259 HDFS Write: 21 SUCCESS Total MapReduce CPU Time Spent: 1 seconds 270 msec OK id name country Time taken: 15.233 seconds #問題なく格納されていることの確認 $ hive -database=test -e "SELECT id, name, country FROM members" MapReduce Jobs Launched: Job 0: Map: 1 Cumulative CPU: 1.18 sec HDFS Read: 446 HDFS Write: 30 SUCCESS Total MapReduce CPU Time Spent: 1 seconds 180 msec OK id name country 2 carry EU 3 taro JP 1 bob US Time taken: 14.454 seconds #パーティションが出来ていることの確認 $ hdfs dfs -ls /user/hive/warehouse/test.db/members Found 3 items drwxr-xr-x - hadoop supergroup 0 2014-04-28 13:35 /user/hive/warehouse/test.db/members/country=EU drwxr-xr-x - hadoop supergroup 0 2014-04-28 13:35 /user/hive/warehouse/test.db/members/country=JP drwxr-xr-x - hadoop supergroup 0 2014-04-28 13:35 /user/hive/warehouse/test.db/members/country=US #TSV形式の元テーブルではパーティションが無いことの確認 $ hdfs dfs -ls /user/hive/warehouse/test.db/tmp_members Found 1 items -rw-r--r-- 3 hadoop supergroup 30 2014-04-28 13:27 /user/hive/warehouse/test.db/tmp_members/members.txt #一度,パーティションを全て削除する $ hdfs dfs -rm -r /user/hive/warehouse/test.db/members/* Deleted /user/hive/warehouse/test.db/members/country=EU Deleted /user/hive/warehouse/test.db/members/country=JP Deleted /user/hive/warehouse/test.db/members/country=US
↑ここがウカツでした.
#TSV形式のテーブルも一度削除 $ hdfs dfs -rm -r /user/hive/warehouse/test.db/tmp_members/* Deleted /user/hive/warehouse/test.db/tmp_members/members.txt #中身を消したあと,カラムを一つ追加する $ hive -database=test -e "ALTER TABLE tmp_members ADD COLUMNS (age INT)" OK Time taken: 3.027 seconds OK Time taken: 0.545 seconds #新たに流し込むデータを確認 $ cat /tmp/members.txt 1 bob US 38 2 carry EU 20 3 taro JP 26 #流し込む $ hive -database=test -e "LOAD DATA LOCAL INPATH '/tmp/members.txt' INTO TABLE tmp_members" Copying data from file:/tmp/members.txt Copying file: file:/tmp/members.txt Loading data to table test.tmp_members Table test.tmp_members stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 39, raw_data_size: 0] OK Time taken: 1.613 seconds #問題なく格納されていることの確認 $ hive -database=test -e "SELECT id, name, country, age FROM tmp_members" MapReduce Jobs Launched: Job 0: Map: 1 Cumulative CPU: 1.19 sec HDFS Read: 268 HDFS Write: 39 SUCCESS Total MapReduce CPU Time Spent: 1 seconds 190 msec OK id name country age 1 bob US 38 2 carry EU 20 3 taro JP 26 Time taken: 14.219 seconds
次,パーティション付きテーブルの方です.
#カラムを追加 $ hive -database=test -e "ALTER TABLE members ADD COLUMNS (age INT)" OK Time taken: 2.971 seconds OK Time taken: 0.517 seconds #TSV形式テーブルから移動 $ hive -database=test -e "INSERT OVERWRITE TABLE members PARTITION(country) SELECT id, name, age, country FROM tmp_members" MapReduce Jobs Launched: Job 0: Map: 1 Cumulative CPU: 1.26 sec HDFS Read: 268 HDFS Write: 30 SUCCESS Total MapReduce CPU Time Spent: 1 seconds 260 msec OK id name age country Time taken: 15.493 seconds #カラム追加前と全く同じ形式のクエリです. #$ hive -database=test -e "INSERT OVERWRITE TABLE members PARTITION(country) SELECT id, name, country FROM tmp_members" #$ hive -database=test -e "INSERT OVERWRITE TABLE members PARTITION(country) SELECT id, name, age, country FROM tmp_members" #新しいカラムが入っていません $ hive -database=test -e "SELECT id, name, country, age FROM members" MapReduce Jobs Launched: Job 0: Map: 2 Cumulative CPU: 2.87 sec HDFS Read: 1015 HDFS Write: 39 SUCCESS Total MapReduce CPU Time Spent: 2 seconds 870 msec OK id name country age 2 carry EU NULL 3 taro JP NULL 1 bob US NULL Time taken: 15.678 seconds
ちょっと謎でした.
ファイルの中を見ると,追加したカラムageの値「20」が入っています.
$ hdfs dfs -cat /user/hive/warehouse/test.db/members/country=EU/000000_0 2carry20
なので,表示されていないか,hiveコマンドが認識していないかだと思いました.
MySQLのmetaデータを見てみました.
#カラムageのIDは261/266 mysql> select * from COLUMNS_V2 where COLUMN_NAME='age'\g +-------+---------+-------------+-----------+-------------+ | CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX | +-------+---------+-------------+-----------+-------------+ | 261 | NULL | age | int | 3 | | 266 | NULL | age | int | 2 | +-------+---------+-------------+-----------+-------------+ 2 rows in set (0.00 sec) mysql> select LOCATION from SDS where CD_ID=261 or CD_ID=266\g #tmp_members/membersにageは認識されている +------------------------------------------------------------------------+ | LOCATION | +------------------------------------------------------------------------+ | hdfs://WTUALAP1503:8020/user/hive/warehouse/test.db/tmp_members | | hdfs://WTUALAP1503:8020/user/hive/warehouse/test.db/members | +------------------------------------------------------------------------+ 2 rows in set (0.00 sec)
ん?おかしいおかしい.test.db/members/country=EUみたいなパーティションの入ったlocationにもageが認識されていないとなりません.
#もっかいパーティションを消す $ hdfs dfs -rm -r /user/hive/warehouse/test.db/members/* Deleted /user/hive/warehouse/test.db/members/country=EU Deleted /user/hive/warehouse/test.db/members/country=JP Deleted /user/hive/warehouse/test.db/members/country=US #うお.やっぱりmetastoreに情報が残っていて消えていない $ hive -database=test -e "SHOW PARTITIONS members" OK Time taken: 2.941 seconds OK partition country=EU country=JP country=US Time taken: 0.754 seconds #マジメに消す $ hive -database=test -e "ALTER TABLE members DROP PARTITION (country='EU'), PARTITION (country='JP'), PARTITION (country='US')" OK Time taken: 3.053 seconds Dropping the partition country=EU Dropping the partition country=JP Dropping the partition country=US OK Time taken: 1.78 seconds
改めてDBに入れます.
#さっきと同じクエリ $ hive -database=test -e "INSERT OVERWRITE TABLE members PARTITION(country) SELECT id, name, age, country FROM tmp_members" MapReduce Jobs Launched: Job 0: Map: 1 Cumulative CPU: 1.25 sec HDFS Read: 268 HDFS Write: 30 SUCCESS Total MapReduce CPU Time Spent: 1 seconds 250 msec OK id name age country Time taken: 15.204 seconds #追加したカラムageに値が入った $ hive -database=test -e "SELECT id, name, country, age FROM members" MapReduce Jobs Launched: Job 0: Map: 2 Cumulative CPU: 2.85 sec HDFS Read: 1015 HDFS Write: 39 SUCCESS Total MapReduce CPU Time Spent: 2 seconds 850 msec OK id name country age 2 carry EU 20 3 taro JP 26 1 bob US 38 Time taken: 15.763 seconds