How I tried (and failed) to build a Hadoop cluster without HDFS

First, I tried changing core-site.xml.

<property>
  <name>fs.defaultFS</name>
<!--
  <value>file:///</value>
-->
  <value>s3n://XXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYY@mybacket-01</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
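
As a quick sanity check (hdfs getconf ships with Hadoop 2.x, if I remember right), the effective value can be queried, and it should echo the s3n:// URI configured above:

$ hdfs getconf -confKey fs.defaultFS
s3n://XXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYY@mybacket-01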

Also, the property keys looked wrong to me, so I changed those too.

<property>
<!--
  <name>fs.s3.impl</name>
-->
  <name>fs.AbstractFileSystem.s3.impl</name>
  <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
  <description>The FileSystem for s3: uris.</description>
</property>

<property>
<!--
  <name>fs.s3n.impl</name>
-->
  <name>fs.AbstractFileSystem.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  <description>The FileSystem for s3n: (Native S3) uris.</description>
</property>
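
For comparison, as far as I can tell the only fs.AbstractFileSystem.* binding shipped in core-default.xml is the HDFS one, and its value is an AbstractFileSystem subclass rather than a FileSystem implementation, which already suggests my values above may be the wrong kind of class:

<property>
  <name>fs.AbstractFileSystem.hdfs.impl</name>
  <value>org.apache.hadoop.fs.Hdfs</value>
</property>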

The hdfs command works just fine.

$ hdfs dfs -mkdir input
$ hdfs dfs -ls
14/04/02 02:17:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxrwxrwx   -          0 1970-01-01 00:00 input
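
Out of curiosity, the bucket can also be inspected out-of-band to confirm that hdfs dfs is really writing to S3 rather than to a local filesystem (a hypothetical check, assuming s3cmd is installed and configured with the same credentials):

$ s3cmd ls s3://mybacket-01/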

The example job won't run, though.

$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.5.0.jar pi 2 2
Number of Maps  = 2
Samples per Map = 2
14/04/03 19:10:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
14/04/03 19:10:13 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/04/03 19:10:13 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/04/03 19:10:13 INFO mapreduce.Cluster: Failed to use org.apache.hadoop.mapred.YarnClientProtocolProvider due to error: java.lang.NoSuchMethodException: org.apache.hadoop.fs.s3.S3FileSystem.<init>(java.net.URI, org.apache.hadoop.conf.Configuration)
14/04/03 19:10:13 ERROR security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
        at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:122)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:84)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:77)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1239)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1235)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapreduce.Job.connect(Job.java:1234)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1263)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1287)
        at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
        at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:351)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:360)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

It says java.lang.NoSuchMethodException: org.apache.hadoop.fs.s3.S3FileSystem, but the class itself is right there in both the source and the jar. Reading the message again, though, what's reported missing is not the class but a constructor, <init>(java.net.URI, org.apache.hadoop.conf.Configuration). My guess is that fs.AbstractFileSystem.*.impl expects an AbstractFileSystem subclass, which Hadoop instantiates through exactly that constructor, while S3FileSystem is a plain FileSystem with only a no-arg constructor.
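
The failing lookup is easy to reproduce outside Hadoop. This is only a sketch of my understanding, compiled against the Hadoop jars; it performs the same reflective constructor lookup that (I believe) the fs.AbstractFileSystem.*.impl machinery does when creating the configured class.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;

public class CtorCheck {
    public static void main(String[] args) throws Exception {
        // AbstractFileSystem implementations are instantiated via reflection
        // through a (URI, Configuration) constructor.
        Class<?> clazz = Class.forName("org.apache.hadoop.fs.s3.S3FileSystem");
        // S3FileSystem is a FileSystem: it is built with a no-arg constructor
        // followed by initialize(URI, Configuration) instead, so this lookup
        // throws the same NoSuchMethodException that appears in the log above.
        System.out.println(clazz.getDeclaredConstructor(URI.class, Configuration.class));
    }
}

Compiling and running it with the cluster classpath (javac -cp "$(hadoop classpath)" CtorCheck.java) throws the exception immediately.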

Let's try Hive next.

$ hive -e 'show databases'
Logging initialized using configuration in jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_78148c99-e93f-4634-b292-20a2e73904cd_398905190.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.0.0-cdh4.5.0/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
OK
database_name
default

$ hive -e "create database user_db"
Logging initialized using configuration in jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_ef5dc454-70ec-4dc0-b444-0ba8de46160a_1810542074.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.0.0-cdh4.5.0/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
OK
Time taken: 5.315 seconds

$ hive -e "create table user_db.user (id int, name string) row format delimited fields terminated by '\t'"
Logging initialized using configuration in jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_25cba8fe-84b4-4520-858a-14d0f64c36dc_262885078.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.0.0-cdh4.5.0/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
OK
Time taken: 6.697 seconds

So creating databases and tables works. Next I prepare a TSV file and LOAD DATA it into the table.

$ cat /tmp/dat.tsv 
123     abc
345     cde

$ hive -e 'load data local inpath "/tmp/dat.tsv" into table user_db.user'
Logging initialized using configuration in jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_358acc9b-afae-4970-a7a3-30b762630615_1611174169.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.0.0-cdh4.5.0/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Copying data from file:/tmp/dat.tsv
Copying file: file:/tmp/dat.tsv
Loading data to table user_db.user
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
Table user_db.user stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 16, raw_data_size: 0]
OK
Time taken: 8.7 seconds

A strange error shows up in the middle. It looks like Hive shells out to the hadoop command here without passing the group argument to -chgrp.
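
The same message can be reproduced by hand by passing -chgrp an empty group (the argument is rejected before any path is touched, so the path here is just a placeholder):

$ hadoop fs -chgrp '' input
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...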


Checking the table contents, though, the data did make it in.

$ hive -e 'select * from user_db.user'
Logging initialized using configuration in jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_c15e49a8-8b96-4cdd-b5b3-c27ac8e27181_772650078.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.0.0-cdh4.5.0/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
OK
id      name
123     abc
345     cde
Time taken: 6.485 seconds

However, Hive executes overly simple queries like SELECT * on its own, bypassing Hadoop entirely.
So as soon as I change the query even slightly, it fails.
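
If I understand it correctly, this is Hive's fetch-task optimization (hive.fetch.task.conversion, default minimal in 0.10): filter-free SELECT * queries are answered by reading the table's storage directly, so no MapReduce job is ever submitted. Assuming that's the mechanism, the active setting can be checked like this:

$ hive -e 'set hive.fetch.task.conversion;'
hive.fetch.task.conversion=minimal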

$ hive -e "select * from user_db.user where name like '%a%'"                                                                                                                     
Logging initialized using configuration in jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_980441f2-46a0-4447-aeb3-e286d513f92d_1468706777.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.0.0-cdh4.5.0/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hive-0.10.0-cdh4.5.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
        at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:122)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:84)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:77)
        at org.apache.hadoop.mapred.JobClient.init(JobClient.java:478)
        at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:457)
        at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:426)
        at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:138)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:66)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1383)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1169)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:982)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:347)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:706)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Job Submission failed with exception 'java.io.IOException(Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask

Cannot initialize Cluster again...

I don't really understand this area yet, but
http://shun0102.net/?p=198
offers this explanation:

However, if you don't point fs.default.name at HDFS, the NameNode and DataNode won't start and HDFS can't be used; so to use both together, you need to set only awsAccessKeyId and awsSecretAccessKey and leave the default pointing at HDFS.

Given that, I'm thinking of changing the part below and trying again. Since I'll be handling large files, I wonder if s3:// would suit better than s3n://.

<property>
  <name>fs.defaultFS</name>
<!--
  <value>file:///</value>
-->
  <value>s3n://XXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYY@mybacket-01</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
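
For the record, my reading of the quoted advice is a core-site.xml roughly like this sketch: keep HDFS as the default filesystem and supply the S3 credentials separately. The NameNode address is hypothetical; fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are the credential properties for s3n:// (fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey are the s3:// equivalents).

<property>
  <name>fs.defaultFS</name>
  <!-- hypothetical NameNode address -->
  <value>hdfs://namenode:8020</value>
</property>

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>XXXXXXXXXXXXXXXXX</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YYYYYYYYYYYYYYYYY</value>
</property>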