0

I am trying to analyse Twitter Data using Cloudera. Currently, I am able to stream Twitter Data into HDFS via Flume but I am experiencing issues when trying to query data using SQL from Hive table getting following exception:

java.io.IOException: org.apache.avro.AvroRuntimeException:   java.io.IOException: Block size invalid or too large for this implementation: -40

Does this mean that the data was loaded into Hive but cannot be queried or was it not loaded into Hive at all?

My flume.conf file is

TwitterAgent.sources = Twitter
TwitterAgent.channels = FileChannel
TwitterAgent.sinks = HDFS

#TwitterAgent.sources.Twitter.type =    com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = FileChannel
TwitterAgent.sources.Twitter.consumerKey = nmmRpbWjQPAViWlJLjkJuq7mO
TwitterAgent.sources.Twitter.consumerSecret = *****
TwitterAgent.sources.Twitter.accessToken = *****
TwitterAgent.sources.Twitter.accessTokenSecret = *****
TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100

#TwitterAgent.sources.Twitter.keywords = Canada, TTC,ttc, Toronto, Free, and, Apache,city, City, Hadoop, Mapreduce, hadooptutorial, Hive, Hbase, MySql

TwitterAgent.sinks.HDFS.channel = FileChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart.cloudera:8020/user/hive/warehouse/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 100

TwitterAgent.channels.FileChannel.type = file
TwitterAgent.channels.FileChannel.checkpointDir = /var/log/flume-ng/checkpoint/
TwitterAgent.channels.FileChannel.dataDirs = /var/log/flume-ng/data/

I have added added JAR file "hive-serdes-1.0-SNAPSHOT.jar"

ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar

my .avsc location is '/home/cloudera/twitterDataAvroSchema.avsc' and having bellow code-

{"type":"record",
 "name":"Doc",
 "doc":"adoc",
 "fields":[{"name":"id","type":"string"},
       {"name":"user_friends_count","type":["int","null"]},
       {"name":"user_location","type":["string","null"]},
       {"name":"user_description","type":["string","null"]},
       {"name":"user_statuses_count","type":["int","null"]},
       {"name":"user_followers_count","type":["int","null"]},
       {"name":"user_name","type":["string","null"]},
       {"name":"user_screen_name","type":["string","null"]},
       {"name":"created_at","type":["string","null"]},
       {"name":"text","type":["string","null"]},
       {"name":"retweet_count","type":["long","null"]},
       {"name":"retweeted","type":["boolean","null"]},
       {"name":"in_reply_to_user_id","type":["long","null"]},
       {"name":"source","type":["string","null"]},
       {"name":"in_reply_to_status_id","type":["long","null"]},
       {"name":"media_url_https","type":["string","null"]},
       {"name":"expanded_url","type":["string","null"]}
      ]
}

Used commend bellow to create hive table

CREATE TABLE my_tweets
  ROW FORMAT SERDE
 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url'='file:///home/cloudera/twitterDataAvroSchema.avsc') ;

Used folloing command to upload data to Hive table

LOAD DATA INPATH '/user/hive/warehouse/tweets/FlumeData.*' OVERWRITE INTO TABLE my_tweets;

== output ===
Loading data to table robin.my_tweets
Table robin.my_tweets stats: [numFiles=1, numRows=0, totalSize=421380, rawDataSize=0]
OK
Time taken: 1.928 seconds

Got Error while Trying SQL from

ERROR

hive> select user_location from robin.my_tweets;
OK

Failed with exception java.io.IOException:org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40

Time taken: 1.247 seconds

I am using Cloureda version=2.6.0-cdh5.5.0

Any assistance on this issue is appreciated.

Thanks
Robin

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Robin
  • 128
  • 1
  • 2
  • 7
  • Please have a look at this post http://stackoverflow.com/questions/30661478/unable-to-correctly-load-twitter-avro-data-into-hive-table – shankarsh15 May 19 '16 at 02:54
  • Thanks Shankarsh, I have looked into mentioned link. However, did not get solution of my issue. Do you have anymore suggestions? – Robin May 19 '16 at 05:05

0 Answers0