
I am working on HDP (Hortonworks Data Platform), trying to collect tweets through Flume and query the stored data from Hive.

The problem is that select * from tweetsavro limit 1; works, but select * from tweetsavro limit 2; fails with:

Failed with exception java.io.IOException:org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40

What I did is described in this answer, namely:

twitter.conf

TwitterAgent.sources = Twitter 
TwitterAgent.channels = MemChannel 
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = xxx
TwitterAgent.sources.Twitter.consumerSecret = xxx
TwitterAgent.sources.Twitter.accessToken = xxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxx

TwitterAgent.sinks.HDFS.type = hdfs 
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://sandbox.hortonworks.com:8020/user/flume/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream 
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.serializer = Text

TwitterAgent.channels.MemChannel.type = memory 
TwitterAgent.channels.MemChannel.capacity = 10000 
TwitterAgent.channels.MemChannel.transactionCapacity = 1000

TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
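For reference, the agent is then started with something like the command below (the --conf directory and the logger option are environment-specific; on the HDP sandbox the Flume configuration directory is typically /etc/flume/conf):

flume-ng agent \
  --conf /etc/flume/conf \
  --conf-file twitter.conf \
  --name TwitterAgent \
  -Dflume.root.logger=INFO,console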

twitter.avsc was created with the following command:

java -jar avro-tools-1.7.7.jar getschema FlumeData.1503479843633 > twitter.avsc
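To check whether the Avro container itself is readable (and not just the Hive mapping), the same avro-tools jar can dump the records as JSON; if this also breaks on the second record, the file written by Flume is corrupt:

java -jar avro-tools-1.7.7.jar tojson FlumeData.1503479843633 | head -n 2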

I created the table with:

CREATE TABLE tweetsavro
  ROW FORMAT SERDE
     'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
     'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
     'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url'='hdfs://sandbox.hortonworks.com:8020/user/flume/twitter.avsc') ;
LOAD DATA INPATH 'hdfs://sandbox.hortonworks.com:8020/user/flume/twitter_data/FlumeData.*' OVERWRITE INTO TABLE tweetsavro;
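After the load, the schema that Hive derived from twitter.avsc can be checked with:

DESCRIBE tweetsavro;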

Remarks:

  • I tried an external table (instead of a managed one), but the situation did not change. (A sketch of that attempt is shown after this list.)
  • Because I use Hortonworks, I do not use Cloudera's TwitterSource.
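A sketch of the external-table attempt mentioned above (the only changes from the managed table are the EXTERNAL keyword and a LOCATION pointing at the Flume output directory):

CREATE EXTERNAL TABLE tweetsavro
  ROW FORMAT SERDE
     'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
     'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
     'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/flume/twitter_data/'
  TBLPROPERTIES ('avro.schema.url'='hdfs://sandbox.hortonworks.com:8020/user/flume/twitter.avsc');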

1 Answer


Add this to your configuration file:

TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000
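These are properties of org.apache.flume.source.twitter.TwitterSource, so they belong next to the other TwitterAgent.sources.Twitter.* keys, roughly like this:

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = xxx
TwitterAgent.sources.Twitter.consumerSecret = xxx
TwitterAgent.sources.Twitter.accessToken = xxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxx
TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000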