I'm using Kafka to send a file with 3 columns using Spark streaming 1.3 to insert into HBase. This is how my HBase looks like :
ROW COLUMN+CELL
zone:bizert column=travail:call, timestamp=1491836364921, value=contact:numero
zone:jendouba column=travail:Big data, timestamp=1491835836290, value=contact:email
zone:tunis column=travail:info, timestamp=1491835897342, value=contact:num
3 row(s) in 0.4200 seconds
And this is how I read data with spark streaming, I'm using spark-shell
:
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
val ssc = new StreamingContext(sc, Seconds(10))
val topicSet = Set ("zed")
val kafkaParams = Map[String, String]("metadata.broker.list" -> "xx.xx.xxx.xx:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
lines.foreachRDD(rdd => { (!rdd.partitions.isEmpty)
lines.saveAsTextFiles("hdfs://xxxxx:8020/user/admin/zed/steams3/")
})
this code is working when I'm saving data into HDFS even it save many empty data to HDFS. before writing this question I was searching here and some other question like mine but I didn't get a good solution.
May you propose the best way to do this?. This is how my code look now
val sc = new SparkContext("local", "Hbase spark")
val tableName = "notz"
val conf = HBaseConfiguration.create()
conf.addResource(new Path("file:///opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/etc/hbase/conf.dist/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
lines.foreachRDD(rdd => { (!rdd.partitions.isEmpty)
if(!admin.isTableAvailable(tableName)) {
print("Creating GHbase Table")
val tableDesc = new HTableDescriptor(tableName)
tableDesc.addFamily(new HColumnDescriptor("zone"
.getBytes()))
admin.createTable(tableDesc)
}else{
print("Table already exists!!")
}
val myTable = new HTable(conf, tableName)
// i'm blocked here
})