
I need to consider how to write my data to Hadoop.

I'm using Spark to consume messages from a Kafka topic; each message is a JSON record.

I have around 200B records per day.

The data fields may change (not a lot, but they may change in the future).

I need fast writes, fast reads, and a small footprint on disk.

What should I choose? Avro or Parquet?

I also read the following: https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/hadoop-file-formats-its-not-just-csv-anymore and Avro v/s Parquet,

but I still have no idea what to choose.

Any suggestions?

Ya Ko
  • Maybe both. Look into _Hoodie_, by Uber -- why they needed a datastore for "hot" data, including mutations (update/delete operations), plus another read-optimised datastore for "cold" data, with incremental merge of "hot" and "cold"; plus an abstraction on top to tap on both when reading. Just what HBase or Cassandra or RocksDB do, but for random key/value access, while Uber needed it for batch reads and analytics. – Samson Scharfrichter Jul 01 '18 at 16:10
    Also, JSON is verbose. Very verbose. At massive scale, Kafka may start choking on the sheer volume -- unless you switch to AVRO or something similar (Criteo chose Protobuf) or find out what is the best compression option (something CloudFlare did, https://blog.cloudflare.com/squeezing-the-firehose/) – Samson Scharfrichter Jul 01 '18 at 16:31

1 Answer


If you care about storage and query performance, the optimal storage types, in order, are:

  • ORC
  • Parquet
  • Avro
  • JSON
  • CSV/TSV (plain structured text)
  • unstructured text

If you are limited on disk space and willing to sacrifice retrieval speed, Snappy or Bzip2 compression would be best, with Bzip2 compressing more tightly.
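As a rough sketch of setting those codecs on a Spark write (Scala); the paths and the already-parsed DataFrame `df` are assumptions for illustration:

```scala
// Assumes an existing DataFrame `df` of already-parsed records; paths are illustrative.

// Columnar format with a fast codec: a good balance of on-disk size and read speed.
df.write
  .option("compression", "snappy")
  .parquet("/data/events/parquet")

// JSON lines with bzip2: smaller on disk, but much slower to write and to read back.
df.write
  .option("compression", "bzip2")
  .json("/data/events/json_bzip2")
```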

Typically, I see people write JSON data directly to Hadoop, then run a batch job to convert it daily into a more optimal format (for example, because HDFS prefers a small number of very large files over lots of tiny ones).
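A minimal sketch of such a daily conversion job in Spark (Scala); the paths, the `timestamp` field, and the choice of ORC are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder().appName("daily-json-to-orc").getOrCreate()

// Read one day of raw JSON line files (path and the `timestamp` field are assumed).
val raw = spark.read.json("/raw/events/2018-07-01")

// Rewrite as a columnar format, partitioned by event date, in a small number of large files.
raw
  .withColumn("event_date", to_date(col("timestamp")))
  .write
  .partitionBy("event_date")
  .format("orc")
  .save("/warehouse/events_orc")
```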

If you care about retrieval speed, use HBase or some other database (Hive is not a database), but at the very least, you will need to compact streaming data into larger time chunks according to your business needs.
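For the compaction step, one common pattern is to periodically re-read a time chunk of small streaming output files and rewrite it as a few large ones. A hedged sketch, reusing the `spark` session from above; the paths and target file count are made up:

```scala
// Re-read one day of small streaming output files...
val smallFiles = spark.read.orc("/warehouse/events_orc/event_date=2018-07-01")

// ...and rewrite them as a handful of large files (tune the count toward the HDFS block size).
smallFiles
  .coalesce(8)
  .write
  .mode("overwrite")
  .orc("/warehouse/streaming_compacted/event_date=2018-07-01")
```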

Avro natively supports schema evolution, and if you are able to install the Confluent Schema Registry alongside your existing Kafka cluster, then you can just use Kafka HDFS Connect to write Parquet immediately from Avro (or JSON, I think, assuming you have a schema field in the message) into HDFS, along with a Hive table.

Other options include Apache NiFi or StreamSets. In other words, don't reinvent the wheel by writing Spark code to pull data from Kafka into HDFS.

OneCricketeer
  • Hi, thanks for your answer. About "I see people write JSON data directly to Hadoop, then batch a job to convert it daily": can I write the JSON to a "temp" table with one column that contains only the JSON string, and then have a job that converts it into my table? Is that recommended in terms of performance? – Ya Ko Jul 02 '18 at 07:41
  • I would suggest you use a JSONSerde in Hive rather than a string column, but you don't "need" a table. You could just write the JSON, then have Spark or Pig, for example, process it into another table themselves – OneCricketeer Jul 02 '18 at 08:15
  • You mean that I can just write the JSON to a table that I define with the JsonSerDe and a JSON column, and then have Spark/Pig process the JSON fields into another table? – Ya Ko Jul 02 '18 at 08:41
  • You write plaintext (in JSON form) to HDFS. You optionally create a Hive table over that using a JsonSerde. From there, you can query and parse using any Hive compatible library. Or you can use SparkSQL to read back the JSON directly in HDFS, skipping Hive, using a given or inferred Schema – OneCricketeer Jul 02 '18 at 09:05
  • Oh ok. And what about read performance? Is it better to create a table with a JSON field and write it with SparkSQL as a Parquet file? – Ya Ko Jul 02 '18 at 09:31
  • Again, ORC is better for reading, and Spark has ORC support. You can read JSON data in Spark, apply a struct schema on it, then `write.format("orc").save("path")` – OneCricketeer Jul 02 '18 at 09:33
  • You shouldn't save a JSON string within any format. Hive is not strictly tabular; it can have nested structs, maps, and arrays. Don't be afraid to parse the JSON into a struct object ahead of saving the dataframe (see the sketch below) – OneCricketeer Jul 02 '18 at 09:35
  • Ok, Thank you!! – Ya Ko Jul 02 '18 at 09:36
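Tying the comments together, here is a hedged sketch of that last suggestion in Spark (Scala): parsing a single JSON-string column into a struct with `from_json` and saving it as ORC. The schema fields, paths, and the `spark` session are assumptions for illustration:

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Hypothetical schema for the JSON payload; derive the real one from your actual records.
val eventSchema = new StructType()
  .add("id", LongType)
  .add("user", StringType)
  .add("ts", TimestampType)

// spark.read.text yields a single string column named "value" holding each JSON line.
val rawJson = spark.read.text("/raw/events/2018-07-01")

// Parse the JSON string into a struct and flatten it into top-level columns.
val parsed = rawJson
  .select(from_json(col("value"), eventSchema).as("event"))
  .select("event.*")

parsed.write.format("orc").save("/warehouse/events_from_json")
```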