
I'm looking for a way to read from a list of CSV files and convert each row into JSON format. Since I cannot get the header names beforehand, I must ensure that each worker reads from the beginning of a CSV file; otherwise we won't know the header names.

My plan is to use FileIO.readMatches to get ReadableFile elements and, for each element, read the first line as the header and combine the header with each remaining line into JSON. My questions are:

  • Is it safe to assume ReadableFile will always contain a whole file, not a partial file?
  • Will this approach require worker memory to be larger than file size?
  • Any other better approaches?
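For concreteness, the per-file logic described above (treat the first line as the header, turn each remaining line into a JSON object) can be sketched without Beam; the function name `csv_lines_to_json` and the sample data are illustrative, not part of any library:

```python
import csv
import io
import json

def csv_lines_to_json(lines):
    """Parse CSV lines: the first row is the header, and each
    remaining row becomes one JSON object keyed by header names."""
    reader = csv.reader(lines)
    header = next(reader)  # first line: column names
    for row in reader:
        yield json.dumps(dict(zip(header, row)))

# Simulate the contents of one whole file with StringIO.
sample = io.StringIO("name,age\nalice,30\nbob,25\n")
records = list(csv_lines_to_json(sample))
```

In a Beam pipeline, the same loop would live inside a DoFn that receives a ReadableFile and reads its contents line by line.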

Thanks!

Mingxi
  • Does this answer your question? [Reading CSV header with Dataflow](https://stackoverflow.com/questions/41297704/reading-csv-header-with-dataflow) – rmesteves Jun 29 '20 at 07:34

1 Answer

  1. Yes, ReadableFile will always give you a whole file.

  2. No. Since you go through the file line by line, you read the first line to determine the columns, then read each subsequent line and emit a row; you never need to hold the whole file in memory at once.

  3. This seems like the right approach to me, unless you have only a few very large files (GBs or TBs each), since each file is read by a single worker. If you have at least a dozen, or a few dozen, files, you should be fine.
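To illustrate point 2, a generator that reads from disk holds only the header and the current row in memory, so usage stays roughly constant regardless of file size. This is a hedged, Beam-free sketch; `stream_csv_as_json` and the temp-file setup are illustrative only:

```python
import csv
import json
import tempfile

# Write a small CSV to disk to stand in for one matched file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("id,city\n1,Paris\n2,Lima\n")
    path = f.name

def stream_csv_as_json(path):
    """Yield one JSON string per data row. Only the header and the
    current row are in memory, independent of total file size."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        for row in reader:
            yield json.dumps(dict(zip(header, row)))

rows = list(stream_csv_as_json(path))
```

Collecting into a list here is just for demonstration; in a pipeline you would yield each element downstream instead.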

An extra tip: it may be convenient to insert an apply(Reshuffle.viaRandomKey()) between your CSV parser and your next transform. This shuffles the output of each file across multiple workers, giving you more parallelism downstream.

Good luck! Feel free to ask follow up questions in the comments.

Pablo