I'm looking for a way to read a list of CSV files and convert each row to JSON. Since I cannot know the header names beforehand, each worker must read a CSV file from its beginning; otherwise the header names are unknown.
My plan is to use FileIO.readMatches to get ReadableFile elements and, for each element, read the first line as the header and combine it with every subsequent line to produce a JSON object per row. My questions are:
- Is it safe to assume a ReadableFile always represents a whole file, not a partial file?
- Will this approach require worker memory to be larger than the file size?
- Are there any better approaches?
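
For reference, the per-file logic I have in mind looks roughly like the sketch below. It is plain Java with no Beam dependency, and the names `CsvToJsonSketch`, `convert`, and `escape` are made up for illustration. It assumes simple CSV (no quoted fields, no embedded commas or newlines) and emits every value as a JSON string. Inside a DoFn, I assume the `Reader` could be obtained with something like `Channels.newReader(file.open(), "UTF-8")`, since `ReadableFile.open()` returns a `ReadableByteChannel`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class CsvToJsonSketch {

    // Minimal escaping: only backslashes and double quotes are handled.
    // Control characters are NOT escaped (assumption: the data has none).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Reads the first line as the header, then emits one JSON object per row.
    // The input is streamed: only the header and the current line are held
    // in memory, so the whole file never needs to fit in memory at once.
    static void convert(Reader in, Consumer<String> out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String headerLine = reader.readLine();
        if (headerLine == null) {
            return; // empty file: no header, nothing to emit
        }
        // Naive split; assumes no quoted fields or embedded commas.
        String[] header = headerLine.split(",", -1);

        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split(",", -1);
            StringBuilder json = new StringBuilder("{");
            for (int i = 0; i < header.length; i++) {
                if (i > 0) {
                    json.append(",");
                }
                // Missing trailing cells become empty strings.
                String value = i < cells.length ? cells[i] : "";
                json.append('"').append(escape(header[i])).append("\":\"")
                    .append(escape(value)).append('"');
            }
            out.accept(json.append("}").toString());
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> rows = new ArrayList<>();
        convert(new StringReader("id,name\n1,alice\n2,bob\n"), rows::add);
        rows.forEach(System.out::println);
    }
}
```

Written this way, memory use depends on the longest line rather than on the file size, which is the behavior I am hoping to get from each worker.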
Thanks!