I was wondering whether it's possible to parse an arbitrary fixed-width file, without knowing anything about it, and convert it to CSV. My intuition says no, because there could be edge cases. If you know the widths but not the column names, that's fine. If you know the column names, you can figure out the widths, so that's fine too. But if you have neither, I can imagine that with smart enough logic you could do it by reading the file through once before you actually start parsing. Perhaps. But if reading the file only once is also a constraint, then you're out of luck, correct? Also assume the file is being streamed, because it is 50GB and cannot be loaded into memory. So, to go over my goal and constraints:
Goal: To convert a fixed-width file to CSV with no information about it, most notably the column names and the column widths.
Constraints:
1. I expect the file to be very large, so I must stream it rather than load it into memory, and it would be terribly inefficient to read it twice.
2. I have no information on the column names, the widths, or anything really. I am just receiving a fixed-width file.
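For reference, the known-width case that I called fine is easy to stream in a single pass. Here is a minimal sketch in Python; the function name `fixed_width_to_csv` and the example widths are made up for illustration, and it assumes every field is padded with spaces:

```python
import csv
import io
from itertools import accumulate


def fixed_width_to_csv(lines, widths, out):
    """Stream fixed-width lines to CSV, holding one line in memory at a time."""
    # Turn widths like [5, 4] into slice bounds: (0, 5), (5, 9).
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    writer = csv.writer(out)
    for line in lines:
        line = line.rstrip("\r\n")
        # Slice each field out by position and drop the space padding.
        writer.writerow([line[s:e].strip() for s, e in zip(starts, ends)])


# Hypothetical usage: any iterable of lines works, including an open file.
source = io.StringIO("Love Lucy\ndata dat \n")
target = io.StringIO()
fixed_width_to_csv(source, [5, 4], target)
```

Because `lines` can be a file object, nothing beyond the current line is ever held in memory. The part I can't solve is where `widths` comes from.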
Given these constraints, is the goal possible? I know that in the simple case, say something like this:
Love Lucy Is Awesome
data datatat datad datadaa
Well, whatever. Because the column names don't have any spaces in them, it's easy. But what I can't really figure out is a complex case like this:
The Swimming Pool Is Dirty
data data data data
data datada data data data
I can never know whether "Swimming Pool " is one column or whether "Swimming " and "Pool " are two columns until I have gone through the file. If every row has four fields, then "Swimming Pool " is one column; if five, then they are two columns.
In fact, even that's an assumption; perhaps "Pool " is just always null. I mean, even this case:
Swimming Pool
datadatadatat
I don't know enough about fixed-width files, but must there be at least one space between two adjacent fields? If not, this could be translated as:
Swimming, Pool
datadatad,atat
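The "smart enough logic" I have in mind would look roughly like the sketch below: buffer a bounded sample of lines, treat a character position as a separator only if it is blank in every sampled line, then stream the rest using those boundaries. The names `infer_columns` and `convert` and the `sample_size` default are my own inventions, and the whole thing assumes space padding and at least one blank position between columns, which is exactly the assumption I am unsure about.

```python
import csv
import io
from itertools import chain, islice


def infer_columns(sample_lines):
    """Guess (start, end) spans for columns from a sample of lines.

    A position counts as a separator only if it is a space (or past the
    end of the line) in every sampled line.
    """
    lines = [l.rstrip("\r\n") for l in sample_lines]
    width = max((len(l) for l in lines), default=0)
    blank = [all(i >= len(l) or l[i] == " " for l in lines) for i in range(width)]
    spans, start = [], None
    for i, is_blank in enumerate(blank):
        if not is_blank and start is None:
            start = i                      # a column begins here
        elif is_blank and start is not None:
            spans.append((start, i))       # the column ended just before i
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def convert(lines, out, sample_size=1000):
    """Single pass: infer spans from the first sample_size lines, then stream."""
    lines = iter(lines)
    sample = list(islice(lines, sample_size))   # only the sample is buffered
    spans = infer_columns(sample)
    writer = csv.writer(out)
    for line in chain(sample, lines):           # replay the sample, then the rest
        line = line.rstrip("\r\n")
        writer.writerow([line[s:e].strip() for s, e in spans])


# On the example above there is no position that is blank in both lines,
# so the guess is a single 13-character column.
spans = infer_columns(["Swimming Pool", "datadatadatat"])
```

This only reads the file once, but it is a guess: a column that happens to be empty throughout the sample, or a value that runs past an inferred boundary later in the file, would be split wrongly.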
So, my conundrum unfolds to you. Honestly, I am not even sure the simple case is truly simple. Maybe "Lucy Is " is one column. This is my first time dealing with this file type (or even really hearing about it), and I would like a professional's thoughts.