
I was wondering whether it's possible to parse any fixed width file without knowing anything about it and turn it into a CSV. My intuition says no, because there could be edge cases. If you know the widths but not the column names, that's fine. If you know the column names, you can figure out the widths, so that's fine too. But if you have neither, I can imagine that with smart enough logic you could do it by reading the file through once before you actually start parsing. Perhaps. But if reading the file only once is also a constraint, then you're out of luck, correct? Also assume that the file is being streamed, because it is 50GB and cannot be loaded into memory. So, to go over my goal and constraints:

Goal: To successfully convert a fixed width file to CSV while having no information about it, most notably the column names and the column widths.

Constraints: 1. I expect the file to be very large, so I must stream it rather than load it into memory, and it would be terribly inefficient to read it twice. 2. I have no information on the column names, the widths, or anything else really; I am just receiving a fixed width file.

Given these constraints, is the goal possible? I know that in the simple case, say something like this:

Love    Lucy    Is    Awesome    
data    datatat datad datadaa

Well, whatever. Because the column names don't have any spaces in them, it's easy. But what I can't really figure out is a complex case like this:

The   Swimming Pool  Is    Dirty
data  data           data  data
data  datada   data  data  data

I can never know whether "Swimming Pool " is one column or whether "Swimming " and "Pool " are two columns until I have gone through the file. If every row contains four fields, then "Swimming Pool " is one column; if any row contains five, then they are two columns.
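A minimal sketch of that "go through the file first" idea, assuming one record per line and fields separated by at least one space in every row. The function name `blank_column_boundaries` is made up for illustration; it only tracks which character positions have been blank in every row seen so far, so memory use depends on the line length, not the file size.

```python
from typing import Iterable, List, Tuple


def blank_column_boundaries(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (start, end) slices for runs of positions used by at least one row."""
    blank: List[bool] = []  # blank[i] stays True while every row has a space at i
    for line in lines:
        line = line.rstrip("\n")
        if len(line) > len(blank):
            # Earlier, shorter rows had nothing at these positions.
            blank.extend([True] * (len(line) - len(blank)))
        for i, ch in enumerate(line):
            if ch != " ":
                blank[i] = False

    # Collapse the per-position flags into field slices.
    fields: List[Tuple[int, int]] = []
    start = None
    for i, is_blank in enumerate(blank):
        if not is_blank and start is None:
            start = i
        elif is_blank and start is not None:
            fields.append((start, i))
            start = None
    if start is not None:
        fields.append((start, len(blank)))
    return fields
```

The result can change with every additional row, so the slices are only trustworthy once the last row has been seen. That is why a single streaming pass cannot both infer the layout and emit the CSV.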

In fact, even that's an assumption; perhaps "Pool " is just always null. I mean, even this case:

Swimming Pool  
datadatadatat

I don't know enough about fixed width files, but must there be at least one space between two fields? If not, this could be translated as:

Swimming, Pool
datadatad,atat

So, my conundrum unfolds to you. Honestly, I am not even sure the simple case is truly simple. Maybe "Lucy Is" is one column. This is my first time dealing with this file type (or even really hearing about it), and I would like a professional's thoughts.

John Lexus

1 Answer


No !!!

Only the very simplest fixed width files can be parsed in this way. Fixed width files can

  • Have multiple record layouts
  • Contain binary fields
  • Be Cobol files
  • Contain fields that you can only interpret correctly if you know the field definition. For example, decimal points can be assumed, i.e. 12345 could be 123.45, 1.2345, etc.
  • Contain text fields, which are normally left justified

For fixed width files you need a file description (schema).
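As a rough sketch of what a schema buys you: once the widths and implied decimal places are known, the file can be converted in a single streaming pass. The `Field` type, `parse_line` and `convert` below are hypothetical names, and the sketch assumes a text file with one record per line and a single record layout.

```python
import csv
from decimal import Decimal
from typing import Iterable, List, NamedTuple, Optional, TextIO


class Field(NamedTuple):
    name: str                    # column heading written to the CSV
    width: int                   # number of characters the field occupies
    scale: Optional[int] = None  # None: text; otherwise implied decimal places


def parse_line(line: str, schema: List[Field]) -> List[str]:
    """Slice one record according to the schema."""
    values, pos = [], 0
    for field in schema:
        raw = line[pos:pos + field.width]
        pos += field.width
        if field.scale is None:
            values.append(raw.rstrip())  # drop the right-hand padding of text
        else:
            # 12345 with scale 2 becomes 123.45
            values.append(str(Decimal(int(raw)).scaleb(-field.scale)))
    return values


def convert(src: Iterable[str], dst: TextIO, schema: List[Field]) -> None:
    """Stream records from src to dst as CSV, one line at a time."""
    writer = csv.writer(dst)
    writer.writerow([f.name for f in schema])
    for line in src:
        writer.writerow(parse_line(line.rstrip("\n"), schema))
```

Passing an open file object as `src` keeps memory use constant regardless of file size, since only one line is held at a time.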

Cobol File

A common source of fixed width files is Cobol applications. Cobol fixed width files

  • Never have column headings
  • Generally have no space between fields
  • Can have binary fields
  • Have assumed decimal points
  • Can use zoned decimal
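To illustrate why such fields cannot be guessed, here is a small sketch that decodes a zoned decimal field. It assumes the common EBCDIC convention (one digit per byte in the low nibble, sign in the high nibble of the last byte, with 0xD meaning negative); other compilers and ASCII files may use different sign conventions. The name `decode_zoned` and the `scale` parameter are made up for illustration, and the scale has to come from the file description.

```python
from decimal import Decimal


def decode_zoned(raw: bytes, scale: int = 0) -> Decimal:
    """Decode an EBCDIC zoned decimal field with `scale` implied decimal places."""
    if not raw:
        raise ValueError("empty field")
    number = 0
    for byte in raw:
        digit = byte & 0x0F  # the low nibble holds the digit
        if digit > 9:
            raise ValueError(f"invalid digit nibble in byte 0x{byte:02X}")
        number = number * 10 + digit
    value = Decimal(number).scaleb(-scale)  # apply the assumed decimal point
    # The high nibble of the last byte carries the sign (assumed: 0xD = negative).
    return -value if raw[-1] >> 4 == 0xD else value
```

For example, the bytes F1 F2 F3 F4 C5 with a scale of 2 decode to 123.45, while the same bytes with a scale of 4 decode to 1.2345. Nothing in the bytes themselves says which one is right.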

Have a look at the file in this question.

Software

  • Microsoft Excel / Access and most spreadsheets have fixed width import wizards
  • RecordEditor/Recsveditor have wizards for fixed width files and can also edit fixed width files
Bruce Martin