3

I have a CSV file that contains grain size analysis data. The type of data isn't particularly important for my question - I think the question applies to spreadsheets of data values in general. One of the columns ("mode in phi") that is returned from the lab analysis can contain multiple values if the sample is multi-modal. Usually the highest number of mode values is 3.

Is it better to store the values as a list in a single column or multiple columns with a single value in each column (with "NA" when necessary) for this type of data structure? Is there another option I'm unaware of?

Pros and cons I've considered:
Single column pros: nice to have a single column, values are separated with a semicolon so they're easily distinguished from comma-delimited columns and could be parsed programmatically.
Single column cons: less machine readable because the cell is read as a string rather than numbers.
Multi-column pros: each cell has a single value so it's easily read.
Multi-column cons: how would a user/machine know how many columns of "mode" there would be -- could be different different between different datasets. Could expand to many columns, potentially. Lots of "NA" values.

After Googling, I saw this SO post and read about first normal form (FNF), but I'm not sure if FNF applies to a single CSV file rather than a relational database. Are there any other standards or recommendations for CSVs of single data files?

I know there are lots of similar questions on SO, but mostly about how to split the multiple values or questions specific to databases. I couldn't find much particular to a single CSV.

Evan
  • 1,960
  • 4
  • 26
  • 54

0 Answers0