CSV with multiple datasets/different-number-of-columns

Question

Similar to "How can you read a CSV file in R with different number of columns", I have some complex CSV files. Mine come from SAP BusinessObjects and pose challenges different from those in the linked question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many such files, but let's start with one of them.

Given: One CSV file containing several flat tables.

Wanted: Several data frames, or some other structure, holding all the data (S4?)

The method so far:

1. Get the line numbers of the header lines by counting the number of columns on each line.
2. Get the headers by reading every line whose index is held in the vector from step 1.
3. Read the data by calculating skip and nrows between datasets from the header line indices.
4. Give the data the column names read from the header.

I need help getting on the right track: avoiding loops and making the code more readable and compact when reading headers and datasets.

These CSVs are formatted as normal CSVs, except that they contain a more or less arbitrary number of subtables. The structure is different for each dataset I export. In the current example I will suppose there are five tables included in the CSV.

To give you an idea, here is some fictitious sample data with line numbers. Separators and quotes have been stripped:

1:     n, Name, Species, Description, Classification
2:     90, Mickey, Mouse, Big ears, rat
3:     45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
33212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14

There are a number of ways going about reading each data set. What I have come up with so far:

filepath <- file.path(Sys.getenv("USERPROFILE"), "SAMPLE.CSV")

# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")

# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)

# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
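To make the three vectors concrete, here is a minimal sketch on a seven-line stand-in file. The temporary file and its contents are made up for illustration and are not the real export. One assumption built into match() is worth noting: it returns only the first line with each field count, so two tables with the same number of columns would be detected as a single table.

```r
# Hypothetical miniature file: three tables with 3, 2 and 1 columns
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "n,Name,Species",
  "90,Mickey,Mouse",
  "45,Minnie,Mouse",
  "Code,Species",
  "RT,rat",
  "Last update",
  "2017/05/09 02:09:14"
), tmp)

# Number of fields on every line of the file
fieldVector <- utils::count.fields(tmp, sep = ",", quote = "\"")

# Distinct field counts, in order of first appearance
nFields <- base::unique(fieldVector)

# First line on which each field count occurs, i.e. the header lines
iHeaders <- base::match(nFields, fieldVector)

print(fieldVector)  # 3 3 3 2 2 1 1
print(iHeaders)     # 1 4 6
```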

With this, I can do things like:

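# Header line of the fourth dataset: skip everything before it and read one line
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] - 1, nrows = 1, stringsAsFactors = FALSE)

# Data rows of the fourth dataset: skip up to and including the header line
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5] - iHeaders[4] - 1)

names(data) <- as.character(unlist(header, use.names = FALSE))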

As mentioned in the intro of this post, I have made a couple of functions which make it easier to get the headers for each dataset:

Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]

I have two functions now. One is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe header names (no æøå % etc).

The other returns a list of string vectors holding all headers:

GetHeaders <- function(filepath, linenums) {
    # init an empty list of length(linenums)
    l.headers <- vector(mode = "list", length = length(linenums))

    for (i in seq_along(linenums)) {
        # read.csv2(filepath, skip = linenums[i]-1, nrows = 1)
        l.headers[[i]] <- GetHeader(filepath, linenums[i])
    }
    l.headers
}
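Since the goal is to avoid explicit loops, the same list can be built with lapply. The sketch below is only an illustration: GetHeader is not shown in the post, so the version here is a hypothetical stand-in (it uses read.csv and make.names, which may differ from the real function), and the temporary file is made-up test data.

```r
# Hypothetical stand-in for GetHeader: read one line and turn its
# fields into syntactically safe names
GetHeader <- function(filepath, linenum) {
  line <- utils::read.csv(filepath, header = FALSE, skip = linenum - 1,
                          nrows = 1, colClasses = "character")
  make.names(as.character(unlist(line, use.names = FALSE)))
}

# Loop-free variant: one character vector of names per header line
GetHeaders <- function(filepath, linenums) {
  lapply(linenums, function(i) GetHeader(filepath, i))
}

# Made-up miniature file with header lines at 1, 4 and 6
tmp <- tempfile(fileext = ".csv")
writeLines(c("n,Name,Species", "90,Mickey,Mouse", "45,Minnie,Mouse",
             "Code,Species", "RT,rat",
             "Last update", "2017/05/09 02:09:14"), tmp)

Headers <- GetHeaders(tmp, c(1, 4, 6))
print(Headers[[1]])  # "n" "Name" "Species"
print(Headers[[3]])  # "Last.update"
```

lapply already returns a list of the right length, so the pre-allocation step is no longer needed.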

What I struggle with is how to read in all the datasets in one go. The last dataset is especially hard to handle in a common function, because there I only know the line number of the header and not the number of data lines that follow.
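One possible way around the last-dataset problem is to treat the total number of lines as the end of the final table, since count.fields already returns one entry per line. The sketch below reuses the header detection from earlier in the post and assumes there are no blank lines in the file (count.fields skips them by default, which would shift the indices). ReadAllTables is a hypothetical name, and the temporary file is made-up test data.

```r
# Hypothetical helper: return a list with one data frame per subtable
ReadAllTables <- function(filepath, sep = ",", quote = "\"") {
  fieldVector <- utils::count.fields(filepath, sep = sep, quote = quote)
  iHeaders <- match(unique(fieldVector), fieldVector)

  # Each table ends one line before the next header; the last table
  # ends at the last line of the file
  iEnds <- c(iHeaders[-1] - 1, length(fieldVector))

  # Read the file once and parse each block of lines separately
  lines <- readLines(filepath)
  Map(function(from, to) {
    utils::read.csv(text = lines[from:to], header = TRUE, sep = sep,
                    quote = quote, stringsAsFactors = FALSE)
  }, iHeaders, iEnds)
}

# Made-up miniature file with three subtables
tmp <- tempfile(fileext = ".csv")
writeLines(c("n,Name,Species", "90,Mickey,Mouse", "45,Minnie,Mouse",
             "Code,Species", "RT,rat",
             "Last update", "2017/05/09 02:09:14"), tmp)

tables <- ReadAllTables(tmp)
str(tables)
```

Because each block is parsed with header = TRUE, the column names come from the header line directly and no separate header pass is needed in this variant.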

Also, what is the best data structure for data like this? The subtables are all related to each other (they can be used to normalize parts of the data). I understand that I must do some manual work for each CSV I read, but as I have to read TONS of these files, some common functions that structure them predictably on each pass would be excellent.
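One common option, offered here only as a sketch, is a plain named list of data frames: it keeps the related subtables together in one object while still allowing base functions such as merge to join them. The table names (animals, species) and the choice of join columns are assumptions for illustration, using values from the sample data above.

```r
# Hypothetical container: one named element per subtable
tables <- list(
  animals = data.frame(Name = c("Mickey", "Minnie"),
                       Classification = c("rat", "rat"),
                       stringsAsFactors = FALSE),
  species = data.frame(Code = "RT", Species = "rat",
                       stringsAsFactors = FALSE)
)

# Look up the species code for every animal via the shared value
joined <- merge(tables$animals, tables$species,
                by.x = "Classification", by.y = "Species")
print(joined)
```

An S4 class could wrap the same list if validation of the contents is needed, but the list alone is usually enough to start with.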

Before you answer, please keep in mind that, no, using a different export format is not an option.

Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all possible solutions in this particular domain.

Does your csv data really have the ... in it? If not, is there any other separator between the tables? If so, it should be straightforward to use regex to extract each table. Also, in your post you said "here is some fictitious sample data with line numbers. Separator and quote has been stripped" - I think it will go better if you post something that exactly matches your raw data structure. - Ian Wesley, May 09 '17 at 15:07
Curious, how did you get a CSV in such a format? Are you dumping out a [SAP BO *report*](http://stackoverflow.com/questions/30323923/how-to-export-csv-from-business-object-report) instead of data? I would check the source before running intricate R cleanup. Usually CSVs contain one dataset at a time. - Parfait, May 09 '17 at 18:25
Ian: no, the ... signifies the lines in between. Using real data would not make sense, as there are 23 columns in the first dataset. I can write a script that generates dummy data if that helps. There are no gaps in between the datasets. Wouldn't a RegEx have to read and parse every line individually? Parfait: see the second-to-last paragraph. - Espen Rosenquist, May 09 '17 at 19:54
Oh, Parfait. Yes, it is partly due to the version we have, but also to limitations on user choice. Exporting a report is possible, but only as a pivoted Excel 2003 file that is a nightmare to treat (joined cells, empty lines, changing intervals, shortened/rounded numbers) or as a txt file with no delimiters. Besides, those formats have a whole lot of other problems. Leave it at: it is how it is. I have to deal with them. :) - Espen Rosenquist, May 09 '17 at 20:16
