Similar to How can you read a CSV file in R with different number of columns, I have some complex CSV-files. Mine are from SAP BusinessObjects and hold challenges different to those of the quoted question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many CSV-files, but let's start with one of them.
Given: One CSV file containing several flat tables.
Wanted: Several dataframes or other structure holding all data (S4?)
The method so far:
- get line numbers of header data by counting number of columns
- get headers by reading every line index held in vector described above
- read data by calculating skip and nrows between data sets in index described by header lines as above.
- give the read data column names from read header
I need help getting me on the right track to avoid loops/making the code more readable/compact when reading headers and datasets.
These CSVs are formatted as normal CSVs, only that they contain an more or less arbitrary amount of subtables. For each dataset I export, the structure is different. In the current example I will suppose there are five tables included in the CSV.
In order to give you an idea, here is some fictous sample data with line numbers. Separator and quote has been stripped:
1: n, Name, Species, Description, Classification
2: 90, Mickey, Mouse, Big ears, rat
3: 45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
32212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14
There are a number of ways going about reading each data set. What I have come up with so far:
filepath <- file.path(paste0(Sys.getenv("USERPROFILE"), "\\SAMPLE.CSV)
# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")
# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)
# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
With this, I can do things like:
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5]-iHeaders[4]-1)
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] + 1, nrows = iHeaders[5]-iHeaders[4]-1)
names(data) <- header
As in the intro of this post, I have made a couple of functions which makes it easier to get headers for each dataset:
Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]
I have two functions now - one is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe headernames (no æøå % etc).
The other returns a list of string vectors holding all headers:
GetHeaders <- function(filepath, linenums) {
# init an empty list of length(linenums)
l.headers <- vector(mode = "list", length = length(linenums))
for(i in seq_along(linenums)) {
# read.csv2(filepath, skip = linenums[i]-1, nrows = 1)
l.headers[[i]] <- GetHeader(filepath, linenums[i])
}
l.headers
}
What I struggle with is how to read in all possible datasets in one go. Specifically the last set is a bit hard to wrap my head around if I should write a common function, where I only know the line number of header, and not the number of lines in the following data.
Also, what is the best data structure for such a structure as described? The data in the subtables are all relevant to each other (can be used to normalize parts of the data). I understand that I must do manual work for each read CSV, but as I have to read TONS of these files, some common functions to structure them in a predictable manner at each pass would be excellent.
Before you answer, please keep in mind that, no, using a different export format is not an option.
Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all possible solutions in this particular domain.