
I have a directory of tab-separated log files with varying numbers of columns, and I am trying to load them into R.

Dir:
File1 (col1,col2,col3)
File2 (col3,col4,col5,col6,col7)
File3 (col1,col8,col9,col10)

To do this, I concatenated all the files in the directory into a single file: all_files.tsv

When I tried to load it in R, as expected, I got an error message:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 436 did not have 12 elements

The code I am using is:

data <- read.table("all_vid_logs.tsv",
                   header = FALSE,
                   sep = "\t")   # "\t" for tab-delimited files

So, my question is: what is the best way to load all these files into a single data frame in R?

The output I am expecting is a single flat structure with all the columns.

  • Apparently they have different line lengths (nrows). Tell us nrows as well as ncols for each file. – smci Apr 08 '15 at 00:10
  • Would guess you concatenated it wrongly or R is not handling appropriately. Why don't you do as @smci suggests? Something like `files <- list.files(".", pattern="*.tsv", full.names=T)` and then just lapply an appropiate read.table for your files and rbind.fill them? – animalito Apr 08 '15 at 00:16
  • 1
    Have you tried the `fill = TRUE` argument in `read.table()` ? – tospig Apr 08 '15 at 00:23
  • **Tell us nrows as well as ncols for each file?** – smci Apr 08 '15 at 00:25
  • [This answer](http://stackoverflow.com/a/1874563/4002530) may also be of use. – tospig Apr 08 '15 at 00:41
  • I think you need to consider merging after reading them separately. – IRTFM Apr 08 '15 at 01:29
  • @smci: nrows for each is very small...3 or 4 - maybe a max of 10. ncols varies based on one of the attributes; max ncols for a specific type could be 15. Again, a small dataset - 1000 files - just varying lengths, that's all. – BRZ Apr 08 '15 at 13:02

1 Answer


Apparently the files have different dimensions, so read.table()/read.csv() might not be able to read your concatenated file directly.

So read them in separately into individual dataframes. Then figure out what join operation you need to do, with NA-filling.

df1 <- read.csv(file1, ...)
df2 <- read.csv(file2, ...)
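With 1000+ files, a loop over individual data frames is still workable. Here is a minimal sketch of that approach in base R: list the files, read each one separately, pad each frame with NA for any columns it lacks, and stack them. The demo directory and file contents below are placeholders I made up to make the example self-contained, and I am assuming each file starts with a header row naming its columns; if your files have no headers, you would need some other way to know which columns each file contains.

```r
# Demo data: two small tab-separated files with different columns
# (placeholder contents; in practice, point list.files() at your log directory)
dir.create(tmp <- tempfile())
write.table(data.frame(col1 = 1, col2 = 2, col3 = 3),
            file.path(tmp, "file1.tsv"), sep = "\t", row.names = FALSE)
write.table(data.frame(col3 = 4, col4 = 5),
            file.path(tmp, "file2.tsv"), sep = "\t", row.names = FALSE)

# Read each file into its own data frame
files <- list.files(tmp, pattern = "\\.tsv$", full.names = TRUE)
dfs <- lapply(files, read.table, header = TRUE, sep = "\t")

# Union of all column names across files
all_cols <- unique(unlist(lapply(dfs, names)))

# Pad each frame with NA columns it is missing, in a consistent order
padded <- lapply(dfs, function(d) {
  d[setdiff(all_cols, names(d))] <- NA
  d[all_cols]
})

# Stack into one flat data frame
combined <- do.call(rbind, padded)
```

If you have plyr or dplyr installed, `plyr::rbind.fill(dfs)` or `dplyr::bind_rows(dfs)` does the NA-padding and stacking in one call, as suggested in the comments above.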
  • Hello, thanks for your response. I was hoping that using the fill option would work. Tried that but it didn't... I'm a bit surprised that it errored out. – BRZ Apr 08 '15 at 00:17
  • 1
    It's **fill=TRUE**, not NA. But it behaves differently with `read.csv()` (adds an NA column) than `read.table()` (leaves the column ragged). – smci Apr 08 '15 at 00:30
  • Hello! `fill = TRUE` seemed to have worked. Thanks for that... Given that the directory has 1000+ files, is it practical to create a data frame for each and then merge? What is the optimal approach in this case? Any idea? – BRZ Apr 08 '15 at 13:02
  • Actually, it didn't work. My bad. All the columns are shifted and overlap each other... – BRZ Apr 08 '15 at 13:12