
I have a directory of tab-separated log files with varying numbers of columns, and I am trying to load them into R.

Dir:
File1 (col1,col2,col3)
File2 (col3,col4,col5,col6,col7)
File3 (col1,col8,col9,col10)

To do this, I concatenated all the files in the directory into a single file: all_files.tsv

When I tried to load it in R, as expected, I got an error message:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 436 did not have 12 elements

The code I am using is:

data <- read.table("all_vid_logs.tsv",
                   header = FALSE,
                   sep = "\t")   # "\t" for tab-delimited files

So, my question is: what is the best way to load all these files into a single data frame in R?

The output I am expecting is a single flat structure with all the columns.

  • Apparently they have different line lengths (nrows). Tell us nrows as well as ncols for each file. – smci Apr 08 '15 at 00:10
  • Would guess you concatenated it wrongly or R is not handling appropriately. Why don't you do as @smci suggests? Something like `files <- list.files(".", pattern="*.tsv", full.names=T)` and then just lapply an appropiate read.table for your files and rbind.fill them? – animalito Apr 08 '15 at 00:16
  • 1
    Have you tried the `fill = TRUE` argument in `read.table()` ? – tospig Apr 08 '15 at 00:23
  • **Tell us nrows as well as ncols for each file?** – smci Apr 08 '15 at 00:25
  • [This answer](http://stackoverflow.com/a/1874563/4002530) may also be of use. – tospig Apr 08 '15 at 00:41
  • I think you need to consider merging after reading them separately. – IRTFM Apr 08 '15 at 01:29
  • @smci: nrows for each is very small...3 or 4 - maybe a max of 10. ncols varies based on one of the attributes; max ncols for a specific type could be 15. Again, a small dataset - 1000 files - just varying lengths, that's all. – BRZ Apr 08 '15 at 13:02

1 Answer


Apparently the files have different dimensions, so read.table()/read.csv() might not be able to read your concatenated file directly.

So read them in separately into individual dataframes. Then figure out what join operation you need to do, with NA-filling.

df1 <- read.csv(file1, ...)
df2 <- read.csv(file2, ...)
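With 1000+ files, a loop over individual data frames is still workable. Here is a minimal sketch of that approach in base R: list the files, read each one separately, pad each frame with NA for any columns it lacks, and stack them. The demo directory and file contents below are placeholders I made up to make the example self-contained, and I am assuming each file starts with a header row naming its columns; if your files have no headers, you would need some other way to know which columns each file contains.

```r
# Demo data: two small tab-separated files with different columns
# (placeholder contents; in practice, point list.files() at your log directory)
dir.create(tmp <- tempfile())
write.table(data.frame(col1 = 1, col2 = 2, col3 = 3),
            file.path(tmp, "file1.tsv"), sep = "\t", row.names = FALSE)
write.table(data.frame(col3 = 4, col4 = 5),
            file.path(tmp, "file2.tsv"), sep = "\t", row.names = FALSE)

# Read each file into its own data frame
files <- list.files(tmp, pattern = "\\.tsv$", full.names = TRUE)
dfs <- lapply(files, read.table, header = TRUE, sep = "\t")

# Union of all column names across files
all_cols <- unique(unlist(lapply(dfs, names)))

# Pad each frame with NA columns it is missing, in a consistent order
padded <- lapply(dfs, function(d) {
  d[setdiff(all_cols, names(d))] <- NA
  d[all_cols]
})

# Stack into one flat data frame
combined <- do.call(rbind, padded)
```

If you have plyr or dplyr installed, `plyr::rbind.fill(dfs)` or `dplyr::bind_rows(dfs)` does the NA-padding and stacking in one call, as suggested in the comments above.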
  • Hello, thanks for your response. I was hoping that using the fill option would work. Tried that but it didn't... I'm a bit surprised that it errored out. – BRZ Apr 08 '15 at 00:17
  • 1
    It's **fill=TRUE**, not NA. But it behaves differently with `read.csv()` (adds an NA column) than `read.table()` (leaves the column ragged). – smci Apr 08 '15 at 00:30
  • Hello! `fill = TRUE` seemed to have worked. Thanks for that... Given that the directory has 1000+ files, is it practical to create a data frame for each and then merge? What is the optimal approach in this case? Any idea? – BRZ Apr 08 '15 at 13:02
  • Actually, it didn't work. My bad. All the columns are shifted and overlap each other... – BRZ Apr 08 '15 at 13:12