
I have a dataset stored in a text file with 997 columns and 45,000 rows. All values are doubles except the row names and column names. I used RStudio with the read.table command to read the file, but it seemed to take hours, so I aborted it.

Even Excel opens the same file in about 2 minutes.

RStudio seems inefficient at this task. Any suggestions on how to make it faster? I don't want to re-read the data file every time.

I plan to load it once and store it in an RData object, which should make loading faster in the future. But the first load is not working.

I am not a computer science graduate, so any kind of help will be much appreciated.
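A minimal sketch of the load-once, reload-later workflow the question describes, using saveRDS()/readRDS(); the file names are placeholders, and the separator/header/row-name settings are assumptions about the file's layout:

# One-time (slow) import; adjust sep, header and row.names to match the real file
big <- read.table("yourfile.txt", header = TRUE, sep = "\t", row.names = 1)

# Cache the parsed object to disk once...
saveRDS(big, "yourfile.rds")

# ...then later sessions can reload it in seconds
big <- readRDS("yourfile.rds")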

ToBeGeek
  • http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r – Roland Feb 09 '14 at 15:17
  • RStudio is just an IDE that uses R functions to do its work; in this case it reads files with the `read.table` family of functions. Give `fread(your_file_path)` from the `data.table` package a try (very fast, but it doesn't work in every case). – agstudy Feb 09 '14 at 15:17
  • `fread` is great for large files, but your file isn't that large; `read.table` should work. Have you tried `stringsAsFactors = FALSE`? Take a look at the help page for the other arguments: http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html – marbel Feb 09 '14 at 16:09
  • Ignoring the load times, why is the first read not working? If you want to load only part of the file, try `read.csv.sql` from the `sqldf` package (see the sketch below). – Nishanth Feb 09 '14 at 16:42
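For the selective-load idea in the last comment, a minimal sketch using `read.csv.sql` from the `sqldf` package; the file name, separator, and row limit are placeholders rather than details from the question:

library(sqldf)

# Pull only part of the file straight from disk via SQLite;
# adjust sep to match the actual delimiter (tab assumed here)
subset_rows <- read.csv.sql("yourfile.txt",
                            sql = "select * from file limit 1000",
                            header = TRUE, sep = "\t")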

2 Answers


I recommend `data.table`, although you will end up with a data.table object rather than a plain data frame. If you prefer not to work with a data.table, you can simply convert it back to a normal data frame afterwards.

library(data.table)
data <- fread("yourpathhere/yourfile")  # fread auto-detects the separator and is much faster than read.table
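If you do want a plain data frame afterwards, the conversion mentioned above is a one-liner (object name taken from the snippet):

data <- as.data.frame(data)  # or data.table::setDF(data) to convert in place without copying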
ProbablePattern

As documented in the `?read.table` help file, there are three arguments that can dramatically speed up and/or reduce the memory required to import data: `colClasses`, `nrows`, and `comment.char`. First, telling `read.table` what kind of data each column contains (`colClasses`) avoids the overhead of guessing the type of every column. Second, telling it how many rows the file has (`nrows`) avoids allocating more memory than is actually required. Finally, if the file contains no comments, setting `comment.char = ""` saves R from scanning for them. Using these techniques I was able to read a .csv file with 997 columns and 45,000 rows in under two minutes on a laptop with relatively modest hardware:

# Simulate a file of the same size (997 numeric columns, 45,000 rows)
tmp <- data.frame(matrix(rnorm(997 * 45000), ncol = 997))
write.csv(tmp, "tmp.csv", row.names = FALSE)

# Import with column types declared and comment scanning disabled
system.time(x <- read.csv("tmp.csv", colClasses = "numeric", comment.char = ""))
#   user  system elapsed 
# 115.253   2.574 118.471
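For completeness, a sketch of the same call with the third argument, `nrows`, supplied as well; this uses the row count from the question and is not the call that produced the timing above:

x <- read.csv("tmp.csv", colClasses = "numeric", nrows = 45000, comment.char = "")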

I tried reading the file using the default read.csv arguments, but gave up after 30 minutes or so.

Ista