My R application reads input data from large txt files. It does not read the entire file in one shot: users specify gene names (3 or 4 at a time), and based on that input the app goes to the appropriate rows and reads the data.
File format: 32,000 rows (one gene per row; the first two columns hold the gene name and other metadata) and 35,000 columns of numerical data (decimal numbers).
I use read.table(filename, skip = 10000) etc. to get to the right row, then read the 35,000 columns of data. I repeat this for the 2nd and 3rd gene (up to 4 genes max) and then process the numerical results.
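A minimal, self-contained sketch of this skip-based approach (the file contents, file name, and row numbers here are made up for illustration; the real file has 32,000 rows and ~35,002 columns):

```r
# Build a tiny stand-in for the real gene file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("g1 info1 0.1 0.2",
             "g2 info2 0.3 0.4",
             "g3 info3 0.5 0.6"), tmp)

read_gene_row <- function(file, row) {
  # Skip past the preceding rows, then read exactly one row.
  # read.table still has to scan every skipped line from the top of
  # the file, which is why each per-gene read is slow.
  read.table(file, skip = row - 1, nrows = 1, stringsAsFactors = FALSE)
}

g2 <- read_gene_row(tmp, 2)
```

Each gene lookup re-scans the file from the beginning, so four lookups pay the scanning cost four times.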
The file-reading operations take about 1.5 to 2.0 minutes. I am experimenting with reading the entire file once and then extracting the data for the desired genes.
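The whole-file alternative can be sketched like this: pay the full read cost once, then subset by gene name in memory (file contents and names are illustrative):

```r
# Tiny stand-in file, as above.
tmp <- tempfile(fileext = ".txt")
writeLines(c("g1 info1 0.1 0.2",
             "g2 info2 0.3 0.4",
             "g3 info3 0.5 0.6"), tmp)

# Read everything once; use the first column (gene names) as row names
# so later lookups are simple in-memory subsetting.
all_data <- read.table(tmp, stringsAsFactors = FALSE)
rownames(all_data) <- all_data$V1

genes_of_interest <- c("g1", "g3")
subset_data <- all_data[genes_of_interest, ]
```

Whether this wins depends on how often the app is run against the same file: the one-time read is expensive, but every subsequent lookup is nearly free.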
Is there any way to accelerate this? I can rewrite the gene data in another format (as a one-time conversion) if that will speed up future reads.
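As one example of such a one-time conversion (not necessarily the best option), the numeric data could be stored as a matrix in R's binary RDS format, which later sessions can reload much faster than re-parsing text; names and dimensions here are illustrative:

```r
# Hypothetical one-time conversion: a small numeric matrix with gene
# names as row names stands in for the real 32,000 x 35,000 data.
mat <- matrix(c(0.1, 0.3, 0.5, 0.2, 0.4, 0.6), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), NULL))

rds <- tempfile(fileext = ".rds")
saveRDS(mat, rds)        # one-time conversion to a binary file

mat2 <- readRDS(rds)     # fast reload in later runs
rows <- mat2[c("g1", "g3"), ]  # in-memory lookup by gene name
```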