
Situation: a 1 GB CSV file with 100,000 rows, 4,000 independent numeric variables, and 1 dependent variable. R on a Windows Citrix server with 16 GB of memory.

Problem: it took me 2 hours just to run:

read.table("full_data.csv", header=T, sep",")

and the glm step then crashes; the program is not responding and I have to shut it down in Task Manager.
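
The model call itself is not shown in the post; as a rough sketch of the described workflow (the dependent-variable name `y`, the `binomial` family, and the `y ~ .` formula are assumptions), the fit was presumably something like:

```r
full_data <- read.table("full_data.csv", header = TRUE, sep = ",")

# y ~ . expands to a model matrix of roughly 100,000 x 4,000 doubles
# (about 3 GB), and glm()'s IWLS iterations keep several working copies,
# which can easily exhaust 16 GB of RAM
fit <- glm(y ~ ., data = full_data, family = binomial())
```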

Andre Silva
TongZZZ
    `?read.table` tells you how to make it run faster. "Process crashes" and "program not responding" are different things; which is it? How long did you wait before killing R via Task Manager? – Joshua Ulrich Jul 09 '12 at 18:32
  • Maybe try [biglm](http://cran.at.r-project.org/web/packages/biglm/index.html)? – joran Jul 09 '12 at 18:36
  • The program is not responding; I waited 10 minutes before killing it. – TongZZZ Jul 09 '12 at 18:54
  • does biglm support logistic regression? – TongZZZ Jul 09 '12 at 18:56
  • Or can I read in rows 1 to 1000, then 1001 to 2000, and so on? Would that make the read-in faster? – TongZZZ Jul 09 '12 at 18:58
  • Thanks Joshua, it seems the `colClasses` option could save some type-conversion time; is that all I can do? – TongZZZ Jul 09 '12 at 19:03
  • Save the data after reading it in to prevent having to read it in again if R crashes. Re. biglm, read the documentation. – Hansi Jul 09 '12 at 22:01
  • about the slow reading. Is the file you are reading on the local drive or on another network/server (not unlikely since you mention you are already working on a Citrix server)? If the latter, can you first create a local copy? – flodel Jul 10 '12 at 01:15
  • general hints: `data.table::fread`; `biglm` package includes `bigglm`. – Ben Bolker Nov 21 '14 at 13:54
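
Pulling the comment suggestions together (`data.table::fread` for the read, saving an intermediate binary copy, and `biglm::bigglm` for chunked fitting, which does accept `family = binomial()` for logistic regression), a minimal sketch might look like the following; the dependent-variable name `y` and the chunk size are assumptions:

```r
library(data.table)  # fread() reads large CSVs far faster than read.table()
library(biglm)       # bigglm() fits GLMs in chunks

# Fast read; fread() guesses column types, or pass colClasses= explicitly
full_data <- fread("full_data.csv")
setDF(full_data)                 # convert to a plain data.frame in place

# Save a binary copy so a crash doesn't force another slow CSV read
saveRDS(full_data, "full_data.rds")

# Build y ~ x1 + x2 + ... from the column names ("y" is an assumed name)
form <- reformulate(setdiff(names(full_data), "y"), response = "y")

# bigglm() works through the data in chunks, so the full model matrix
# never has to sit in memory at once
fit <- bigglm(form, data = full_data, family = binomial(), chunksize = 5000)
summary(fit)
```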

1 Answer

I often resort to the sqldf package to load large .csv files into memory. A good pointer is here.
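
For instance, a minimal sketch with `read.csv.sql` (assuming a header row and comma separators; the file is loaded through a temporary SQLite database rather than parsed by `read.table`):

```r
library(sqldf)

# Reads the CSV via SQLite, which is usually faster and lighter on memory
# than read.table() for files of this size
full_data <- read.csv.sql("full_data.csv", sql = "select * from file",
                          header = TRUE, sep = ",")
```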

Ryogi