
I am attempting to use readLines to import a 17.6 GB CSV file into R. I have tried several approaches discussed here, here, here, and elsewhere, and readLines seems to be the only approach that can at least get the data into R.

The problem is that I am unable to convert the output from readLines into a data frame that I can use in my analysis. The answers to a related question here have not helped me solve the problem.

Here is my sample data:

write.csv(data.frame(myid=1:10,var=runif(10)),"temp.csv")

dt<-data.frame(myid=1:10,var=runif(10))
dt

myid       var
1     1 0.5949020
2     2 0.8515591
3     3 0.8139010
4     4 0.3804234
5     5 0.4923082
6     6 0.9933775
7     7 0.1740895
8     8 0.8342808
9     9 0.3958154
10   10 0.9690561

Creating chunks:

file_in <- file("temp.csv", "r")
chunk_size <- 100000 # choose the best size for you
x <- readLines(file_in, n = chunk_size)
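For a file this large, one option is to loop over the connection in chunks, parse each chunk, and bind the pieces by row. Here is a minimal sketch (my own, not part of the question, built on the sample file above; the quote-stripping step assumes the real file is quoted the same way as the sample):

```r
# Recreate the sample file from the question
write.csv(data.frame(myid = 1:10, var = runif(10)), "temp.csv")

file_in <- file("temp.csv", "r")
header <- readLines(file_in, n = 1)          # consume the header line once
chunk_size <- 100000                         # tune for your RAM
pieces <- list()
i <- 1
repeat {
  x <- readLines(file_in, n = chunk_size)
  if (length(x) == 0) break                  # end of file
  # split each line on commas, strip quotes, drop the row-name column
  m <- do.call(rbind, strsplit(gsub('"', '', x, fixed = TRUE), ","))
  pieces[[i]] <- as.data.frame(m[, -1, drop = FALSE], stringsAsFactors = FALSE)
  i <- i + 1
}
close(file_in)

df <- do.call(rbind, pieces)                 # merge the chunks by row
names(df) <- gsub('"', '', strsplit(header, ",")[[1]][-1], fixed = TRUE)
df[] <- lapply(df, as.numeric)
```

Because each chunk is parsed and reduced before the next one is read, peak memory stays close to one chunk plus the accumulated result.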

Opening the output from readLines in R:

View(x)
x
 [1] "\"\",\"myid\",\"var\""      
 [2] "\"1\",1,0.594902001088485"  
 [3] "\"2\",2,0.851559089729562"  
 [4] "\"3\",3,0.81390100880526"   
 [5] "\"4\",4,0.380423351423815"  
 [6] "\"5\",5,0.492308202432469"  
 [7] "\"6\",6,0.993377464590594"  
 [8] "\"7\",7,0.174089450156316"  
 [9] "\"8\",8,0.834280799608678"  
[10] "\"9\",9,0.395815373631194"  
[11] "\"10\",10,0.969056134112179"

Thanks in advance for any help

Krantz
  • I have >50,000,000 rows. Is it possible to have a function that implements your suggested idea and then merges the resulting chunks by row? Thanks in advance for any help. – Krantz Mar 10 '19 at 14:00
  • Have you tried `data.table::fread`? – pogibas Mar 10 '19 at 14:03
  • Yes. I have tried `data.table::fread`. No success. Also, have tried: `bigmemory::read.big.matrix`, and `sqldf::read.csv.sql`. All of them return: `Error: cannot allocate vector of size XX Mb`. – Krantz Mar 10 '19 at 14:05
  • How much RAM do you have? – Sonny Mar 10 '19 at 14:05
  • Please check this question https://stackoverflow.com/questions/5171593/r-memory-management-cannot-allocate-vector-of-size-n-mb – pogibas Mar 10 '19 at 14:06
  • I have 16GB of RAM. – Krantz Mar 10 '19 at 14:06
  • I would go for some command line pre-processing like `sed` or `awk` to clean up input, then fread. – jangorecki Mar 11 '19 at 06:07
  • instead of `View(x)` use `head tmp.csv` from command line so we have clear picture what the csv is like, and not how it is interpreted by R function `View` – jangorecki Mar 11 '19 at 06:10

2 Answers


Here is a complete sequence of instructions to transform the data you posted into a data frame.

set.seed(1234)    # Make the results reproducible

write.csv(data.frame(myid=1:10,var=runif(10)),"temp.csv")

dat <- readLines("temp.csv")
df1 <- strsplit(dat[-1], ",")
df1 <- do.call(rbind, df1)
df1 <- df1[,-1]
df1 <- as.data.frame(df1)
df1[] <- lapply(df1, function(x) as.numeric(as.character(x)))

names(df1) <- gsub('"', '', strsplit(dat[1], ',')[[1]][-1], fixed = TRUE)
df1
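As a quick sanity check (my addition, not part of the original answer), the parsed result can be compared against what `read.csv` produces on the same small sample file:

```r
set.seed(1234)    # Make the results reproducible
write.csv(data.frame(myid = 1:10, var = runif(10)), "temp.csv")

# Same parsing steps as in the answer above
dat <- readLines("temp.csv")
df1 <- as.data.frame(do.call(rbind, strsplit(dat[-1], ","))[, -1])
df1[] <- lapply(df1, function(x) as.numeric(as.character(x)))
names(df1) <- gsub('"', '', strsplit(dat[1], ',')[[1]][-1], fixed = TRUE)

# Reference parse of the same file, dropping the row-name column
ref <- read.csv("temp.csv", row.names = 1)
all.equal(df1$var, ref$var)
```

The numeric values round-trip through `write.csv` at full precision, so `all.equal` agrees within its default tolerance.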
Rui Barradas

Given the output you get after readLines, this must be the content of your CSV file:

"","myid","var"
"1","1","0.5949020"
"2","2","0.8515591"
"3","3","0.8139010"
"4","4","0.3804234"
"5","5","0.4923082"
"6","6","0.9933775"
"7","7","0.1740895"
"8","8","0.8342808"
"9","9","0.3958154"
"10","10","0.9690561"

That is, your values are comma separated and enclosed in double quotes. When I read in this file, I get your output:

dat
 [1] "\"\",\"myid\",\"var\""       "\"1\",\"1\",\"0.5949020\""  
 [3] "\"2\",\"2\",\"0.8515591\""   "\"3\",\"3\",\"0.8139010\""  
 [5] "\"4\",\"4\",\"0.3804234\""   "\"5\",\"5\",\"0.4923082\""  
 [7] "\"6\",\"6\",\"0.9933775\""   "\"7\",\"7\",\"0.1740895\""  
 [9] "\"8\",\"8\",\"0.8342808\""   "\"9\",\"9\",\"0.3958154\""  
[11] "\"10\",\"10\",\"0.9690561\""

So what you need to do is

  • split at the commas
    with unlist(strsplit(..., split = ","))

and

  • remove the double quotes (the backslashes are only R's print escaping)
    with gsub("\"", "", ...)

which gives us:

unlist(strsplit(gsub("\"", "", dat), split = ","))

 [1] ""          "myid"      "var"       "1"         "1"         "0.5949020" "2"        
 [8] "2"         "0.8515591" "3"         "3"         "0.8139010" "4"         "4"        
[15] "0.3804234" "5"         "5"         "0.4923082" "6"         "6"         "0.9933775"
[22] "7"         "7"         "0.1740895" "8"         "8"         "0.8342808" "9"        
[29] "9"         "0.3958154" "10"        "10"        "0.9690561"
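The flat vector still needs to be folded back into rows. One way to finish the job (a sketch of my own on top of the answer's output, relying on every line having exactly three fields) is to fill a matrix by row:

```r
# Recreate the sample file and the flattened vector from the answer
write.csv(data.frame(myid = 1:10, var = runif(10)), "temp.csv")
dat <- readLines("temp.csv")
vals <- unlist(strsplit(gsub("\"", "", dat), split = ","))

m <- matrix(vals, ncol = 3, byrow = TRUE)    # 3 fields: row name, myid, var
df <- as.data.frame(m[-1, -1, drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, -1]                        # the first row holds the header
df[] <- lapply(df, as.numeric)
df
```

`byrow = TRUE` is essential here: the vector is in row-major order, while `matrix` fills column by column by default.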
  • When I do: `unlist(strsplit(x, split = "\\s"))` I get: `[1] "\"\",\"myid\",\"var\"" [2] "\"1\",1,0.188206466846168" [3] "\"2\",2,0.749221669742838" [4] "\"3\",3,0.308469610055909" [5] "\"4\",4,0.379058901453391" [6] "\"5\",5,0.750765843084082" [7] "\"6\",6,0.70420167548582" [8] "\"7\",7,0.920593033544719" [9] "\"8\",8,0.795277312630787" [10] "\"9\",9,0.829076963244006" [11] "\"10\",10,0.167547662509605"`, which is not the desired output. – Krantz Mar 10 '19 at 14:13
  • The desired output is: `myid var 1 1 0.5949020 2 2 0.8515591 3 3 0.8139010 4 4 0.3804234 5 5 0.4923082 6 6 0.9933775 7 7 0.1740895 8 8 0.8342808 9 9 0.3958154 10 10 0.9690561` – Krantz Mar 10 '19 at 14:14
  • Don't `unlist`. Run `do.call(rbind, unlist(strsplit(dat, split = "\\s")))`. – Rui Barradas Mar 10 '19 at 14:19
  • When I do: `do.call(rbind, unlist(strsplit(x, split = "\\s")))` I get: `Error in do.call(rbind, unlist(strsplit(x, split = "\\s"))) : second argument must be a list`. – Krantz Mar 10 '19 at 14:21