
I am attempting to use readLines to import a 17.6 GB CSV file into R. I have tried several approaches discussed here, here, here, and elsewhere, and readLines seems to be the only approach that can at least get the data into R.

The problem is that I am unable to convert the output from readLines into a data frame that I can use in my analysis. The answers to a related question here have not helped me solve the problem.

Here is my sample data:

write.csv(data.frame(myid=1:10,var=runif(10)),"temp.csv")

dt<-data.frame(myid=1:10,var=runif(10))
dt

myid       var
1     1 0.5949020
2     2 0.8515591
3     3 0.8139010
4     4 0.3804234
5     5 0.4923082
6     6 0.9933775
7     7 0.1740895
8     8 0.8342808
9     9 0.3958154
10   10 0.9690561

Creating chunks:

file_in <- file("temp.csv", "r")
chunk_size <- 100000 # choose the best size for you
x <- readLines(file_in, n = chunk_size)
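For a file this large, one option is to loop over the connection in chunks, parse each chunk, and bind the pieces by row. Here is a minimal sketch (my own, not part of the question, built on the sample file above; the quote-stripping step assumes the real file is quoted the same way as the sample):

```r
# Recreate the sample file from the question
write.csv(data.frame(myid = 1:10, var = runif(10)), "temp.csv")

file_in <- file("temp.csv", "r")
header <- readLines(file_in, n = 1)          # consume the header line once
chunk_size <- 100000                         # tune for your RAM
pieces <- list()
i <- 1
repeat {
  x <- readLines(file_in, n = chunk_size)
  if (length(x) == 0) break                  # end of file
  # split each line on commas, strip quotes, drop the row-name column
  m <- do.call(rbind, strsplit(gsub('"', '', x, fixed = TRUE), ","))
  pieces[[i]] <- as.data.frame(m[, -1, drop = FALSE], stringsAsFactors = FALSE)
  i <- i + 1
}
close(file_in)

df <- do.call(rbind, pieces)                 # merge the chunks by row
names(df) <- gsub('"', '', strsplit(header, ",")[[1]][-1], fixed = TRUE)
df[] <- lapply(df, as.numeric)
```

Because each chunk is parsed and reduced before the next one is read, peak memory stays close to one chunk plus the accumulated result.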

Opening the output from readLines in R:

View(x)
x
 [1] "\"\",\"myid\",\"var\""      
 [2] "\"1\",1,0.594902001088485"  
 [3] "\"2\",2,0.851559089729562"  
 [4] "\"3\",3,0.81390100880526"   
 [5] "\"4\",4,0.380423351423815"  
 [6] "\"5\",5,0.492308202432469"  
 [7] "\"6\",6,0.993377464590594"  
 [8] "\"7\",7,0.174089450156316"  
 [9] "\"8\",8,0.834280799608678"  
[10] "\"9\",9,0.395815373631194"  
[11] "\"10\",10,0.969056134112179"

Thanks in advance for any help

Krantz
  • I have >50,000,000 rows. Is it possible to have a function that implements your suggested idea and then merges the resulting chunks by row? Thanks in advance for any help. – Krantz Mar 10 '19 at 14:00
  • Have you tried `data.table::fread`? – pogibas Mar 10 '19 at 14:03
  • Yes. I have tried `data.table::fread`. No success. Also, have tried: `bigmemory::read.big.matrix`, and `sqldf::read.csv.sql`. All of them return: `Error: cannot allocate vector of size XX Mb`. – Krantz Mar 10 '19 at 14:05
  • How much RAM do you have? – Sonny Mar 10 '19 at 14:05
  • Please check this question https://stackoverflow.com/questions/5171593/r-memory-management-cannot-allocate-vector-of-size-n-mb – pogibas Mar 10 '19 at 14:06
  • I have 16GB of RAM. – Krantz Mar 10 '19 at 14:06
  • I would go for some command line pre-processing like `sed` or `awk` to clean up input, then fread. – jangorecki Mar 11 '19 at 06:07
  • instead of `View(x)` use `head tmp.csv` from command line so we have clear picture what the csv is like, and not how it is interpreted by R function `View` – jangorecki Mar 11 '19 at 06:10

2 Answers


Here is a complete sequence of instructions to transform the data you posted into a data frame.

set.seed(1234)    # Make the results reproducible

write.csv(data.frame(myid=1:10,var=runif(10)),"temp.csv")

dat <- readLines("temp.csv")
df1 <- strsplit(dat[-1], ",")
df1 <- do.call(rbind, df1)
df1 <- df1[,-1]
df1 <- as.data.frame(df1)
df1[] <- lapply(df1, function(x) as.numeric(as.character(x)))

names(df1) <- gsub('"', '', strsplit(dat[1], ',')[[1]][-1], fixed = TRUE)
df1
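As a quick sanity check (my addition, not part of the original answer), the parsed result can be compared against what `read.csv` produces on the same small sample file:

```r
set.seed(1234)    # Make the results reproducible
write.csv(data.frame(myid = 1:10, var = runif(10)), "temp.csv")

# Same parsing steps as in the answer above
dat <- readLines("temp.csv")
df1 <- as.data.frame(do.call(rbind, strsplit(dat[-1], ","))[, -1])
df1[] <- lapply(df1, function(x) as.numeric(as.character(x)))
names(df1) <- gsub('"', '', strsplit(dat[1], ',')[[1]][-1], fixed = TRUE)

# Reference parse of the same file, dropping the row-name column
ref <- read.csv("temp.csv", row.names = 1)
all.equal(df1$var, ref$var)
```

The numeric values round-trip through `write.csv` at full precision, so `all.equal` agrees within its default tolerance.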
Rui Barradas

Given the output you get after readLines, this must be the content of your CSV file:

"","myid","var"
"1","1","0.5949020"
"2","2","0.8515591"
"3","3","0.8139010"
"4","4","0.3804234"
"5","5","0.4923082"
"6","6","0.9933775"
"7","7","0.1740895"
"8","8","0.8342808"
"9","9","0.3958154"
"10","10","0.9690561"

That is, your values are comma separated and enclosed in double quotes. When I read in this file, I get your output:

dat
 [1] "\"\",\"myid\",\"var\""       "\"1\",\"1\",\"0.5949020\""  
 [3] "\"2\",\"2\",\"0.8515591\""   "\"3\",\"3\",\"0.8139010\""  
 [5] "\"4\",\"4\",\"0.3804234\""   "\"5\",\"5\",\"0.4923082\""  
 [7] "\"6\",\"6\",\"0.9933775\""   "\"7\",\"7\",\"0.1740895\""  
 [9] "\"8\",\"8\",\"0.8342808\""   "\"9\",\"9\",\"0.3958154\""  
[11] "\"10\",\"10\",\"0.9690561\""

So what you need to do is

  • split at the commas
    with unlist(strsplit(..., split = ","))

and

  • remove the double quotes (the backslashes are only R's print escaping)
    with gsub("\"", "", ...)

which gives us:

unlist(strsplit(gsub("\"", "", dat), split = ","))

 [1] ""          "myid"      "var"       "1"         "1"         "0.5949020" "2"        
 [8] "2"         "0.8515591" "3"         "3"         "0.8139010" "4"         "4"        
[15] "0.3804234" "5"         "5"         "0.4923082" "6"         "6"         "0.9933775"
[22] "7"         "7"         "0.1740895" "8"         "8"         "0.8342808" "9"        
[29] "9"         "0.3958154" "10"        "10"        "0.9690561"
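The flat vector still needs to be folded back into rows. One way to finish the job (a sketch of my own on top of the answer's output, relying on every line having exactly three fields) is to fill a matrix by row:

```r
# Recreate the sample file and the flattened vector from the answer
write.csv(data.frame(myid = 1:10, var = runif(10)), "temp.csv")
dat <- readLines("temp.csv")
vals <- unlist(strsplit(gsub("\"", "", dat), split = ","))

m <- matrix(vals, ncol = 3, byrow = TRUE)    # 3 fields: row name, myid, var
df <- as.data.frame(m[-1, -1, drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, -1]                        # the first row holds the header
df[] <- lapply(df, as.numeric)
df
```

`byrow = TRUE` is essential here: the vector is in row-major order, while `matrix` fills column by column by default.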
  • When I do: `unlist(strsplit(x, split = "\\s"))` I get: `[1] "\"\",\"myid\",\"var\"" [2] "\"1\",1,0.188206466846168" [3] "\"2\",2,0.749221669742838" [4] "\"3\",3,0.308469610055909" [5] "\"4\",4,0.379058901453391" [6] "\"5\",5,0.750765843084082" [7] "\"6\",6,0.70420167548582" [8] "\"7\",7,0.920593033544719" [9] "\"8\",8,0.795277312630787" [10] "\"9\",9,0.829076963244006" [11] "\"10\",10,0.167547662509605"`, which is not the desired output. – Krantz Mar 10 '19 at 14:13
  • The desired output is: `myid var 1 1 0.5949020 2 2 0.8515591 3 3 0.8139010 4 4 0.3804234 5 5 0.4923082 6 6 0.9933775 7 7 0.1740895 8 8 0.8342808 9 9 0.3958154 10 10 0.9690561` – Krantz Mar 10 '19 at 14:14
  • Don't `unlist`. Run `do.call(rbind, unlist(strsplit(dat, split = "\\s")))`. – Rui Barradas Mar 10 '19 at 14:19
  • When I do: `do.call(rbind, unlist(strsplit(x, split = "\\s")))` I get: `Error in do.call(rbind, unlist(strsplit(x, split = "\\s"))) : second argument must be a list`. – Krantz Mar 10 '19 at 14:21