I can't find any solution for loading a huge JSON file. I'm trying it with the well-known Yelp dataset. It's 3.2 GB and I want to analyse 9 out of 10 columns. I need to skip the $text
column on import, which should give me a much lighter file to load, probably about 70% smaller. I don't want to manipulate the file itself.
I tried many libraries and got stuck. I've found a solution for a data.frame using the pipe function:
df <- read.table(pipe("cut -f1,5,28 myFile.txt"))
from this thread: Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?)
How can I do this for JSON? I'd like to do something like:
json <- read.table(pipe("cut -text yelp_academic_dataset_review.json"))
but this of course throws an error due to the wrong format. Is there any possibility of doing this without parsing the whole file with regex?
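For reference, the kind of thing I'm hoping exists is a streaming approach along these lines (an untested sketch, assuming jq is installed and that the file is newline-delimited JSON with one object per line):
library(jsonlite)
# drop the "text" field from every object before it ever reaches R,
# then stream the remaining ndjson into a data frame
reviews <- stream_in(pipe("jq -c 'del(.text)' yelp_academic_dataset_review.json"))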
EDIT
Structure of one row (I can't even count how many rows there are):
{"review_id":"vT2PALXWX794iUOoSnNXSA","user_id":"uKGWRd4fONB1cXXpU73urg","business_id":"D7FK-xpG4LFIxpMauvUStQ","stars":1,"date":"2016-10-31","text":"some long text here","useful":0,"funny":0,"cool":0,"type":"review"}
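For a single row like this, dropping the unwanted field after parsing is easy (a jsonlite sketch; parsing every row this way is exactly what I'd like to avoid for 3.2 GB, but it shows the target shape):
library(jsonlite)
# parse the first line only, drop the heavy text field, keep the other nine columns
row <- fromJSON(readLines("yelp_academic_dataset_review.json", n = 1))
row$text <- NULL
as.data.frame(row, stringsAsFactors = FALSE)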
SECOND EDIT
Finally, I've created a loop to convert all the data into a csv file, omitting the unwanted column. It's slow, but I got 150 MB (zipped) out of 3.2 GB.
# files to process (jfile1 holds the path to the 3.2 GB json file)
filepath <- jfile1
fcsv <- "c:/Users/cp/Documents/R DATA/Yelp/tests/reviews.csv"

# write the header line; colns is the vector of field names, rcols the indices of the columns I keep
write.table(x = paste(colns, collapse = ","), file = fcsv, quote = F, row.names = F, col.names = F)

con <- file(filepath, "r")
while (TRUE) {
  line <- readLines(con, n = 1)
  if (length(line) == 0) {
    break
  }
  # regex process: capture the value of each wanted field, delimited by the name of the next field
  d <- NULL
  for (i in rcols) {
    pst <- paste(".*\"", colns[i], "\":*(.*?) *,\"", colns[i + 1], "\":.*", sep = "")
    w <- sub(pst, "\\1", line)
    d <- cbind(d, noquote(w))
  }
  # save on the fly: append the extracted row to the csv
  write.table(x = paste(d, collapse = ","), file = fcsv, append = T, quote = F, row.names = F, col.names = F)
}
close(con)
It could be saved as json as well. I wonder if this is the most efficient way; the other scripts I tested were slow and often had encoding issues.
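An alternative I may still try is letting jsonlite parse the lines in chunks and writing each chunk out without the text column (a sketch, assuming the same jfile1 and fcsv paths as above; the header would still need to be written once beforehand, as in my loop):
library(jsonlite)
# stream the ndjson file in pages, drop the text column, append each page to the csv
stream_in(file(jfile1), handler = function(df) {
  df$text <- NULL
  write.table(df, file = fcsv, append = TRUE, sep = ",",
              quote = TRUE, row.names = FALSE, col.names = FALSE)
}, pagesize = 10000)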