I can't find any solution for loading a huge JSON file. I'm trying with the well-known Yelp dataset. It's 3.2 GB and I want to analyse 9 out of 10 columns. I need to skip importing the `$text` column, which should give me a much lighter file to load, probably about 70% smaller. I don't want to manipulate the file itself.

I tried many libraries and got stuck. I've found a solution for a data.frame that uses a pipe:

df <- read.table(pipe("cut -f1,5,28 myFile.txt"))

from this thread: Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?)

How can I do the same for JSON? I'd like to do:

json <- read.table(pipe("cut -text yelp_academic_dataset_review.json"))

but this of course throws an error due to the wrong format. Are there any possibilities without parsing the whole file with regex?

EDIT

Structure of one row (I can't even count how many rows there are):

{"review_id":"vT2PALXWX794iUOoSnNXSA","user_id":"uKGWRd4fONB1cXXpU73urg","business_id":"D7FK-xpG4LFIxpMauvUStQ","stars":1,"date":"2016-10-31","text":"some long text here","useful":0,"funny":0,"cool":0,"type":"review"}

SECOND EDIT

Finally, I've created a loop that converts all the data into a CSV file, omitting the unwanted column. It's slow, but I've got 150 MB (zipped) from 3.2 GB.

# files to process; jfile1 holds the input path, colns the column names
# in file order, and rcols the indices of the columns to keep
filepath <- jfile1
fcsv <- "c:/Users/cp/Documents/R DATA/Yelp/tests/reviews.csv"

# write the CSV header first
write.table(x = paste(colns, collapse = ","), file = fcsv,
            quote = FALSE, row.names = FALSE, col.names = FALSE)

con <- file(filepath, "r")

while (TRUE) {
  # read one JSON record (one object per line) at a time
  line <- readLines(con, n = 1)
  if (length(line) == 0) {
    break
  }
  # regex process: pull each wanted value out as the text between
  # the "colns[i]": key and the following "colns[i+1]": key
  d <- NULL
  for (i in rcols) {
    pst <- paste(".*\"", colns[i], "\":*(.*?) *,\"", colns[i + 1], "\":.*", sep = "")
    w <- sub(pst, "\\1", line)
    d <- cbind(d, noquote(w))
  }
  # save on the fly: append the extracted row to the CSV
  write.table(x = paste(d, collapse = ","), file = fcsv, append = TRUE,
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}
close(con)

It can also be saved as JSON. I wonder if this is the most efficient way; other scripts I tested were slow and often had encoding issues.
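For comparison, here is a minimal streaming sketch with jsonlite's `stream_in()`. It assumes the file is NDJSON (one JSON object per line, like the sample row above); the output path and page size are placeholders:

library(jsonlite)

out   <- "reviews.csv"   # placeholder output path
first <- TRUE            # write the CSV header only once

# stream_in() parses the file page by page, so the full 3.2 GB
# never has to sit in memory at once
stream_in(file("yelp_academic_dataset_review.json"),
          handler = function(df) {
            df$text <- NULL          # drop the heavy column immediately
            write.table(df, out, sep = ",", append = !first,
                        col.names = first, row.names = FALSE)
            first <<- FALSE
          },
          pagesize = 5000)

This avoids the hand-written regex parsing, though I haven't timed it against the loop above.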

  • How does the json look? Is it pretty-printed, with indents and line breaks for each key/value pair, or all one long line? – Parfait Jan 26 '17 at 01:01

1 Answer


Try this:

library(jsonlite)

df <- as.data.frame(fromJSON('yelp_academic_dataset_review.json', flatten=TRUE))

Then, once it is a dataframe, delete the column(s) you don't need.
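For example, assuming the column is called `text` as in the question's sample row:

df$text <- NULL  # drop the unwanted column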

If you don't want to manipulate the file in advance of importing it, I'm not sure what options you have in R. Alternatively, you could make a copy of the file, delete the text column from it with this script, import the copy into R, and then delete the copy.
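One possible way to produce that copy, assuming the `jq` command-line tool is installed and each record sits on its own line (file names are placeholders):

# del(.text) removes the "text" field from every record;
# -c keeps one compact object per output line
system("jq -c 'del(.text)' yelp_academic_dataset_review.json > reviews_notext.json")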

  • Thanks, but the flattened version does the same, because the structure is already flat. The alternative is good, but it's just that: an alternative. Maybe it's the only solution. – Peter.k Jan 26 '17 at 00:51
  • Maybe another one exists. Is it taking that long to load the file? You could just do it once, then delete the column, then write a new json with `toJSON` for future use. – iskandarblue Jan 26 '17 at 00:57
  • It loads forever; there's no way to see results. And worse, it's not the biggest file I want to load, so I'm now thinking about writing my own json package to read and manipulate the data in some efficient way during processing. Of course, if I give up, I'll reformat the file. – Peter.k Jan 26 '17 at 01:07
  • I don't really want to downvote the answer, but given a json with 1000 fields of which only 2 are interesting, your solution is inefficient at best. – Ufos Aug 02 '18 at 15:55