
I am pretty new to R and have been trying to solve one problem for the entire day, so far without success.

I'd like to import a JSON file into R and then process it further, the same way I would after importing a CSV file.

My JSON file has the following structure:

{ "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.
  He is having a wonderful time playing these old hymns. The music is at
  times hard to read because we think the book was published for singing
  from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

I'd like to import the JSON file and end up with a table that consists of 9 columns (reviewerID, asin, reviewerName, etc.).

I tried the R package jsonlite, but I get the following error message:

 data <- fromJSON('reviews_Office_Products.json.gz2')
 Error in feed_push_parser(buf) : parse error: trailing garbage
      "reviewTime": "07 19, 2013"} {"reviewerID": "A3BBNK2R5TUYGV"
                 (right here) ------^

Do you have any idea how I can accomplish this?

Thank you very much in advance.

Best regards Paul

  • Most likely an error in your JSON. Try the `validate` function in the `jsonlite` package to check whether it is valid JSON. It looks like a comma is missing between records, if that printout is correct. – mattdevlin Aug 22 '15 at 17:41
  • I had the same thought. When I apply the `validate` function I get the following error: Error: is.character(txt) is not TRUE. Where exactly do you see that the JSON format is not correct? – Paul Aug 22 '15 at 18:23
  • Check in a text editor if there is a comma between the record with "reviewerID": "A3BBNK2R5TUYGV" and the one before it - the error message in your post suggests there isn't, but that could just be because the message chooses not to display it. – mattdevlin Aug 22 '15 at 18:28
  • I checked the entries in the text editor. There are commas. Do you have any idea how to proceed with this JSON file? Should I convert it to a CSV file in Python and then import that into R? What is the best solution? – Paul Aug 22 '15 at 18:41
  • Take a look at this and see if it's the same issue http://stackoverflow.com/questions/26519455/error-parsing-json-file-with-the-jsonlite-package – mattdevlin Aug 22 '15 at 18:46
  • I already tried this as well. It seems that there is a problem with the comma in the field helpful (e.g. [2, 3]). – Paul Aug 22 '15 at 19:26
  • Try http://jsonlint.com/. It might give you a more helpful error message. – mattdevlin Aug 22 '15 at 19:36
  • Thanks for the link. When I enter one data set the website says that the JSON is valid. When I enter two data sets I get the following error msg: Parse error on line 14: ...me": "07 19, 2013"}{ "reviewerID": ----------------------^ Expecting 'EOF', '}', ',', ']'. – Paul Aug 22 '15 at 19:44
  • Are you 100% sure there is a comma between the curly braces for example `me": "07 19, 2013"}, { "reviewerID":`? Both error messages seem to be pointing towards that - there is no comma between the curlys in the error messages. Also you should check the JSON file as a whole in jsonlint, the segment in your post looks fine so will pass. – mattdevlin Aug 22 '15 at 19:53
  • No, between the curly braces there is NO comma. I can't check the file as a whole because it is larger than 300 MB. – Paul Aug 22 '15 at 19:57
  • I'm almost sure that's your issue. Take a look at these examples http://json.org/example.html. Each one of the `{ blocks }` represents a record, and the records should be separated by commas. I guess they are all enclosed in a pair of square brackets in your file too? Try loading the file as a string and do `gsub("}{", "},{", json_string, fixed = TRUE)` and then try using `validate` or `fromJSON` on that. You can use `readLines` to read in the file. – mattdevlin Aug 22 '15 at 20:01

2 Answers

In the end I did it as follows:

library(rjson)
library(plyr)  # provides ldply()

path <- "reviews_Office_Products.json.gz2"
con <- file(path, "r")
input <- readLines(con, -1L)  # read all lines; each line is parsed as one JSON record
close(con)

# Parse each line, flatten the record to a one-row matrix, and stack the rows
tr.review <- ldply(lapply(input, function(x) t(unlist(fromJSON(x)))))
save(tr.review, file = 'tr.review.rdata')

For my purposes this works, and I can process the data further with the tm package.

Thank you very much for your help. Paul
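Since the approach above parses the file one line at a time, the file appears to hold one JSON object per line. For that layout, `jsonlite::stream_in` may be a simpler alternative. The following is only a minimal sketch under that assumption; the two sample records and the temporary file are made up for illustration and are not taken from the real data set.

```r
library(jsonlite)

# Hypothetical two-record sample in the one-object-per-line layout
sample_path <- tempfile(fileext = ".json")
writeLines(c(
  '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "helpful": [2, 3], "overall": 5.0}',
  '{"reviewerID": "A3BBNK2R5TUYGV", "asin": "0000013714", "helpful": [0, 0], "overall": 4.0}'
), sample_path)

# stream_in() reads the connection line by line and returns a data frame
reviews <- stream_in(file(sample_path), verbose = FALSE)
str(reviews)
```

For the real file, `sample_path` would be replaced by the path to the review file. Whether this copes well with a 300 MB input has not been checked here.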


This works on the sample below. You might need to adjust the regular expression to make it fit your file. Note that R string literals need a double backslash wherever the regex itself needs a single one.

library(rjson)
library(magrittr)
library(dplyr)
library(lubridate)
library(stringi)

options(stringsAsFactors = FALSE)

'{ "reviews": [ { "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.
  He is having a wonderful time playing these old hymns. The music is at
  times hard to read because we think the book was published for singing
  from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
} { "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.
  He is having a wonderful time playing these old hymns. The music is at
  times hard to read because we think the book was published for singing
  from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
} ] }' %>%
  writeLines("reviews_Office_Products.json.gz2")

data =
  "reviews_Office_Products.json.gz2" %>%
  readLines %>%
  # join the lines first so the regex can see newlines between records
  paste(collapse = "\n") %>%
  # insert the missing comma between adjacent records
  stri_replace_all_regex("\\}[ \\n]*\\{", "},{") %>%
  fromJSON %>%
  .[[1]] %>%
  lapply(as.data.frame) %>%
  bind_rows %>%
  select(-unixReviewTime) %>%
  mutate(asin = as.numeric(asin),
         reviewTime = mdy(reviewTime) )

review = 
  data %>%
  select(-helpful) %>%
  distinct

review__helpful =
  data %>%
  select(reviewerID, helpful) %>%
  distinct
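
The comma-inserting regex can also be tried in isolation before running the whole pipeline. The sketch below uses base R `gsub` with `perl = TRUE` instead of stringi, and a made-up two-record string; both are assumptions for illustration only. Note that a pattern like this would also alter a `}{` sequence that happens to occur inside a string value.

```r
# Hypothetical input: two records with no comma between them
raw <- '{"a": 1}\n{"a": 2}'

# Braces are escaped with double backslashes; \\s* spans spaces and newlines
fixed <- gsub("\\}\\s*\\{", "},{", raw, perl = TRUE)

# Wrap in square brackets so the result is a single JSON array
json_array <- paste0("[", fixed, "]")
json_array
```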
  • Thank you very much. I will try it as soon as I have the time. How can I connect this code with my JSON file? As far as I understand it, the code works with the JSON content that you entered above. – Paul Aug 25 '15 at 19:28
  • The first section is just generating a file that I think might be similar to yours. Everything after data = should work for you. Like I said, it might not work, particularly if I guessed the structure of your JSON file incorrectly or if the regex doesn't solve the missing comma issue (it inserts a comma between } and { separated by only spaces and newlines). Particularly, the lapply as.data.frame will not work if you have the potential for vectors in any column besides helpful. I also am hoping that reviewerID is a unique ID such that you can maintain the link between the review and helpful. – bramtayl Aug 25 '15 at 19:57
  • I am getting desperate. I have tried several workarounds, but so far nothing has worked :(. When I try your code I get the following error message after entering data = ...: Error in eval(expr, envir, enclos) : object 'unixReviewTime' not found...any ideas? – Paul Aug 27 '15 at 16:47
  • The %>% operator passes the result of the previous command to the next one. So the easiest way to debug the statement is to run it starting at data =, adding one line at a time, and check each time whether the resulting data object is what you expect. Key places are after fromJSON (so you can look at the list structure) and after bind_rows (so you can look at the data frame). – bramtayl Aug 27 '15 at 19:13
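
The step-by-step debugging described in the last comment could look roughly like the sketch below. The `records` list is a hypothetical stand-in for what `fromJSON` returns, and `do.call(rbind, .)` is used here in place of `bind_rows` only to keep the example free of extra packages.

```r
library(magrittr)

# Hypothetical stand-in for the list that fromJSON returns
records <- list(
  list(reviewerID = "A1", overall = 5),
  list(reviewerID = "A2", overall = 4)
)

# Step 1: run only the first stage and inspect the result
step1 <- records %>% lapply(as.data.frame)
str(step1[[1]])

# Step 2: add the next stage once step 1 looks right
step2 <- step1 %>% do.call(rbind, .)
names(step2)  # check that the columns later stages select really exist
```

If a column such as `unixReviewTime` is missing from `names()` at this point, the problem lies in an earlier stage, not in the `select` call that reports the error.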