I have a URL that returns content (file IDs) from a website's API in JSON format. To read it programmatically, I use the fromJSON(txt) function of the jsonlite package, which parses the JSON into an R object (a vector or a list, I'm not sure which).

This works perfectly on my home computer. However, when I run the identical code at work, fromJSON(txt) does not seem to recognise the URL as a URL and instead tries to parse the URL text itself as JSON, since I get the following error:

 Error: lexical error: invalid char in json text.
                                       https://api.gdc.cancer.gov/file
                     (right here) ------^

I have checked and rechecked my code and the URL numerous times. The URL works perfectly when pasted into a browser and returns JSON formatted text.

I have tried several alternatives, such as the unserializeJSON() of jsonlite package and fromJSON() of the RJSONIO package, the latter of which produces a different error.

I would appreciate any help in working out what is wrong...

Here is the relevant part of my code:

library(jsonlite)

# The URL (works fine in a browser):
urlForIDs <- "https://api.gdc.cancer.gov/files?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%0A%20%20%20%20%22op%22%3A%20%22and%22%2C%0A%20%20%20%20%22content%22%3A%20%5B%0A%20%20%20%20%20%20%20%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%22op%22%3A%20%22in%22%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22content%22%3A%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22field%22%3A%20%22cases.project.program.name%22%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22value%22%3A%20%22TCGA%22%0A%20%20%20%20%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20%7D%2C%0A%20%20%20%20%20%20%20%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%22op%22%3A%20%22and%22%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22content%22%3A%20%5B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22op%22%3A%20%22in%22%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22content%22%3A%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22field%22%3A%20%22cases.project.disease_type%22%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22value%22%3A%20%22%2ACarcinoma%22%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%7D%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22op%22%3A%20%22in%22%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22content%22%3A%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22field%22%3A%20%22cases.project.primary_site%22%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22value%22%3A%20%22Breast%22%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20%20%20%20%20%5D%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20
%5D%0A%7D%0A%2C%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22type%22%2C%22value%22%3A%22copy_number_segment%22%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22data_category%22%2C%22value%22%3A%22Copy%20Number%20Variation%22%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22data_type%22%2C%22value%22%3A%22Masked%20Copy%20Number%20Segment%22%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22experimental_strategy%22%2C%22value%22%3A%22Genotyping%20Array%22%7D%7D%5D%7D%5D%7D&fields=file_id&size=5000&related_files=false"

# Another URL I tried, which does the same thing. For this one I minified the JSON (removed the whitespace) before encoding it.
# The first URL worked in Chrome but not in Explorer. This one works in Explorer, but not in the fromJSON() function:
url2 <- "https://api.gdc.cancer.gov/files?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.program.name%22%2C%22value%22%3A%22TCGA%22%7D%7D%2C%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.disease_type%22%2C%22value%22%3A%22%2ACarcinoma%22%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.primary_site%22%2C%22value%22%3A%22Breast%22%7D%7D%5D%7D%5D%7D%2C%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22type%22%2C%22value%22%3A%22copy_number_segment%22%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22data_category%22%2C%22value%22%3A%22Copy%20Number%20Variation%22%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22data_type%22%2C%22value%22%3A%22Masked%20Copy%20Number%20Segment%22%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22experimental_strategy%22%2C%22value%22%3A%22Genotyping%20Array%22%7D%7D%5D%7D%5D%7D&fields=file_id&size=5000&related_files=false"

fileIDs <- fromJSON(urlForIDs) # I have tried changing parameters, such as 'simplifyVector = FALSE' but nothing seems to work.

# The following line is never reached, because the error occurs on the line above
fileIDs$data$hits$file_id

Perhaps the strangest thing is that the identical code, copied and pasted, worked fine on my home computer.

Thanks in advance.

Update: While debugging, I found that the problem arises when the following function in the jsonlite package is reached. It seems to check whether its input is a connection (such as an opened URL) and otherwise treats it as JSON text. For some reason, it enters the "else" branch. Here is the function:

function (txt, bigint_as_char = FALSE) 
{
    if (inherits(txt, "connection")) {
        parse_con(txt, bigint_as_char)
    }
    else {
        parse_string(txt, bigint_as_char)
    }
}
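
For what it's worth, the branch taken in that function depends only on the class of `txt`: a character string goes to `parse_string()`, and only a connection object goes to `parse_con()`. Below is a minimal sketch of the difference, assuming a reasonably recent jsonlite. It uses a temporary file in place of the real API so that it runs offline; the file and its contents are made up for illustration. In the real case the analogous call would presumably be `fromJSON(url(urlForIDs))`, which hands jsonlite a connection instead of relying on it to recognise the string as a URL (whether `url()` can open https links depends on how R was built).

```r
library(jsonlite)

# Hypothetical stand-in for the API response, written to a temporary file
path <- tempfile(fileext = ".json")
writeLines('{"data": {"hits": [{"file_id": "a1"}, {"file_id": "b2"}]}}', path)

# A character string that is neither JSON, a URL, nor an existing file
# is parsed as JSON text and therefore raises an error
bad <- tryCatch(fromJSON("not json at all"),
                error = function(e) conditionMessage(e))

# An explicit connection object is opened, read, and then parsed
con <- file(path)
res <- fromJSON(con)
ids <- res$data$hits$file_id
print(ids)
```

Passing a connection explicitly sidesteps whatever rule the installed version uses to decide that a string is a URL, which seems to be the step that fails here.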

Update #2: I realised that when I paste the link into Microsoft Edge or Internet Explorer, some of the URL gets cut off and then I get a message that it is not a valid JSON. I changed the default settings to use Chrome as a default browser, since Chrome doesn't cut it off. But it still doesn't work... Could this be the problem? Any suggestions?

Final Update: I wrote to the creator of the package, Jeroen Ooms, who advised me to install the package from GitHub, since the problem had been fixed there. That was more than a year ago, so I imagine the CRAN release no longer has this problem either. Thanks to all who replied!

Malka
  • it enters the 'else' part because the txt is not a `connection` object, so it's doing the correct thing in that instance – SymbolixAU Mar 09 '18 at 01:07
  • @SymbolixAU That's exactly the problem... It is a URL, and jsonlite's fromJSON() function can get a URL and use it to retrieve data and then read the JSON the URL returns. But for some reason on my work computer it doesn't identify the URL as such, and tries to read it as a JSON rather than use it to retrieve data and then read that data as a JSON, hence the error. I am trying to figure out which internet settings I would need to change on my work computer to get this to work like on my PC. Is there a way to automatically enable longer URLs? – Malka Mar 11 '18 at 23:55
  • 1
    I've seen issues with using `jsonlite` behind firewalls - does your company use a firewall as this could be the source of the problem? – SymbolixAU Mar 12 '18 at 00:04

2 Answers

You can read the text directly from the URL using readLines, manually assign it a 'json' class, then use jsonlite to convert to an R object.

Note: You'll get a couple of warnings about an incomplete final line.

res <- readLines(urlForIDs)
res2 <- readLines(url2)

class(res) <- "json"
class(res2) <- "json"

## View the raw JSON
jsonlite::prettify(res)
jsonlite::prettify(res2)

## convert to data.frame
df <- jsonlite::fromJSON(res)
df2 <- jsonlite::fromJSON(res2)

str(df)
# List of 2
# $ data    :List of 2
# ..$ hits      :'data.frame':  2223 obs. of  2 variables:
#   .. ..$ file_id: chr [1:2223] "2f22c96a-7b69-4e9c-96ac-be58fc2a79f1" "38d7d00a-594d-4bdc-a34c-660bfc195ff0" "03596a48-d4d1-4d8e-b76b-75fe8c0f0b75" "6bfe38b2-f0bb-4a79-83fd-b0c18c0f6a79" ...
# .. ..$ id     : chr [1:2223] "2f22c96a-7b69-4e9c-96ac-be58fc2a79f1" "38d7d00a-594d-4bdc-a34c-660bfc195ff0" "03596a48-d4d1-4d8e-b76b-75fe8c0f0b75" "6bfe38b2-f0bb-4a79-83fd-b0c18c0f6a79" ...
# ..$ pagination:List of 7
# .. ..$ count: int 2223
# .. ..$ sort : chr ""
# .. ..$ from : int 0
# .. ..$ page : int 1
# .. ..$ total: int 2223
# .. ..$ pages: int 1
# .. ..$ size : int 5000
# $ warnings: Named list()


str(df2)
# List of 2
# $ data    :List of 2
# ..$ hits      :'data.frame':  2223 obs. of  2 variables:
#   .. ..$ file_id: chr [1:2223] "2f22c96a-7b69-4e9c-96ac-be58fc2a79f1" "38d7d00a-594d-4bdc-a34c-660bfc195ff0" "03596a48-d4d1-4d8e-b76b-75fe8c0f0b75" "6bfe38b2-f0bb-4a79-83fd-b0c18c0f6a79" ...
# .. ..$ id     : chr [1:2223] "2f22c96a-7b69-4e9c-96ac-be58fc2a79f1" "38d7d00a-594d-4bdc-a34c-660bfc195ff0" "03596a48-d4d1-4d8e-b76b-75fe8c0f0b75" "6bfe38b2-f0bb-4a79-83fd-b0c18c0f6a79" ...
# ..$ pagination:List of 7
# .. ..$ count: int 2223
# .. ..$ sort : chr ""
# .. ..$ from : int 0
# .. ..$ page : int 1
# .. ..$ total: int 2223
# .. ..$ pages: int 1
# .. ..$ size : int 5000
# $ warnings: Named list()

Also, watch out for the length of the URL.
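
To make the length point concrete: percent-encoding turns every space and newline of a pretty-printed filter into three characters, so the same filter ends up much longer when it is encoded before being minified. Here is a small sketch using base R's `URLencode()`. The filter is a shortened, made-up example rather than the full query from the question, and whether a given length trips a limit depends on the jsonlite version installed, so treat the comparison as illustrative only.

```r
# Shortened, hypothetical filter; not the full query from the question
pretty <- '{
    "op": "in",
    "content": {
        "field": "cases.project.primary_site",
        "value": "Breast"
    }
}'

# Strip all whitespace; only safe when no string value contains spaces
minified <- gsub("[[:space:]]+", "", pretty)

base <- "https://api.gdc.cancer.gov/files?filters="
url_pretty   <- paste0(base, utils::URLencode(pretty, reserved = TRUE))
url_minified <- paste0(base, utils::URLencode(minified, reserved = TRUE))

# Compare the encoded lengths in bytes
url_bytes <- c(pretty   = nchar(url_pretty, type = "bytes"),
               minified = nchar(url_minified, type = "bytes"))
print(url_bytes)
```

Running `nchar(urlForIDs, type = "bytes")` and `nchar(url2, type = "bytes")` on the two URLs from the question gives the same comparison for the real query.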

SymbolixAU
  • Thanks! But the whole point is that the fromJSON() function is meant to be able to identify a URL, use it to retrieve data, and then read from the JSON which the URL returns. It works great on my PC, but for some reason on the work computer it doesn't identify the URL as such and therefore treats it like text, in which case it tries to read it like a JSON, and then throws an error because it cannot parse the URL as though it were a JSON. I think you are right that it has to do with the length of the URL, but on my PC it works & I can't figure out which settings to change to sort that limit out – Malka Mar 11 '18 at 23:49
  • This is a very common issue. The likely reason is that when the web page does not return valid JSON, fromJSON() falls back to reading its input as a character string. – Lazarus Thurston Feb 28 '21 at 12:42

Problem fixed (a year ago, but sharing in case it helps anyone else).

I wrote to the creator of the package, Jeroen Ooms, who advised me to install the package from GitHub, since the problem had been fixed there. That was more than a year ago, so I imagine the standard CRAN release no longer has this problem either.

To download from GitHub:

devtools::install_github("jeroen/jsonlite")
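
Before reinstalling, it may be worth checking which version is currently installed, so you can tell whether the machine that fails is simply running an older release than the one that works. A minimal sketch using base R only; the variable name `installed` is just for illustration.

```r
# Report the installed jsonlite version, or NA if the package is absent
installed <- if (requireNamespace("jsonlite", quietly = TRUE)) {
  as.character(utils::packageVersion("jsonlite"))
} else {
  NA_character_
}
print(installed)
```

Comparing this value on both computers shows whether a plain `install.packages("jsonlite")` from CRAN might already be enough.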
Malka