Parsing JSON arrays from a .txt file in R - several large files

Question

I have recently been downloading large quantities of Tweets from Twitter. My starting point is around 400 .txt files containing Tweet IDs. I run a tool that scrapes the Tweets from Twitter using those IDs, and for every .txt file of Tweet IDs I get a very large .txt file containing JSON strings. Each JSON string contains all of the information about one Tweet. Below is a link to my OneDrive, which contains the file I am working on (once I get this to work, I will apply the code to the other files):

https://1drv.ms/t/s!At39YLF-U90fhKAp9tIGJlMlU0qcNQ

I have been trying to parse each JSON string in each file, but with no success. My aim is to convert each file into a large data frame in R, where each row is a Tweet and each column is a feature of the Tweet. Given their nature, the 'text' column will be very large (it will contain the body of the Tweet), whereas 'location' will be short. Each JSON string is formatted in the same way and there can be up to a million strings per file.

I have tried several methods (shown below) to obtain what I need with no success:

library('RJSONIO')
library('RCurl')
json_file <- fromJSON("Pashawar_test.txt")
json_file2 = RJSONIO::fromJSON(json_file)

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘fromJSON’ for signature ‘"list", "missing"’

My other attempt:

library('RJSONIO')
json_file <- fromJSON("Pashawar_test.txt")
text <- json_file[['text']]
idstr <- json_file[['id_str']]

This code seems to parse only the first JSON string in the file. I say this because when I attempt to select 'text' or 'id_str', I only get one instance. It's also worth pointing out that 'json_file' is a large list that is 52.7 MB in size, whereas the source file is 335 MB.

score 2 · Answer 1 · answered Feb 07 '18 at 11:20

Try the stream_in function of the jsonlite package. Your file contains one JSON object per line. Either read the file line by line and convert each line with fromJSON, or use stream_in directly, which is made for handling exactly this kind of file or connection.

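library(jsonlite)
filepath <- "path/to/your/file"

# Method A: read every line, then parse each one separately.
content <- readLines(filepath)
# This will take a while; the result is a list with one element per line.
res <- lapply(content, fromJSON)

# Method B: use stream_in on a connection to the file.
con <- file(filepath, open = "rt")
# This will take a while.
res <- stream_in(con)
# The connection was opened explicitly above, so close it explicitly too.
close(con)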

Notice that stream_in will also simplify the result, coercing it to a data.frame, which might be handier.
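As a small illustration of method B, the sketch below writes two made-up Tweet-like records to a temporary file and reads them back with stream_in. The field names and values are invented for the example. It also uses jsonlite::flatten(), which is one possible way (an assumption about what is convenient here, not something the answer prescribes) to turn a nested object such as 'user' into ordinary columns.

```r
library(jsonlite)

# Two made-up records, one JSON object per line (hypothetical data).
lines <- c(
  '{"id_str":"1","text":"first tweet","user":{"screen_name":"alice","location":"Leeds"}}',
  '{"id_str":"2","text":"second tweet","user":{"screen_name":"bob","location":"York"}}'
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# The file connection is not open yet, so stream_in opens and closes it itself.
res <- stream_in(file(tmp), verbose = FALSE)

# 'user' comes back as a nested data frame column; flatten() spreads it
# into the columns user.screen_name and user.location.
flat <- jsonlite::flatten(res)
names(flat)
```

Nested columns of this kind are worth checking for when a result that looks like a normal data frame behaves oddly in a viewer.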

answered Feb 07 '18 at 11:20

nicola

nicola - thank you for your suggestion. I ran the code you suggested and it created 'res', which in the top-right window is a 'data' with 75008 observations of 30 variables. When I try to view this, the terminal says: "Error in View : 'names' attribute [1] must be the same length as the vector [0]". Can you suggest what I can do to overcome this? I am aiming to see each parsed JSON string as a row, with each variable as a column. Then I can scroll down and begin to identify what I should keep and discard. – Christopher Loynes Feb 07 '18 at 11:46

score 1 · Answer 2 · answered Feb 07 '18 at 13:30

That's a [n]ewline [d]elimited [json] (ndjson) file, which is what the ndjson package was tailor-made for. Said package is measurably faster than jsonlite::stream_in() and produces a "completely flat" data frame. That latter part ("completely flat") isn't always what folks really need, as it can make for a very wide structure (in your case 1,012 columns, since it expanded all the nested components), but you get what you need fast without having to unnest anything on your own.

The output of str() or even glimpse() is too large to show here but this is how you use it.

NOTE that I renamed your file since .json.gz is generally how ndjson is stored (and my package can handle gzip'd json files):

library(ndjson)
library(tidyverse)

# as_tibble() replaces the deprecated tbl_df()
twdf <- as_tibble(ndjson::stream_in("~/Desktop/pashwar-test.json.gz"))
## dim(twdf)
## [1] 75008  1012
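To make "completely flat" concrete, here is a minimal sketch using ndjson::flatten(), which parses a character vector of JSON lines held in memory rather than a file on disk. The two records are made up for illustration and are far smaller than real Tweet objects.

```r
library(ndjson)

# Two made-up records (hypothetical data), one JSON document per element.
lines <- c(
  '{"id_str":"1","text":"first tweet","user":{"screen_name":"alice","location":"Leeds"}}',
  '{"id_str":"2","text":"second tweet","user":{"screen_name":"bob","location":"York"}}'
)

# flatten() expands nested objects into dot-separated column names;
# as.data.frame() converts the result to a plain data frame.
flat <- as.data.frame(ndjson::flatten(lines))
sort(names(flat))
```

Each nested field becomes its own column (for example user.screen_name), which is why a full Tweet object expands to so many columns.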

Having said that…

I was alternatively going to suggest using Apache Drill since you have many of these files and they're relatively big. Drill would let you (ultimately) convert these to parquet and significantly speed things up, and there's a package to interface with Drill (sergeant):

library(sergeant)
library(tidyverse)

db <- src_drill("dbserver")
twdf <- tbl(db, "dfs.json.`pashwar-test.json.gz`")

glimpse(twdf)
## Observations: 25
## Variables: 28
## $ extended_entities         <chr> "{\"media\":[]}", "{\"media\":[]}", "{\"m...
## $ quoted_status             <chr> "{\"entities\":{\"hashtags\":[],\"symbols...
## $ in_reply_to_status_id_str <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_id     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ created_at                <chr> "Tue Dec 16 10:13:47 +0000 2014", "Tue De...
## $ in_reply_to_user_id_str   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ source                    <chr> "<a href=\"http://twitter.com/download/an...
## $ retweeted_status          <chr> "{\"created_at\":\"Tue Dec 16 09:28:17 +0...
## $ quoted_status_id          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ retweet_count             <int> 220, 109, 9, 103, 0, 398, 0, 11, 472, 88,...
## $ retweeted                 <chr> "false", "false", "false", "false", "fals...
## $ geo                       <chr> "{\"coordinates\":[]}", "{\"coordinates\"...
## $ is_quote_status           <chr> "false", "false", "false", "false", "fals...
## $ in_reply_to_screen_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ id_str                    <dbl> 5.447975e+17, 5.447975e+17, 5.447975e+17,...
## $ in_reply_to_user_id       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ favorite_count            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ id                        <dbl> 5.447975e+17, 5.447975e+17, 5.447975e+17,...
## $ text                      <chr> "RT @afneil: Heart-breaking beyond words:...
## $ place                     <chr> "{\"bounding_box\":{\"coordinates\":[]},\...
## $ lang                      <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ favorited                 <chr> "false", "false", "false", "false", "fals...
## $ possibly_sensitive        <chr> NA, "false", NA, "false", NA, "false", NA...
## $ coordinates               <chr> "{\"coordinates\":[]}", "{\"coordinates\"...
## $ truncated                 <chr> "false", "false", "false", "false", "fals...
## $ entities                  <chr> "{\"user_mentions\":[{\"screen_name\":\"a...
## $ quoted_status_id_str      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ user                      <chr> "{\"id\":25968369,\"id_str\":\"25968369\"...

BUT

you've managed to create really inconsistent JSON. Not all fields with nested content are consistently represented that way and newcomers to Drill will find it somewhat challenging to craft bulletproof SQL that will help them unnest that data across all scenarios.

If you only need the data from the "already flat" bits, give Drill a try.

If you need the nested data and don't want to fight with unnesting the output of jsonlite::stream_in() or struggle with unnesting in Drill, then I'd suggest using ndjson as noted in the first example and then carving out the bits you really need into more manageable, tidy data frames.
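One way that carving-out step could look with dplyr is sketched below. The small tibble is a hypothetical stand-in for the wide frame that ndjson produces, and the column names are assumptions modelled on its dot-separated naming, so adjust them to whatever names(twdf) actually shows.

```r
library(dplyr)

# Hypothetical stand-in for the wide, flat frame (real data has ~1,000 columns).
twdf <- tibble(
  id_str = c("1", "2"),
  text = c("first tweet", "second tweet"),
  lang = c("en", "en"),
  user.screen_name = c("alice", "bob"),
  user.location = c("Leeds", "York"),
  user.followers_count = c(10, 250)
)

# Tweet-level fields in one tidy frame.
tweets <- select(twdf, id_str, text, lang)

# User-level fields in another, keeping id_str so the two can be joined later.
users <- select(twdf, id_str, starts_with("user."))
```

Keeping a shared key such as id_str in every piece makes it possible to join the smaller frames back together when needed.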

hrbrmstr - I do not know how to convert this to a .json.gz. I ran it by adjusting the name to the .txt file. What is the benefit of changing it to a .json.gz file and, if there is one, how do I convert it? Also, now that I can see the data, I only want to select specific features. Is there a way you would recommend doing this? – Christopher Loynes, Feb 07 '18 at 14:23
The conversion isn't really important (it just saves space and can speed up parsing). You need to find a `gzip` utility (I'm going to make an assumption you're on Windows, and while I know Windows it'd mean figuring out which place to send you to get gzip vs just pointing you to the built-in utilities in saner/better operating systems). Now that you have a data frame, just select out the columns like you would with any data frame. – hrbrmstr, Feb 07 '18 at 16:09
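For what it's worth, base R can also write gzip-compressed files through a gzfile() connection, so a separate utility is not strictly required. The following is a minimal sketch on a made-up two-line file; the paths are temporary placeholders, and because it reads the whole file into memory it assumes the file fits in RAM.

```r
# Write a tiny made-up ndjson file to stand in for the real .txt file.
src <- tempfile(fileext = ".txt")
writeLines(c('{"id_str":"1"}', '{"id_str":"2"}'), src)

# Same name with a .json.gz extension (hypothetical naming choice).
dest <- sub("\\.txt$", ".json.gz", src)

# Writing through gzfile() compresses the lines on the way out.
out <- gzfile(dest, open = "wt")
writeLines(readLines(src), out)
close(out)

# Reading through gzfile() decompresses transparently.
inp <- gzfile(dest, open = "rt")
back <- readLines(inp)
close(inp)
```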
