That's a [n]ewline [d]elimited [json] (ndjson) file, which is exactly what the ndjson package was tailor-made for. Said package is measurably faster than jsonlite::stream_in() and produces a "completely flat" data frame. That latter part ("completely flat") isn't always what folks really need, as it can make for a very wide structure (in your case 1,012 columns, since it expanded all the nested components), but you get what you need fast without having to unnest anything on your own.
The output of str() or even glimpse() is too large to show here, but this is how you use it.
NOTE that I renamed your file since .json.gz is generally how ndjson is stored (and my package can handle gzip'd json files):
library(ndjson)
library(tidyverse)
twdf <- as_tibble(ndjson::stream_in("~/Desktop/pashwar-test.json.gz")) # tbl_df() is deprecated in current dplyr
## dim(twdf)
## [1] 75008 1012
Having said that…
I was alternatively going to suggest using Apache Drill, since you have many of these files and they're relatively big. Drill would let you (ultimately) convert these to parquet and significantly speed things up, and there's a package to interface with Drill (sergeant):
library(sergeant)
library(tidyverse)
db <- src_drill("dbserver")
twdf <- tbl(db, "dfs.json.`pashwar-test.json.gz`")
glimpse(twdf)
## Observations: 25
## Variables: 28
## $ extended_entities <chr> "{\"media\":[]}", "{\"media\":[]}", "{\"m...
## $ quoted_status <chr> "{\"entities\":{\"hashtags\":[],\"symbols...
## $ in_reply_to_status_id_str <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ created_at <chr> "Tue Dec 16 10:13:47 +0000 2014", "Tue De...
## $ in_reply_to_user_id_str <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ source <chr> "<a href=\"http://twitter.com/download/an...
## $ retweeted_status <chr> "{\"created_at\":\"Tue Dec 16 09:28:17 +0...
## $ quoted_status_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ retweet_count <int> 220, 109, 9, 103, 0, 398, 0, 11, 472, 88,...
## $ retweeted <chr> "false", "false", "false", "false", "fals...
## $ geo <chr> "{\"coordinates\":[]}", "{\"coordinates\"...
## $ is_quote_status <chr> "false", "false", "false", "false", "fals...
## $ in_reply_to_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ id_str <dbl> 5.447975e+17, 5.447975e+17, 5.447975e+17,...
## $ in_reply_to_user_id <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ favorite_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ id <dbl> 5.447975e+17, 5.447975e+17, 5.447975e+17,...
## $ text <chr> "RT @afneil: Heart-breaking beyond words:...
## $ place <chr> "{\"bounding_box\":{\"coordinates\":[]},\...
## $ lang <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ favorited <chr> "false", "false", "false", "false", "fals...
## $ possibly_sensitive <chr> NA, "false", NA, "false", NA, "false", NA...
## $ coordinates <chr> "{\"coordinates\":[]}", "{\"coordinates\"...
## $ truncated <chr> "false", "false", "false", "false", "fals...
## $ entities <chr> "{\"user_mentions\":[{\"screen_name\":\"a...
## $ quoted_status_id_str <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ user <chr> "{\"id\":25968369,\"id_str\":\"25968369\"...
BUT you've managed to create really inconsistent JSON. Not all fields with nested content are consistently represented that way, and newcomers to Drill will find it somewhat challenging to craft bulletproof SQL that will help them unnest that data across all scenarios.
If you only need the data from the "already flat" bits, give Drill a try.
If you need the nested data and don't want to fight with unnesting from jsonlite::stream_in() or struggle with Drill unnesting, then I'd suggest using ndjson as noted in the first example and then carving out the bits you really need into more manageable, tidy data frames.
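For the "carve out the bits you really need" step, here is a rough sketch of what I mean. It uses a tiny made-up tibble in place of the real 1,012-column result, and the dot-separated column names (user.screen_name and friends) are my assumption about how the flattened names look, so check names(twdf) on your own data before relying on any of them:

```r
library(dplyr)

# Toy stand-in for the flattened ndjson output. All values and column
# names here are hypothetical; inspect names(twdf) on the real data.
twdf <- tibble(
  id_str = c("1001", "1002"),
  text = c("first tweet", "second tweet"),
  lang = c("en", "en"),
  user.id_str = c("501", "502"),
  user.screen_name = c("someone", "someone_else"),
  `entities.hashtags.0.text` = c("rstats", NA)
)

# One narrow table per "thing", each keyed by the tweet id so the
# pieces can be joined back together later.
tweets <- twdf %>% select(id_str, text, lang)
users <- twdf %>% select(id_str, starts_with("user."))
hashtags <- twdf %>% select(id_str, starts_with("entities.hashtags."))
```

Keeping id_str in every piece is the main design choice here: it costs one extra column per table and means you never have to go back to the wide data frame to relate users or hashtags to their tweets.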