Is there any way to transform this data format into CSV?

Question

I have a large amount of extracted Json file with a format attached. I want to know if there is any way to convert it into CSV with column as the feature and values in the row.

{"state": "New Jersey", "text": "RT @joncoopertweets: Register to join the #WeThePeopleMarch on September 21st in Washington, D.C. \u2014 or one of the 50+ marches that will be\u2026", "has_emoji": false, "created_at": "Mon Sep 02 16:32:05 +0000 2019", "id": 1168562246349467649, "entities": {"hashtags": [{"text": "WeThePeopleMarch", "indices": [42, 59]}], "urls": [], "user_mentions": [{"screen_name": "joncoopertweets", "name": "Jon Cooper", "id": 27493883, "id_str": "27493883", "indices": [3, 19]}], "symbols": []}, "source": "Twitter for iPad", "location": "Leonia, NJ", "verified": false, "geocode": null}
{"state": "Indiana", "text": "RT @dariusherron1: Don\u2019t nobody love they girl like Mexicans ", "has_emoji": false, "created_at": "Mon Sep 02 16:32:05 +0000 2019", "id": 1168562246378827776, "entities": {"hashtags": [], "urls": [{"url": "", "expanded_url": "", "display_url": "", "indices": [61, 84]}], "user_mentions": [{"screen_name": "dariusherron1", "name": "Darius Herron", "id": 1680891876, "id_str": "1680891876", "indices": [3, 17]}], "symbols": []}, "source": "Twitter for iPhone", "location": "Indianapolis, IN", "verified": false, "geocode": null}

See `?stream_in` from *jsonlite* as well, as used here in an old answer of mine: https://stackoverflow.com/a/31575437/496803 - This will convert straight to a `data.frame` — thelatemail, Sep 08 '19 at 23:03
The entries for `entities` are nested; what is your expected output for that key? — Maurits Evers, Sep 09 '19 at 00:11
This looks like data from the Twitter API? In which case I recommend using the [rtweet](https://rtweet.info/) package, which will take care of conversion to a data frame for you. — neilfws, Sep 09 '19 at 00:30

score 1 · Accepted Answer · answered Sep 09 '19 at 01:08

I'm not entirely clear on your expected output (see the comments and discussions to @user5783745's answer). Your JSON strings contain some nested objects which will give rise to a nested list structure if you use jsonlite::fromJSON. Since you don't provide matching expected output for the sample data you give, there may be different ways to handle these nested entries.

A possibility is to parse the JSON strings, and then flatten the resulting list twice before binding the rows.

library(tidyverse)
library(jsonlite)
map(json, ~fromJSON(.x) %>% flatten() %>% flatten()) %>% bind_rows()
## A tibble: 2 x 15
#  state text  has_emoji created_at     id indices screen_name name  id_str
#  <chr> <chr> <lgl>     <chr>       <dbl> <list>  <chr>       <chr> <chr>
#1 New … WeTh… FALSE     Mon Sep 0… 2.75e7 <int [… joncoopert… Jon … 27493…
#2 Indi… "RT … FALSE     Mon Sep 0… 1.68e9 <int [… dariusherr… Dari… 16808…
## … with 6 more variables: source <chr>, location <chr>, verified <lgl>,
##   url <chr>, expanded_url <chr>, display_url <chr>

The resulting object is a tibble with some list columns. To store as a CSV you could then exclude those list columns.

Sample data

json <- c(
    '{"state": "New Jersey", "text": "RT @joncoopertweets: Register to join the #WeThePeopleMarch on September 21st in Washington, D.C. \u2014 or one of the 50+ marches that will be\u2026", "has_emoji": false, "created_at": "Mon Sep 02 16:32:05 +0000 2019", "id": 1168562246349467649, "entities": {"hashtags": [{"text": "WeThePeopleMarch", "indices": [42, 59]}], "urls": [], "user_mentions": [{"screen_name": "joncoopertweets", "name": "Jon Cooper", "id": 27493883, "id_str": "27493883", "indices": [3, 19]}], "symbols": []}, "source": "Twitter for iPad", "location": "Leonia, NJ", "verified": false, "geocode": null}',
    '{"state": "Indiana", "text": "RT @dariusherron1: Don\u2019t nobody love they girl like Mexicans ", "has_emoji": false, "created_at": "Mon Sep 02 16:32:05 +0000 2019", "id": 1168562246378827776, "entities": {"hashtags": [], "urls": [{"url": "", "expanded_url": "", "display_url": "", "indices": [61, 84]}], "user_mentions": [{"screen_name": "dariusherron1", "name": "Darius Herron", "id": 1680891876, "id_str": "1680891876", "indices": [3, 17]}], "symbols": []}, "source": "Twitter for iPhone", "location": "Indianapolis, IN", "verified": false, "geocode": null}')

@user5783745 Not very elegant;-) I think neilfws' suggestion to use `rtweet` for handling the conversions is the way to go. The post you dug up on collapsing nested lists is very useful though in more general scenarios. — Maurits Evers, Sep 09 '19 at 01:14
Perhaps `rtweet` for this case, but a reliable and generalised way of converting randomly shaped lists into data.frames would be very helpful. So much web data is in json these days but so much of the analysis tools in R focus on dataframes. The conversion is very important. I have learned some nice tricks today — stevec, Sep 09 '19 at 01:17

stevec · Answer 2 · 2019-09-09T00:06:14.510

0

You can easily convert it into a data format that's easier to work with (a list), but what you do with it thereafter is up to you. In this case the list of data doesn't automatically turn into a data.frame - you would have to think about how to convert it (given some list items are single items whereas other list items are themselves data.frames

a <- '{"state": "New Jersey", "text": "RT @joncoopertweets: Register to join the #WeThePeopleMarch on September 21st in Washington, D.C. \u2014 or one of the 50+ marches that will be\u2026", "has_emoji": false, "created_at": "Mon Sep 02 16:32:05 +0000 2019", "id": 1168562246349467649, "entities": {"hashtags": [{"text": "WeThePeopleMarch", "indices": [42, 59]}], "urls": [], "user_mentions": [{"screen_name": "joncoopertweets", "name": "Jon Cooper", "id": 27493883, "id_str": "27493883", "indices": [3, 19]}], "symbols": []}, "source": "Twitter for iPad", "location": "Leonia, NJ", "verified": false, "geocode": null}' 

library(jsonlite)
library(dplyr)
a <- a %>% fromJSON 

new_dataframe <- data.frame(state=character(), 
                                text=character(), 
                                has_emoji=character(), 
                                id=character(), 
                                entities=character(), stringsAsFactors = FALSE)


new_dataframe[1, ] <- c(a$state, a$text, a$has_emoji, a$created_at, a$id)

edited Sep 09 '19 at 00:06

answered Sep 08 '19 at 22:52

stevec

41,291
27
223
311

`Error: ... arguments imply differing number of rows: 1, 0` – r2evans Sep 08 '19 at 23:55
@r2evans I thought I deleted the as.data.frame bit as it obviously won’t work, data must manually be assigned from the list to a data.frame – stevec Sep 08 '19 at 23:56
@ylxmm320 this answer should do what you're after – stevec Sep 09 '19 at 00:06
@r2evans I put the code in to explicitly convert from the list to dataframe – stevec Sep 09 '19 at 00:07
@user5783745 Not my down vote but I'm not sure this is what OP is after. Hard-coding the process of extracting entries and storing them in a `data.frame` seems very cumbersome. There are also nested entries in the parsed JSON file that you have omitted (again not sure what OP wants). A minor (and somewhat nitpicking) comment: In R a `data.frame` is just a `list` with some "extra features", so I wouldn't say that a `data.frame` is easier to work with than a `list`. In fact, all methods that work on `list`s will work on `data.frame`s as well. – Maurits Evers Sep 09 '19 at 00:23
@MauritsEvers check out the excel pic the OP posted, I think it's what they're. This answer does it for one JSON glob / data.frame row, they'd have to loop through each glob/row – stevec Sep 09 '19 at 00:25
@MauritsEvers there are simple ways of transforming lists into `data.frames` but check out `a$entities %>% class # [1] list` things could get weird when a list is coerced into a dataframe element. – stevec Sep 09 '19 at 00:26
@user5783745 The screenshot contains a column with ellipsis which I took to denote further columns. Otherwise this is rather trivial. The additional difficulty from the `list` to `data.frame` conversion stems from the nested elements in the `list`s. – Maurits Evers Sep 09 '19 at 00:26
@MauritsEvers if that's what OP wants I think they have to think about how to transform the nested list into a df? As in, I don't think there's just a standard way of doing so? I would be interested to know if there is – stevec Sep 09 '19 at 00:27
@user5783745 It would help if OP were to clarify on the expected output. Best to include matching output for the sample data he gives in a reproducible format (not as a screenshot). – Maurits Evers Sep 09 '19 at 00:30
@MauritsEvers I think using a screenshot is okay, but OP must specify how they want at least one complete row – stevec Sep 09 '19 at 00:30
1

@user5783745 No, a screenshot is [never a good idea](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors). We actively discourage (new) users from posting screenshots of code/data/errors here on SO. It makes it difficult/impossible to extract data and may break screen readers on many mobile devices. Also as the lengthy comments here demonstrate they may give rise to unclear or ambiguous problem statements. – Maurits Evers Sep 09 '19 at 00:37
@MauritsEvers prior to the screenshots, OP offered no info about the data.frame. In that sense its inclusion is useful. In terms of converting lists to dataframes, is there a general function/method that does so? (one that caters to nested lists) – stevec Sep 09 '19 at 00:43
It was my downvote, since retracted. Motivators for it: erring/incomplete answer (fixed) and slightly due to unnecessary complexity of incorporating `dplyr` into this answer (the user never suggested tidyverse-packages, and even with that this is merely a "simplification" *from* `fromJSON(a)` *to* `a %>% fromJSON`, hardly more-readable, IMHO). – r2evans Sep 09 '19 at 00:44
1

@user5783745 I think we all agree on the fact that OP needs to add some clarification. In terms of flattening nested lists, nothing is impossible in R;-) `purrr::flatten`, `purrr::compact` and `dplyr::bind_rows` may be your friends here, perhaps in combination with `purrr::map`ping over nested `list` entries. It all depends on the specifics of OPs expected output. – Maurits Evers Sep 09 '19 at 00:47
@MauritsEvers I wonder if there's a way of using `flatten` or similar to recursively (and without needing to think) simply transform entire lists (with/without nested lists) into a single data.frame? Since `a$entities %>% purrr::flatten() %>% as.data.frame` transforms into a 2 row df, presumably the JSON glob in the OP's question would result in a 2 row df, but theoretically could be as many rows as necessary. – stevec Sep 09 '19 at 00:51
@MauritsEvers [this](https://stackoverflow.com/questions/47603578/flatten-recursive-list/47620571) is the closest I can find. But I can't get any of the answers to work well on OP's data – stevec Sep 09 '19 at 00:56

Is there any way to transform this data format into CSV?

2 Answers2

Sample data