Parse JSON column in Pandas or R Dataframe

Question

I'm dealing with a CSV with JSON data for the first time. I have uploaded the CSV file into Python (and I try also in R) as a dataframe. The CSV looks like this:

This file is about sales. In every row I have the purchase details. One client can buy several products. I have lots of columns with details like delivery address, payment, etc. The column ITEMS has JSON data inside (in Python is an Object type). What I want is to flatten that column into several columns and join them into the original dataframe. The JSON looks like this (Notice that this client have bought two items, so I'll have to join the new columns and "repeat" the row with the client info):

[
   {
      "_id": {
         "$oid": "5fff7d777bdb300010b1f172"
      },
      "product": {
         "productCategories": [
            {
               "$oid": "2afe7b8b3a2c4e1b4a8b22f5"
            },
            {
               "$oid": "ca9bf1e1d9adf5b170b12fae"
            }
         ],
         "_id": {
            "$oid": "5ffc9c35fc8985001803d8b5"
         },
         "title": "Rice",
         "productType": {
            "$oid": "e1f2a11798b7f81ce44a5696"
         },
         "store": {
            "$oid": "5ffb729242ad2200107dcb42"
         },
         "price": 50,
         "stockMethod": {
            "_id": {
               "$oid": "5ffc9c49fc8985001803d8b7"
            },
            "value": "always"
         },
         "stock": null,
         "showInStore": true,
         "available": true,
         "createdAt": {
            "$date": "2021-01-11T18:43:01.103Z"
         },
         "modifiedAt": {
            "$date": "2021-01-13T23:08:39.102Z"
         },
         "__v": 0
      },
      "quantity": 1
   },
   {
      "_id": {
         "$oid": "5fff7d777bdb300010b1f175"
      },
      "product": {
         "productCategories": [
            {
               "$oid": "d4abd6888580e00d9f746cac"
            },
            {
               "$oid": "a018049c6fc0253ee5aa6a8f"
            }
         ],
         "_id": {
            "$oid": "5ffcaa90fc8985001803d97c"
         },
         "title": "Beans",
         "productType": {
            "$oid": "e1f2a11798b7f81ce44a5696"
         },
         "store": {
            "$oid": "5ffb729242ad2200107dcb42"
         },
         "price": 130,
         "stockMethod": {
            "_id": {
               "$oid": "5ffcaa90fc8985001803d97d"
            },
            "value": "alwasy"
         },
         "stock": null,
         "showInStore": true,
         "available": true,
         "createdAt": {
            "$date": "2021-01-11T19:44:16.004Z"
         },
         "modifiedAt": {
            "$date": "2021-01-13T23:08:39.102Z"
         },
         "__v": 0
      },
      "quantity": 1
   }
]

I have tried different solutions like this and this but I can't solve it. For example, I've tried:

import json
from io import StringIO
stdf = orders['items'].apply(json.loads)
stlst = list(stdf)
stjson = json.dumps(stlst)
pd.read_json(StringIO(stjson))

And I get something like:

danlooo · Answer 1 · 2021-11-24T15:59:56.510

Assuming you have a file called data.json you can do in R:

library(jsonlite)
library(tidyverse)

read_json("data.json", simplifyVector = TRUE) %>%
  as_tibble()
#> # A tibble: 2 x 3
#>   `_id`$`$oid`  product$productC… $`_id`$`$oid` $title $productType$`$… quantity
#>   <chr>         <list>            <chr>         <chr>  <chr>               <int>
#> 1 5fff7d777bdb… <df [2 × 1]>      5ffc9c35fc89… Rice   e1f2a11798b7f81…        1
#> 2 5fff7d777bdb… <df [2 × 1]>      5ffcaa90fc89… Beans  e1f2a11798b7f81…        1

^{Created on 2021-11-24 by the reprex package (v2.0.1)}

And for json data inside a csv file:

library(tidyverse)
library(jsonlite)
#> 
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#> 
#>     flatten

data <- tibble(
  id = c(1,2),
  data = c("{\"foo\":1, \"bar\":2}", "{\"foo\":10, \"bar\":20}")
)
data
#> # A tibble: 2 x 2
#>      id data                      
#>   <dbl> <chr>                     
#> 1     1 "{\"foo\":1, \"bar\":2}"  
#> 2     2 "{\"foo\":10, \"bar\":20}"

data %>%
  mutate(
    data = data %>% map(~ parse_json(.x, simplifyVector = TRUE))
  ) %>%
  unnest_wider(data)
#> # A tibble: 2 x 3
#>      id   foo   bar
#>   <dbl> <int> <int>
#> 1     1     1     2
#> 2     2    10    20

^{Created on 2021-11-24 by the reprex package (v2.0.1)}

I don't have a JSON file, I have a CSV file with one column with JSON format. The rest of the columns are integers, strings, etc. — Pablo, Nov 24 '21 at 13:33
that works. I'm trying to figure out how to do it in Python, but for now I can use R first and then export the file to Python — Pablo, Nov 29 '21 at 19:30

Parse JSON column in Pandas or R Dataframe

1 Answers1