0

I have this dataset from Kaggle that is JSON.

beer <- fromJSON('recipes_full.txt')

It imports as a giant list. For example, each beer#, I'll start with 0, has a list of 19.

What 'beer' looks likeEach of those lists has a single value, except for 4 of them, which are different sized tables, like 8x2, 5x5, 2x4, ect. Expanded '0'

Example data:

    $`0`
$`0`$name
[1] "Vanilla Cream Ale"

$`0`$url
[1] "/homebrew/recipe/view/1633/vanilla-cream-ale"

$`0`$method
[1] "All Grain"

$`0`$style
[1] "Cream Ale"

$`0`$batch
[1] 21.8

$`0`$og
[1] 1.055

$`0`$fg
[1] 1.013

$`0`$abv
[1] 5.48

$`0`$ibu
[1] 19.44

$`0`$color
[1] 4.83

$`0`$`ph mash`
[1] -1

$`0`$fermentables
     [,1]    [,2]                                  [,3] [,4]  [,5]  
[1,] "2.381" "American - Pale 2-Row"               "37" "1.8" "44.7"
[2,] "0.907" "American - White Wheat"              "40" "2.8" "17"  
[3,] "0.907" "American - Pale 6-Row"               "35" "1.8" "17"  
[4,] "0.227" "Flaked Corn"                         "40" "0.5" "4.3" 
[5,] "0.227" "American - Caramel / Crystal 20L"    "35" "20"  "4.3" 
[6,] "0.227" "American - Carapils (Dextrine Malt)" "33" "1.8" "4.3" 
[7,] "0.113" "Flaked Barley"                       "32" "2.2" "2.1" 
[8,] "0.34"  "Honey"                               "42" "2"   "6.4" 

$`0`$hops
     [,1] [,2]      [,3]     [,4]  [,5]   [,6]     [,7]    [,8]  
[1,] "14" "Cascade" "Pellet" "6.2" "Boil" "60 min" "11.42" "33.3"
[2,] "14" "Cascade" "Pellet" "6.2" "Boil" "20 min" "6.92"  "33.3"
[3,] "14" "saaz"    "Pellet" "3"   "Boil" "5 min"  "1.1"   "33.3"

$`0`$`hops Summary`
     [,1] [,2]               [,3]    [,4]  
[1,] "28" "Cascade (Pellet)" "18.34" "66.6"
[2,] "14" "saaz (Pellet)"    "1.1"   "33.3"

$`0`$other
     [,1]     [,2]                           [,3]     [,4]        [,5]     
[1,] "2 oz"   "pure vanilla extract"         "Flavor" "Boil"      "0 min." 
[2,] "1 oz"   "pure vanilla extract"         "Flavor" "Bottling"  "0 min." 
[3,] "1 tsp"  "yeast nutrient"               "Other"  "Boil"      "15 min."
[4,] "1 each" "whirlfloc"                    "Fining" "Boil"      "15 min."
[5,] "4 each" "Vanilla beans - in 2oz Vodka" "Other"  "Secondary" "0 min." 

$`0`$yeast
[1] "Wyeast - Kölsch 2565" "76%"                  "Low"                 
[4] "56"                   "70"                   "Yes"                 

$`0`$rating
[1] 0

$`0`$`num rating`
[1] 16

$`0`$views
[1] 289454

As you can see, everything but the "fermentables", "hops", "other", and "yeast" is easy to deal with. I have NO idea what to do with those tables though. I've been scouring StackOverflow and trying different methods, but most of them require the dataset being converted to a dataframe, but I'm getting 'blocked' by these tables. I imagine I would want to isolate them and try to turn it into long data, but I'm not sure how to isolate those. Any suggestions or libraries and documentation I can read up on? Thanks for any help! I've tried to deal with in python previous and still couldn't figure it out.

Edit: I know that I can access them individually via "beer[["0"]][["fermentables"]]", however I can't figure out how to access more than one at a time, which is also throwing me off.

dput() of data for the first beer:

list(`0` = list(name = "Vanilla Cream Ale", url = "/homebrew/recipe/view/1633/vanilla-cream-ale", 
method = "All Grain", style = "Cream Ale", batch = 21.8, 
og = 1.055, fg = 1.013, abv = 5.48, ibu = 19.44, color = 4.83, 
`ph mash` = -1L, fermentables = structure(c("2.381", "0.907", 
"0.907", "0.227", "0.227", "0.227", "0.113", "0.34", "American - Pale 2-Row", 
"American - White Wheat", "American - Pale 6-Row", "Flaked Corn", 
"American - Caramel / Crystal 20L", "American - Carapils (Dextrine Malt)", 
"Flaked Barley", "Honey", "37", "40", "35", "40", "35", "33", 
"32", "42", "1.8", "2.8", "1.8", "0.5", "20", "1.8", "2.2", 
"2", "44.7", "17", "17", "4.3", "4.3", "4.3", "2.1", "6.4"
), .Dim = c(8L, 5L)), hops = structure(c("14", "14", "14", 
"Cascade", "Cascade", "saaz", "Pellet", "Pellet", "Pellet", 
"6.2", "6.2", "3", "Boil", "Boil", "Boil", "60 min", "20 min", 
"5 min", "11.42", "6.92", "1.1", "33.3", "33.3", "33.3"), .Dim = c(3L, 
8L)), `hops Summary` = structure(c("28", "14", "Cascade (Pellet)", 
"saaz (Pellet)", "18.34", "1.1", "66.6", "33.3"), .Dim = c(2L, 
4L)), other = structure(c("2 oz", "1 oz", "1 tsp", "1 each", 
"4 each", "pure vanilla extract", "pure vanilla extract", 
"yeast nutrient", "whirlfloc", "Vanilla beans - in 2oz Vodka", 
"Flavor", "Flavor", "Other", "Fining", "Other", "Boil", "Bottling", 
"Boil", "Boil", "Secondary", "0 min.", "0 min.", "15 min.", 
"15 min.", "0 min."), .Dim = c(5L, 5L)), yeast = c("Wyeast - Kölsch 2565", 
"76%", "Low", "56", "70", "Yes"), rating = 0L, `num rating` = 16L, 
views = 289454L)
Basalty
  • 55
  • 8
  • I am looking to do some simple data exploration. For example, I'd like to see something like average 'fermentables' weight (which is the first column of the table for that input) for all IPA's. I added the dput() for you for the first beer. I don't have a preference for what it looks like, but I, personally, don't know how to access them recursively. I'd like to have the option of accessing it to make a new custom df, but I don't know what that process might look like. – Basalty May 13 '21 at 17:03
  • Not sure what happened @user2554330, your answer seems to have disappeared?! – Basalty May 14 '21 at 16:54
  • You're allowed to delete your own answers, though I thought @user2554330, summed up your challenge and an approach well. I think if you go into edit mode you'll see it there. I fitfully tried `RcppSimdJson`, that was wicked fast on parsing, but once de-parsed showed Kaggle was sending a database dump that can't be handled as data.frames (that variability in fermentables entries, and all). So perhaps figure out what database it came out of and ingest the json into it. At which point my head exploded. Kaggle is meant to prove you'll spend all your time data wrangling and make no pretty charts. – Chris May 14 '21 at 17:04
  • Ha, well, I just feel better that my head wasn't the only one exploding. The comment isn't show up though. Sad. It had a lot of other useful information, even if he didn't quite figure it out. – Basalty May 14 '21 at 19:01
  • For the purposes of R, he /you ID'd the troublesome 'tables' , pulled out the stuff that weren't the troublesome, and suggested making the troublesome their own objects, which gives a kind of analog of a relational database as lapply will leave them in the order presented. For your continued enjoyment, and probably allowing your to get past wrangling to charting and modelling, [postgres ingest json?](https://stackoverflow.com/questions/39224382/how-can-i-import-a-json-file-into-postgresql) or any other db, plenty discussed here. Have fun. – Chris May 14 '21 at 21:03

0 Answers0