
Downloaded Facebook data gives me a headache. It is highly nested (lists of lists), and not all lists are equally long. The data should become a flat matrix with one row per top-level list, i.e. each list and all of its sublists in a single row. So far I have explored three options.

Option 1: flatten from purrr

It flattens the data structure but scrambles it, so there is no way of knowing which text was posted when and with what kind of picture. According to the purrr reference manual, I cannot specify an element, e.g. timestamp, by which the lists should be flattened? I am thinking of something like the reshape2 package, which lets you define an ID variable by which the data is reshaped.

library(RJSONIO)
#read in data with utf-8 encoding else the German Umlaute won't display
dataRAW <- RJSONIO::fromJSON("C:/***file path***/FB rot 2.json",
                    encoding = 'utf-8', stringAsFactors = F)

dataRAWflat <- purrr::flatten(dataRAW) #scrambles data

--> I know that jsonlite has a flatten function for data read from JSON files. But fromJSON from jsonlite does not let me define the encoding, and the encoding needs to be defined, otherwise the German Umlaute are not displayed correctly. I also tried rjson without success. The text of the posts is key to the project. I spent a good amount of time figuring out how to display the Umlaute, so I am happy to help with that :-)
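One thing that might be worth trying, sketched here under assumptions: jsonlite::fromJSON also accepts the JSON text itself, so the file can first be read with readLines() and an explicit encoding. The temporary file and its content below are made up for illustration, and whether this displays the Umlaute of a real Facebook export correctly is untested.

```r
library(jsonlite)

# Hypothetical stand-in for the export: a small UTF-8 file with an Umlaut.
path <- tempfile(fileext = ".json")
writeLines(enc2utf8('[{"timestamp": 1, "post": "Gew\u00fcrze"}]'), path, useBytes = TRUE)

# Read the raw text with a declared encoding, then parse the string.
json_txt <- paste(readLines(path, encoding = "UTF-8", warn = FALSE), collapse = "\n")
posts <- fromJSON(json_txt, simplifyVector = FALSE)

posts[[1]]$post
```

With simplifyVector = FALSE the result stays a nested list, comparable in shape to what RJSONIO returns.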

Option 2: unnest_wider from tidyr
It gives an error message saying that the subscript should be numeric or character, but the list 'data' in dataRAW is a character. I am new to tibbles as a special kind of data frame. Do tibbles, like data frames, need to have equally long columns? What am I missing?

library(tibble)
tib <- tibble(dataRAW)
tib %>% tidyr:::unnest_wider(data)
Error: Must extract column with a single valid subscript.
x Subscript `var` has the wrong type `function`.
i It must be numeric or character.
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/vctrs_error_subscript_type>
Must extract column with a single valid subscript.
x Subscript `var` has the wrong type `function`.
i It must be numeric or character.
Backtrace:
  1. tib %>% tidyr:::unnest_wider(data)
  2. tidyr:::unnest_wider(., data)
  3. tidyselect::vars_pull(tbl_vars(data), !!enquo(col))
  4. tidyselect:::pull_as_location2(loc, n, vars)
 12. vctrs::vec_as_subscript2(i, arg = "var", logical = "error")
 13. vctrs:::result_get(...)
Run `rlang::last_trace()` to see the full context.
> rlang:::last_trace()
<error/vctrs_error_subscript_type>
Must extract column with a single valid subscript.
x Subscript `var` has the wrong type `function`.
i It must be numeric or character.
Backtrace:
     x
  1. +-tib %>% tidyr:::unnest_wider(data)
  2. \-tidyr:::unnest_wider(., data)
  3.   \-tidyselect::vars_pull(tbl_vars(data), !!enquo(col))
  4.     \-tidyselect:::pull_as_location2(loc, n, vars)
  5.       +-tidyselect:::with_subscript_errors(...)
  6.       | +-base::tryCatch(...)
  7.       | | \-base:::tryCatchList(expr, classes, parentenv, handlers)
  8.       | |   \-base:::tryCatchOne(expr, names, parentenv, handlers[[1L]])
  9.       | |     \-base:::doTryCatch(return(expr), name, parentenv, handler)
 10.       | \-tidyselect:::instrument_base_errors(expr)
 11.       |   \-base::withCallingHandlers(...)
 12.       \-vctrs::vec_as_subscript2(i, arg = "var", logical = "error")
 13.         \-vctrs:::result_get(...)
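A possible reading of this error, offered as an assumption rather than a confirmed diagnosis: tibble(dataRAW) creates a single list-column named dataRAW, so there is no column called data, and the name data seems to resolve to the base R function data(), which would explain the wrong type `function`. The toy list sample_posts below is made up for illustration.

```r
library(tibble)
library(tidyr)

# Hypothetical miniature of the structure: a list of named lists.
sample_posts <- list(
  list(timestamp = 1, title = "first"),
  list(timestamp = 2, title = "second")
)

tib <- tibble(sample_posts)
names(tib)  # the column is named after the object, not "data"

# Referring to the column by its actual name lets unnest_wider() run.
wide <- unnest_wider(tib, sample_posts)
wide
```

Tibbles do need equally long columns, but a list-column can hold elements of different lengths in each row, which is why the nested lists fit into one column here.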



Option 3: rapply and lapply
Both code snippets work and preserve the data structure. But when I convert the result to a matrix for further processing, the data structure gets messed up. I suspect this is because the data is still nested one level deep.

#code line returns list nested one level deep
FBraw <- lapply(dataRAW, rapply, f = c)
str(FBraw)
List of 40
 $ : Named chr [1:7] "1611853326" "posts/media/ChronikFotos_QNGAWvS8aw/144245114_3813727445333297_3682316138130576479_n_3813727441999964.jpg" "1611853319" "1613542113" ...
  ..- attr(*, "names")= chr [1:7] "timestamp" "attachments.data.media.uri" "attachments.data.media.creation_timestamp" "attachments.data.media.media_metadata.photo_metadata.exif_data.taken_timestamp" ...
 $ : Named chr [1:7] "1611860575" "posts/media/ChronikFotos_QNGAWvS8aw/143276316_3813978641974844_3663341405860849380_n_3813978635308178.png" "1611860403" "1612935033" ...
  ..- attr(*, "names")= chr [1:7] "timestamp" "attachments.data.media.uri" "attachments.data.media.creation_timestamp" "attachments.data.media.media_metadata.photo_metadata.exif_data.taken_timestamp" ...
 $ : Named chr [1:7] "1612948020" "posts/media/ChronikFotos_QNGAWvS8aw/143732770_3813831571989551_5247994518213519901_n_3813831568656218.png" "1611856188" "1617631305" ...

#code snippet 2 
FBraw <- lapply(dataRAW, function(x) data.frame(t(rapply(x, function(x) x[1]))))
str(FBraw, head = 1)
List of 40
 $ :'data.frame':   1 obs. of  7 variables:
 $ :'data.frame':   1 obs. of  7 variables:
 $ :'data.frame':   1 obs. of  7 variables:



Sample Data

dataRAW <- list(
  list(
    timestamp = 1611853326, attachments = list(list(data = list(
      list(media = list(
        uri = "posts/media/ChronikFotos_QNGAWvS8aw/144245114_3813727445333297_3682316138130576479_n_3813727441999964.jpg",
        creation_timestamp = 1611853319, media_metadata = list(
          photo_metadata = list(exif_data = list(c(taken_timestamp = 1613542113)))
        ),
        title = "Chronik-Fotos", description = "Da haben wir den Salat! <U+0001F957> \nGemischt oder grün: Verfeinert mit Frieda’s Traum Salatsauce wird der einfachste Salat zum Gaumenschmaus.\n\nProbieren Sie auch unsere Gewürze, Bouillons und verschiedene Käse! \nHier finden Sie alle unsere würzigen Produkte:  www.friedas-traum.ch/\n\n<U+0001D46D><U+0001D493><U+0001D48A><U+0001D486><U+0001D485><U+0001D482>'<U+0001D494> <U+0001D47B><U+0001D493><U+0001D482><U+0001D496><U+0001D48E> – Saucen Bouillons Gewürze\nwww.friedas-traum.ch | shop@friedas.ch | Tel. 055 0"
      ))
    ))),
    data = list(post = 1)
  ),
  list(
    timestamp = 1611860575, attachments = list(list(data = list(
      list(media = list(
        uri = "posts/media/ChronikFotos_QNGAWvS8aw/143276316_3813978641974844_3663341405860849380_n_3813978635308178.png",
        creation_timestamp = 1611860403, media_metadata = list(
          photo_metadata = list(exif_data = list(c(taken_timestamp = 1612935033)))
        ),
        title = "Chronik-Fotos", description = "Früher über die Gasse – heute im Online- Shop: <U+0001D5D9><U+0001D5FF><U+0001D5F6><U+0001D5F2><U+0001D5F1><U+0001D5EE>’<U+0001D600> <U+0001D5E7><U+0001D5FF><U+0001D5EE><U+0001D602><U+0001D5FA> Produkte. \n\nWas im Restaurant Löwen in Spreitenbach begann, geht heute online weiter: Sie erhalten 100% Geschmack!\n\nEinfach bestellen im Shop: www.friedas-traum.ch/\n\n<U+0001D46D><U+0001D493><U+0001D48A><U+0001D486><U+0001D485><U+0001D482>’<U+0001D494> <U+0001D47B><U+0001D493><U+0001D482><U+0001D496><U+0001D48E> – Saucen, Bouillons, Gewürze\nshop@friedas.ch  | Tel. +41 (0) 55 0"
      ))
    ))),
    data = list(c(post = "Früher über die Gasse – heute im Online- Shop: <U+0001D5D9><U+0001D5FF><U+0001D5F6><U+0001D5F2><U+0001D5F1><U+0001D5EE>’<U+0001D600> <U+0001D5E7><U+0001D5FF><U+0001D5EE><U+0001D602><U+0001D5FA> Produkte. \n\nWas im Restaurant Löwen in Spreitenbach begann, geht heute online weiter: Sie erhalten 100% Geschmack!\n\nEinfach bestellen im Shop: www.friedas-traum.ch/\n\n<U+0001D46D><U+0001D493><U+0001D48A><U+0001D486><U+0001D485><U+0001D482>’<U+0001D494> <U+0001D47B><U+0001D493><U+0001D482><U+0001D496><U+0001D48E> – Saucen, Bouillons, Gewürze\nshop@friedas.ch  | Tel. +41 (0) 55 0"))
  ),
  list(
    timestamp = 1612948020, attachments = list(list(data = list(
      list(media = list(
        uri = "posts/media/ChronikFotos_QNGAWvS8aw/143732770_3813831571989551_5247994518213519901_n_3813831568656218.png",
        creation_timestamp = 1611856188, media_metadata = list(
          photo_metadata = list(exif_data = list(c(taken_timestamp = 1617631305)))
        ),
        title = "Chronik-Fotos", description = "<U+0001D5E1><U+0001D5EE><U+0001D5F0><U+0001D5F5> <U+0001D5EE><U+0001D5F9><U+0001D601><U+0001D5F2><U+0001D5FA> <U+0001D5E5><U+0001D5F2><U+0001D607><U+0001D5F2><U+0001D5FD><U+0001D601> von Hand gemischt und abgefüllt: Frieda’s Salatsaucen sind beliebt wie eh und je. <U+0001F44C>\n\nFrüher der Renner im Restaurant Löwen in Spreitenbach, heute: DER Hit zum Bestellen für Sie zu Hause.\n\nProbieren Sie auch unsere Bouillons, Gewürze und unseren Käse! \n\nHier geht’s zum Shop:  www.friedas-traum.ch/\n\n<U+0001D46D><U+0001D493><U+0001D48A><U+0001D486><U+0001D485><U+0001D482>’<U+0001D494> <U+0001D47B><U+0001D493><U+0001D482><U+0001D496><U+0001D48E>® – Saucen Bouillons Gewürze\nshop@friedas.ch | Tel. 055 0"
      ))
    ))),
    data = list(c(post = "<U+0001D5E1><U+0001D5EE><U+0001D5F0><U+0001D5F5> <U+0001D5EE><U+0001D5F9><U+0001D601><U+0001D5F2><U+0001D5FA> <U+0001D5E5><U+0001D5F2><U+0001D607><U+0001D5F2><U+0001D5FD><U+0001D601> von Hand gemischt und abgefüllt: Frieda’s Salatsaucen sind beliebt wie eh und je. <U+0001F44C>\n\nFrüher der Renner im Restaurant Löwen in Spreitenbach, heute: DER Hit zum Bestellen für Sie zu Hause.\n\nProbieren Sie auch unsere Bouillons, Gewürze und unseren Käse! \n\nHier geht’s zum Shop:  www.friedas-traum.ch/\n\n<U+0001D46D><U+0001D493><U+0001D48A><U+0001D486><U+0001D485><U+0001D482>’<U+0001D494> <U+0001D47B><U+0001D493><U+0001D482><U+0001D496><U+0001D48E>® – Saucen Bouillons Gewürze\nshop@friedas.ch | Tel. 055 0"))
  )
)
  

Any ideas and suggestions appreciated. Thanks.

danlooo
Simone
  • @G.Grothendieck thought that the data sample could be easily copied and pasted to create a reproducible example. I knew of dput(x), but it prints out all the data, which a) is confusing and b) creates privacy trouble, since I am working with personal data. Will add a data sample in the form of a file. I have not found a way to [dput the first 3 lists](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/dput) only. – Simone Nov 04 '21 at 14:07
  • or `library(rrapply) rrapply(X, f = head, n = 6, dfaslist = FALSE)` to preserve list structure, see [here](https://stackoverflow.com/questions/64794738/dput-a-long-list-shorten-list-but-preserve-structure) – user63230 Nov 04 '21 at 14:18
  • @G.Grothendieck thanks for the suggestion. It prints out half the data... ;-) Hope the sample data download works? – Simone Nov 04 '21 at 14:22
  • @G.Grothendieck did a manual selection of dput() output. – Simone Nov 04 '21 at 14:38
  • @user63230 thanks for sharing. The code outputs all data points. It also gives a warning message ```In rrapply(dataRAW, f = head, n = 3, dfaslist = FALSE) : 'dfaslist' is deprecated, use classes = 'data.frame' instead ```. Anyway, the data sample question is fixed now. – Simone Nov 04 '21 at 14:41
  • @Simone I revised my answer to use your example data instead of mine. – danlooo Nov 04 '21 at 15:31

1 Answer


You have a list of elements that share the same properties (e.g. timestamp and attachments). Since these properties are of different types, you can use a data frame instead of a matrix by enframing the list. Note that the element data is a number in the first entry and a character in the others. To put these elements into one column, we need to enforce a common class using as.character, which converts the numbers to character strings.

library(tidyverse)

dataRAW <- list(
  list(
    timestamp = 1611853326, attachments = list(list(data = list(
      list(media = list(
        uri = "posts/media/ChronikFotos_QNGAWvS8aw/144245114_3813727445333297_3682316138130576479_n_3813727441999964.jpg",
        creation_timestamp = 1611853319, media_metadata = list(
          photo_metadata = list(exif_data = list(c(taken_timestamp = 1613542113)))
        ),
        title = "Chronik-Fotos", description = "Da haben wir den Salat! <U+0001F957> \nGemischt oder grün: Verfeinert mit Frieda’s Traum Salatsauce wird der einfachste Salat zum Gaumenschmaus.\n\nProbieren Sie auch unsere Gewürze, Bouillons und verschiedene Käse! \nHier finden Sie alle unsere würzigen Produkte:  www.friedas-traum.ch/\n\n<U+0001D46D><U+0001D493><U+0001D48A><U+0001D486><U+0001D485><U+0001D482>'<U+0001D494> <U+0001D47B><U+0001D493><U+0001D482><U+0001D496><U+0001D48E> – Saucen Bouillons Gewürze\nwww.friedas-traum.ch | shop@friedas.ch | Tel. 055 0"
      ))
    ))),
    data = list(post = 1)
  ),
  list(
    timestamp = 1611860575, attachments = list(list(data = list(
      list(media = list(
        uri = "posts/media/ChronikFotos_QNGAWvS8aw/143276316_3813978641974844_3663341405860849380_n_3813978635308178.png",
        creation_timestamp = 1611860403, media_metadata = list(
          photo_metadata = list(exif_data = list(c(taken_timestamp = 1612935033)))
        ),
        title = "Chronik-Fotos", description = "Früher über die Gasse – heute im Online- Shop: <U+0001D5D9><U+0001D5FF><U+0001D5F6><U+0001D5F2><U+0001D5F1><U+0001D5EE>’<U+0001D600> <U+0001D5E7><U+0001D5FF><U+0001D5EE><U+0001D602><U+0001D5FA> Produkte. \n\nWas im Restaurant Löwen in Spreitenbach begann, geht heute online weiter: Sie erhalten 100% Geschmack!\n\nEinfach bestellen im Shop: www.friedas-traum.ch/\n\n<U+0001D46D><U+0001D493><U+0001D48A><U+0001D486><U+0001D485><U+0001D482>’<U+0001D494> <U+0001D47B><U+0001D493><U+0001D482><U+0001D496><U+0001D48E> – Saucen, Bouillons, Gewürze\nshop@friedas.ch  | Tel. +41 (0) 55 0"
      ))
    ))),
    data = list(c(post = "Früher über die Gasse – heute im Online- Shop: <U+0001D5D9><U+0001D5FF><U+0001D5F6><U+0001D5F2><U+0001D5F1><U+0001D5EE>’<U+0001D600> <U+0001D5E7><U+0001D5FF><U+0001D5EE><U+0001D602><U+0001D5FA> Produkte. \n\nWas im Restaurant Löwen in Spreitenbach begann, geht heute online weiter: Sie erhalten 100% Geschmack!\n\nEinfach bestellen im Shop: www.friedas-traum.ch/\n\n<U+0001D46D><U+0001D493><U+0001D48A><U+0001D486><U+0001D485><U+0001D482>’<U+0001D494> <U+0001D47B><U+0001D493><U+0001D482><U+0001D496><U+0001D48E> – Saucen, Bouillons, Gewürze\nshop@friedas.ch  | Tel. +41 (0) 55 0"))
  ),
  list(
    timestamp = 1612948020, attachments = list(list(data = list(
      list(media = list(
        uri = "posts/media/ChronikFotos_QNGAWvS8aw/143732770_3813831571989551_5247994518213519901_n_3813831568656218.png",
        creation_timestamp = 1611856188, media_metadata = list(
          photo_metadata = list(exif_data = list(c(taken_timestamp = 1617631305)))
        ),
        title = "Chronik-Fotos", description = "<U+0001D5E1><U+0001D5EE><U+0001D5F0><U+0001D5F5> <U+0001D5EE><U+0001D5F9><U+0001D601><U+0001D5F2><U+0001D5FA> <U+0001D5E5><U+0001D5F2><U+0001D607><U+0001D5F2><U+0001D5FD><U+0001D601> von Hand gemischt und abgefüllt: Frieda’s Salatsaucen sind beliebt wie eh und je. <U+0001F44C>\n\nFrüher der Renner im Restaurant Löwen in Spreitenbach, heute: DER Hit zum Bestellen für Sie zu Hause.\n\nProbieren Sie auch unsere Bouillons, Gewürze und unseren Käse! \n\nHier geht’s zum Shop:  www.friedas-traum.ch/\n\n<U+0001D46D><U+0001D493><U+0001D48A><U+0001D486><U+0001D485><U+0001D482>’<U+0001D494> <U+0001D47B><U+0001D493><U+0001D482><U+0001D496><U+0001D48E>® – Saucen Bouillons Gewürze\nshop@friedas.ch | Tel. 055 0"
      ))
    ))),
    data = list(c(post = "<U+0001D5E1><U+0001D5EE><U+0001D5F0><U+0001D5F5> <U+0001D5EE><U+0001D5F9><U+0001D601><U+0001D5F2><U+0001D5FA> <U+0001D5E5><U+0001D5F2><U+0001D607><U+0001D5F2><U+0001D5FD><U+0001D601> von Hand gemischt und abgefüllt: Frieda’s Salatsaucen sind beliebt wie eh und je. <U+0001F44C>\n\nFrüher der Renner im Restaurant Löwen in Spreitenbach, heute: DER Hit zum Bestellen für Sie zu Hause.\n\nProbieren Sie auch unsere Bouillons, Gewürze und unseren Käse! \n\nHier geht’s zum Shop:  www.friedas-traum.ch/\n\n<U+0001D46D><U+0001D493><U+0001D48A><U+0001D486><U+0001D485><U+0001D482>’<U+0001D494> <U+0001D47B><U+0001D493><U+0001D482><U+0001D496><U+0001D48E>® – Saucen Bouillons Gewürze\nshop@friedas.ch | Tel. 055 0"))
  )
)

dataRAW %>%
  enframe()
#> # A tibble: 3 × 2
#>    name value           
#>   <int> <list>          
#> 1     1 <named list [3]>
#> 2     2 <named list [3]>
#> 3     3 <named list [3]>

dataRAW %>%
  enframe() %>%
  unnest_wider(value)
#> # A tibble: 3 × 4
#>    name  timestamp attachments data            
#>   <int>      <dbl> <list>      <list>          
#> 1     1 1611853326 <list [1]>  <named list [1]>
#> 2     2 1611860575 <list [1]>  <list [1]>      
#> 3     3 1612948020 <list [1]>  <list [1]>

dataRAW %>%
  enframe() %>%
  unnest_wider(value) %>%
  # flatten list with only one element
  unnest(data) %>%
  # Enforce data to have the same type
  mutate(data = data %>% as.character()) %>%
  unnest(data) %>%
  unnest(attachments) %>%
  unnest(attachments) %>%
  unnest(attachments) %>%
  unnest(attachments) %>%
  unnest_wider(attachments) %>%
  select(name, timestamp, creation_timestamp, title, data)
#> # A tibble: 3 × 5
#>    name  timestamp creation_timestamp title         data                        
#>   <int>      <dbl>              <dbl> <chr>         <chr>                       
#> 1     1 1611853326         1611853319 Chronik-Fotos "1"                         
#> 2     2 1611860575         1611860403 Chronik-Fotos "Früher über die Gasse – he…
#> 3     3 1612948020         1611856188 Chronik-Fotos "<U+0001D5E1><U+0001D5EE><U…

Created on 2021-11-04 by the reprex package (v2.0.1)

danlooo
  • Thanks let me try this out – Simone Nov 04 '21 at 15:59
  • The code works until line 4, i.e. the second ```unnest(value)```. Then it gives me an error Error: Can't combine `..1$data` and `..20$data` . Backtrace: 1. `%>%`(...) 3. tidyr:::unnest.data.frame(., data) 4. tidyr::unchop(data, any_of(cols), keep_empty = keep_empty, ptype = ptype) 5. tidyr:::df_unchop_info(cols, ptype) 6. vctrs::vec_unchop(pieces, ptype = col_ptype) 8. vctrs::vec_default_ptype2(...) 9. vctrs::stop_incompatible_type(...) 10. vctrs:::stop_incompatible(...) 11. vctrs:::stop_vctrs(...) – Simone Nov 04 '21 at 16:50
  • @Simone I edited the example data and my answer to fix the problem – danlooo Nov 04 '21 at 20:51
  • it throws an error at line 11 ```unnest_wider(attachments)```saying ```Error: Can't combine `..1$uri` and `..30$uri` .``` **Backtrace**: 1. `%>%`(...) 2. tidyr::unnest_wider(., attachments, names_repair = "unique") 3. tidyr::unchop(data, any_of(col), keep_empty = TRUE) 4. tidyr:::df_unchop_info(cols, ptype) 5. vctrs::vec_unchop(pieces, ptype = col_ptype) 7. vctrs::vec_default_ptype2(...) 8. vctrs::stop_incompatible_type(...) 9. vctrs:::stop_incompatible(...) 10. vctrs:::stop_vctrs(...) – Simone Nov 08 '21 at 09:40
  • You need to unify the types of `attachments` in the same way it is done for `data`. Do you have always one `data` element and sometimes many `attachments` per post?, Then create multiple rows instead of multiple columns using `unnest(attachments)` instead of `unnest_wider(attachments)` – danlooo Nov 08 '21 at 09:47
  • tried it out with the data sample I provided on SO instead of data I have. The code ran through but it did not flatten the data? ```dataRAW$name``` or ```dataRAW[ ,"name"]``` return NULL and error message respectively but this is [how to access elements in tibbles](https://tibble.tidyverse.org/reference/subsetting.html) – Simone Nov 08 '21 at 09:57
  • The data structure changes in at least 2 ways: **A)** the top-level list has no sublist *attachments*. **B)** the sub^4-list *media* in the sublist *attachments* has sometimes a sublist *thumbnail*. Sometimes not. ...I am starting to think that maybe it is easiest to just retrieve the posts? Saw the [tidyjson package](https://cran.r-project.org/web/packages/tidyjson/vignettes/introduction-to-tidyjson.html) – Simone Nov 08 '21 at 10:05
  • There is a reason why Facebook gives you the JSON: there are elements of different types with different properties (e.g. a post with the properties data and timestamp, and another type called attachment with properties like thumbnail). A table is meant to show elements of one specific type. Do you want to have one table of attachments and another table of post_contents? – danlooo Nov 08 '21 at 10:09
  • haha - thought about that today and last week too. Yes Facebook *has* to give me my data according to GDPR, but they sure make it hard to work with it. As a minimum requirement I need the posts text. Working also with JSON files from other Facebook accounts (ppl give me their own downloaded data) with yet a different data structure. :-/ Though the *list names* remain the same – Simone Nov 08 '21 at 10:16
  • The alternative is an HTML file from which I need to collect all the appropriate fields. Not sure if that is better? No experience with reading in HTML in R. – Simone Nov 08 '21 at 10:19
  • HTML is internally a tree structure as well. It follows from the definition of a table that we should not use one table to describe objects of different classes. – danlooo Nov 08 '21 at 10:45
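Following up on the minimum requirement named in the comments (the post text), here is a rough sketch that pulls only timestamp and post from each entry and tolerates missing elements. The toy posts and the helper extract_post() are hypothetical, and it assumes the list names timestamp, data and post stay the same across exports, as stated above.

```r
# Hypothetical toy posts mirroring the variations described in the comments.
posts <- list(
  list(timestamp = 1611853326, data = list(post = 1)),
  list(timestamp = 1611860575, attachments = list(),
       data = list(c(post = "Text of post 2"))),
  list(timestamp = 1612948020)
)

# Return a one-row data frame per post; absent fields become NA.
extract_post <- function(x) {
  txt <- unlist(x$data)  # NULL when the post has no data element
  has_post <- !is.null(txt) && "post" %in% names(txt)
  data.frame(
    timestamp = if (is.null(x$timestamp)) NA_real_ else as.numeric(x$timestamp),
    post = if (has_post) as.character(txt[["post"]]) else NA_character_,
    stringsAsFactors = FALSE
  )
}

post_table <- do.call(rbind, lapply(posts, extract_post))
post_table
```

Attachments are ignored on purpose here; in line with the discussion above, they could go into a separate table keyed by timestamp.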