1

I currently have the following problem: I extracted some data via the crunchbase API, resulting in a big nested list of the following structure (there are many more nested lists on several instances included, I here only display the part of the structure currently relevant for me):

> str(x[[1]])
$ uuid         : chr "5f9957b0841251e6e439d757XXXXXX"
$ relationships: List of 27
..$ websites: List of 3
.. ..$ cardinality: chr "OneToMany"
.. ..$ items      :'data.frame':    4 obs. of  7 variables:
.. .. ..$ properties.website_type: chr [1:4] "homepage" "facebook" "twitter" "linkedin"
.. .. ..$ properties.url         : chr [1:4] "http://www.example.com" "https://www.facebook.com/example" "http://twitter.com/example" "http://www.linkedin.com/company/example"

Consider the following minimal example:

x <- list()
x[[1]] <- list(uuid = "123", 
           relationships = list(websites = list(items =  list(
                                                properties.website_type = c("homepage", "facebook", "twitter", "linkedin"), 
                                                properties.url = c("www.example1.com", "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com") ) )  ) )
x[[2]] <- list(uuid = "987", 
           relationships = list(websites = list(items =  list(
             properties.website_type = c("homepage", "facebook", "twitter" ), 
             properties.url = c("www.example2.com", "www.fbex2.com", "www.twitterex2.com") ) )  ) )

Now, I would like to create a dataframe with the following column structure:

> x.df
uuid          web.url  web.facebook        web.twitter        web.linkedin
1  123 www.example1.com www.fbex1.com www.twitterex1.com www.linkedinex1.com
2  987 www.example2.com www.fbex2.com www.twitterex2.com                <NA>

Meaning: I would like to have every uuid (a unique firm identifier) in a single column, followed by the urls of the different platforms (fb, twitter...). I tried a lot of different things with a combination of lapply(), spread(), and row_bind(), yet didn't manage to make anything work. Any help on that would be appreciated.

dpel
  • 1,954
  • 1
  • 21
  • 31

2 Answers2

1

dplyr approach could be

library(dplyr)
library(tidyr)

#convert list to dataframe in long format
df <- do.call(rbind, lapply(x, data.frame, stringsAsFactors = FALSE))

#final result
df1 <- df %>%
  spread(relationships.websites.items.properties.website_type, relationships.websites.items.properties.url)

which gives

  uuid      facebook         homepage            linkedin            twitter
1  123 www.fbex1.com www.example1.com www.linkedinex1.com www.twitterex1.com
2  987 www.fbex2.com www.example2.com                <NA> www.twitterex2.com


Sample data:

x <- list(structure(list(uuid = "123", relationships = structure(list(
    websites = structure(list(items = structure(list(properties.website_type = c("homepage", 
    "facebook", "twitter", "linkedin"), properties.url = c("www.example1.com", 
    "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com"
    )), .Names = c("properties.website_type", "properties.url"
    ))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
"relationships")), structure(list(uuid = "987", relationships = structure(list(
    websites = structure(list(items = structure(list(properties.website_type = c("homepage", 
    "facebook", "twitter"), properties.url = c("www.example2.com", 
    "www.fbex2.com", "www.twitterex2.com")), .Names = c("properties.website_type", 
    "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
"relationships")))


Update: In order to fix below error

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0

you would need to remove corrupted elements from input data where website_type has one value but properties.url has NULL. Run this chunk of code as a pre-processing step before executing the main solution:

idx <- which(sapply(x, function(k) is.null(k$relationships$websites$items$properties.url)))
x <- x[-idx]

Sample data to test this pre-processing step:

x <- list(structure(list(uuid = "123", relationships = structure(list(
    websites = structure(list(items = structure(list(properties.website_type = c("homepage", 
    "facebook", "twitter", "linkedin"), properties.url = c("www.example1.com", 
    "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com"
    )), .Names = c("properties.website_type", "properties.url"
    ))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
"relationships")), structure(list(uuid = "987", relationships = structure(list(
    websites = structure(list(items = structure(list(properties.website_type = "homepage", 
        properties.url = NULL), .Names = c("properties.website_type", 
    "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
"relationships")), structure(list(uuid = "345", relationships = structure(list(
    websites = structure(list(items = structure(list(properties.website_type = "homepage", 
        properties.url = NULL), .Names = c("properties.website_type", 
    "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
"relationships")))
Prem
  • 11,775
  • 1
  • 19
  • 33
  • Great, that generally seems to be what I need. Runs perfectly with the example. However, when I try it with my full dataset, I always get an error message: "Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0" Any idea what the problem could be? – Daniel S. Hain Jun 26 '18 at 15:06
  • Probably you have an element in your sample data wherein number of values in `$relationships$websites$items$properties.website_type` & `$relationships$websites$items$properties.url` is not matching. Because of this `data.frame` is throwing this error. So first you need to think on how do you want to handle such cases i.e. website_type is there but url is missing. – Prem Jun 26 '18 at 17:34
  • Indeed, on that, you are probably right! I didn\t consider that case. In case the url is missing, it in the optimal case should be an NA. – Daniel S. Hain Jun 26 '18 at 20:54
  • I think you are missing a point here. Consider this example and let me know the desired output - `x <- structure(list(uuid = "123", relationships = structure(list(websites = structure(list( items = structure(list(properties.website_type = c("homepage", "facebook", "twitter", "linkedin"), properties.url = c("www.example1.com", "www.fbex1.com", "www.linkedinex1.com")), .Names = c("properties.website_type", "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", "relationships"))` Here twitter has no url in this example and gives the same error. – Prem Jun 27 '18 at 07:06
  • Hello again. Sorry, was in transit for some time and now catching up. I think I understand the problem correctly. To keep it simple, I actually only want to consider the website, facebook, twitter and linkedin, even though there might be actually more types of URLs in that part of the list. The desired output structure then would be exactly like the one I posted in the original question. About what happens when the number of values in url and website.type not match, I am not emotional. I think these cases are very rare, and could also be deleted alltogether. – Daniel S. Hain Jun 29 '18 at 00:47
  • You just answered your question :) For the error message you mentioned in your first comment I would suggest to check if any list element in your data has one `properties.website_type` but no `properties.url` and then delete that record and you should be all set. – Prem Jun 29 '18 at 17:49
  • That's correct. Do you maybe have a hint how to do that? My original data is rather large, so not possible to check by hand. – Daniel S. Hain Jun 30 '18 at 06:35
  • try `which(sapply(x, function(k) is.null(k$relationships$websites$items$properties.url)))` to get the list index having corrupted data and then remove those indexes. – Prem Jun 30 '18 at 17:52
  • Thanks a lot. However, for the structure of x you provided earlier it throws an error "Error in k$relationships : $ operator is invalid for atomic vectors" – Daniel S. Hain Jul 03 '18 at 03:03
  • See the updated answer for more detail. BTW the structure given in my earlier comments is slightly different than the real data what you have. I had just provided that to convey my point. – Prem Jul 03 '18 at 06:15
0

I know this is a clunkier solution, but it helped me seeing the process step by step (running str (x_df) to see each result):

library(tidyverse)

# Using your example
x <- list()
x[[1]] <- list(uuid = "123",
                    relationships = list(websites = list(items =  list(
                        properties.website_type = c("homepage", "facebook", "twitter", "linkedin"),
                        properties.url = c("www.example1.com", "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com") ) )  ) )
x[[2]] <- list(uuid = "987",
                    relationships = list(websites = list(items =  list(
                        properties.website_type = c("homepage", "facebook", "twitter" ),
                        properties.url = c("www.example2.com", "www.fbex2.com", "www.twitterex2.com") ) )  ) )

 

# --- Iterations of unnest:
x_df <- x %>% tibble::as_tibble_col( .) %>%  
    tidyr::unnest_wider(col = "value")  %>% 
    tidyr::unnest_longer(col = "relationships")   %>%  
    tidyr::unnest_wider(col = "relationships")  %>%  
    tidyr::unnest_wider(col =  "items")  %>%  
    tidyr::unnest_longer(col = c("properties.website_type", "properties.url")) %>% 
# --- Lastly, group by id: 
    group_by(uuid) %>% 
    tidyr::pivot_wider(data = ., 
                             names_from = properties.website_type, 
                             values_from = c("properties.url"))
Lulliter
  • 87
  • 1
  • 7