New columns using str_extract_all in a tidyverse way

Question

Problem:

I want to create a data frame with strings matched from a long string column. In the data frame "d" the genre column is a string column with different genres. The problem appears because each row hasn't the same numbers of items matched.

library(tidyverse)

d <- structure(list(genres = c("[{'id': 35, 'name': 'Comedy'}]", "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 10749, 'name': 'Romance'}]", 
                               "[{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}, {'id': 10751, 'name': 'Family'}]"
), budget = c(1.4e+07, 4e+07, 8e+06)), row.names = c(NA, -3L), class = c("tbl_df", 
                                                                         "tbl", "data.frame"))
d
#> # A tibble: 3 x 2
#>   genres                                                             budget
#>   <chr>                                                               <dbl>
#> 1 [{'id': 35, 'name': 'Comedy'}]                                     1.40e7
#> 2 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id…  4.00e7
#> 3 [{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'…  8.00e6

My inelegant way

I have found a workflow to work around with this which is a little inelegant. First, extract all the matches with str_extract_all with simplify = T which returns a data frame. Then create a vector string with the column names, assign to the extracted data frame and finally use bind_cols:


foo <- str_extract_all(d$genres, '(?<=\'name\':\\s\')([^\']*)', simplify = T)
colnames(foo) <- paste0("genre_", 1:ncol(foo), "_extract")
foo <- foo %>% as_tibble()
foo_final <- bind_cols(d, foo)

foo_final 
#> # A tibble: 3 x 6
#>   genres budget genre_1_extract genre_2_extract genre_3_extract
#>   <chr>   <dbl> <chr>           <chr>           <chr>          
#> 1 [{'id… 1.40e7 Comedy          ""              ""             
#> 2 [{'id… 4.00e7 Comedy          Drama           Family         
#> 3 [{'id… 8.00e6 Animation       Adventure       Family         
#> # … with 1 more variable: genre_4_extract <chr>

^{Created on 2019-03-10 by the reprex package (v0.2.1)}

I would like to know if there is a way to make it in a tidyverse way within pipe operators, mutate, or map_df... I am sure there is a better way of doing this.

Why don't you process them like JSON by using `jsonlite` package, Working with key-values pairs is much easier that way — Sonny, Mar 10 '19 at 14:41
It's a general data frame I have more columns without json pattern but with the same problem. — Tito Sanz, Mar 10 '19 at 14:44
@TitoSanz Can you include columns "without json pattern but with the same problem" in your sample dataset? — Z.Lin, Mar 10 '19 at 15:13

score 2 · Accepted Answer · answered Mar 11 '19 at 05:30

Answer to the original question

I'm not sure this is any more elegant than your solution, but it's pure pipes:

# Use `str_count` to determine the maximum number of genres associated with any
# single row, and create one column for each genre.  Stored here for
# convenience, since it's needed twice below.
genre.col.names = paste("genre",
                        1:(max(str_count(d$genres, "\\}, \\{")) + 1),
                        sep = "_")
# Use `separate` to split the `genres` column into one column for each genre.
# Then use `mutate` and some regular expressions to populate each column with
# just the genre name and no other material (quotes, brackets, etc.).
d %>%
  separate(genres,
           into = genre.col.names,
           sep = "\\}, \\{") %>%
  mutate_(.dots = setNames(paste("gsub(\".*'name': '([A-Za-z]+)'.*\", \"\\\\1\", ",
                                 genre.col.names,
                                 ")",
                                 sep = ""),
                           genre.col.names))

Putting the data into long format

It seems to me that a tidier format would be long: one row per movie per genre. (I'm assuming each record is a movie...?) If that format works for you, here's a way to get it (and I'm sure there are even simpler ways):

library(fuzzyjoin)
# Get all the genres appearing in any record and put them in a dataframe.
all.genres = data.frame(
  genre.name = unique(unlist(str_extract_all(d$genres, '(?<=\'name\':\\s\')([^\']*)')))
)
# Left join the main data frame to the genre list using a regular expression.
d %>%
  rownames_to_column() %>%
  regex_left_join(all.genres, by = c("genres" = "genre.name")) %>%
  select(rowname, genre.name)

Flag columns

If you really need the data in wide format, how about one column for each possible genre, and the value is TRUE/FALSE? This gets unwieldy if there are lots of possible genres, of course. But the advantage is that information about whether this movie is a comedy, for example, is always in the same column. (With genre_1, genre_2, etc., you have to check every column to find out whether the movie is a comedy or not.) Here's a way to get that:

# Using the list of unique genres that we created earlier, add one column for
# each genre that contains a flag for whether the record is an example of that
# genre.
d %>%
  mutate_(.dots = setNames(paste("grepl(\"", all.genres$genre.name, "\", genres)",
                                 sep = ""),
                           all.genres$genre.name))

Thanks for the answer! It is very instructive, I have learned a lot. Would you mind to detail a little more the last option FLAG COLUMNS? This seems the best way to deal with posterior analysis. But I am not able to get how it exactly works. Thanks in advance once again!!! — Tito Sanz, Mar 11 '19 at 18:10
The idea is to create one column for each genre, and it will flag all the records in that genre. So the column `Comedy` would contain `TRUE` for comedies and `FALSE` for all other records. I like it because it makes it easier to find a specific genre (e.g., average budget for comedies would be `mean(d$budget[d$Comedy])`. But the table gets really wide if the full dataset has lots of genres. It's also not easy to create columns with dynamic names in `dplyr`, so we have to do something like [this](https://stackoverflow.com/a/32762285/10898875), and the code isn't the prettiest. — A. S. K., Mar 11 '19 at 20:31