Ok, so here is my scenario: I have a dataset with a column composed of lists of words (keyword tags for YT videos, where each row is video data).

What I want is a complete count of every unique keyword across all of these lists, for the entire column. So basically what I want in the end is a table with two fields: keyword, count.

If I just do a simple dplyr query, then it counts the list itself as a unique object. While this is also interesting, this is not what I want.

This is the dplyr query mentioned above. I'd like to build on it, but I'm not sure how to count the individual items inside each list rather than the lists themselves:

vid_tag_freq = df %>%
  count(tags)

To further clarify:

With a dataset like:

     Tags
1    ['Dog', 'Cat', 'Mouse', 'Fish']
2    ['Cat', 'Fish']
3    ['Cat', 'Fish']

I am now getting: 

    Tags                                Count
1   ['Dog', 'Cat', 'Mouse', 'Fish']     1
2   ['Cat', 'Fish']                     2

What I actually want:

    Tags              Count
1   'Cat'             3
2   'Fish'            3
3   'Dog'             1
4   'Mouse'           1

I hope that explains it lol

EDIT: This is what my data looks like, guess most are lists of lists? Maybe I should clean up [0]s as null?

 [1] "[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]"
 [2] "[0]"
 [3] "[0]"
 [4] "[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]"
 [5] "[0]"
 [6] "[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]"
 [7] "[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]"
 [8] "[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]"
 [9] "[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]"
[10] "[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]"
[11] "[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]"
[12] "[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]"
[13] "[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]"
[14] "[['Blue Jays', 'Ashes', 'friends']]"
[15] "[['Robin Williams (Celebrity)', 'Peter Gzowski']]"

  • The data in your edit looks like a flat character vector. Are you sure that's a list? If you can dput() *some* of the data, we can confirm exactly what the data look like. – Jordan Oct 08 '22 at 15:24

2 Answers

It would help if you could dput() some of the data for a working example. Going off the idea that you have a list column, here are a couple of general solutions you may be able to work with:

df <- tibble::tibble(
  x = replicate(10, sample(state.name, sample(5:10, 1), TRUE), simplify = FALSE)
)

df
#> # A tibble: 10 × 1
#>    x         
#>    <list>    
#>  1 <chr [7]> 
#>  2 <chr [7]> 
#>  3 <chr [8]> 
#>  4 <chr [6]> 
#>  5 <chr [8]> 
#>  6 <chr [8]> 
#>  7 <chr [8]> 
#>  8 <chr [6]> 
#>  9 <chr [5]> 
#> 10 <chr [10]>

# dplyr in a dataframe
df |> 
  tidyr::unnest(x) |> 
  dplyr::count(x)
#> # A tibble: 36 × 2
#>    x               n
#>    <chr>       <int>
#>  1 Alabama         1
#>  2 Alaska          1
#>  3 Arkansas        4
#>  4 California      3
#>  5 Colorado        5
#>  6 Connecticut     1
#>  7 Delaware        3
#>  8 Florida         1
#>  9 Georgia         3
#> 10 Hawaii          2
#> # … with 26 more rows

# vctrs
vctrs::vec_count(unlist(df$x))
#>               key count
#> 1        Colorado     5
#> 2       Louisiana     5
#> 3    North Dakota     4
#> 4     Mississippi     4
#> 5        Arkansas     4
#> 6        Delaware     3
#> 7         Vermont     3
#> 8       Minnesota     3
#> 9            Utah     3
#> 10     California     3
#> 11        Georgia     3
#> 12        Indiana     2
#> 13       Missouri     2
#> 14  New Hampshire     2
#> 15       Maryland     2
#> 16       Nebraska     2
#> 17         Hawaii     2
#> 18     New Jersey     2
#> 19       Oklahoma     2
#> 20  Massachusetts     1
#> 21       Illinois     1
#> 22          Texas     1
#> 23    Connecticut     1
#> 24   Rhode Island     1
#> 25       Michigan     1
#> 26       New York     1
#> 27           Ohio     1
#> 28         Nevada     1
#> 29        Florida     1
#> 30        Montana     1
#> 31      Wisconsin     1
#> 32        Alabama     1
#> 33         Alaska     1
#> 34 North Carolina     1
#> 35     Washington     1
#> 36         Kansas     1

Created on 2022-10-07 with reprex v2.0.2
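
Both approaches above assume a genuine list column. If you are not sure what you have, a quick check along these lines may help. This is only a sketch: `real_list` and `fake_list` are made-up toy objects, not your data.

```r
# hypothetical toy data, for illustration only
real_list <- list(c("Cat", "Fish"), c("Dog", "Cat"))   # an actual R list
fake_list <- c("['Cat', 'Fish']", "['Dog', 'Cat']")    # strings that merely look like lists

# a list column reports as a list; a column of strings reports as character
is.list(real_list)
is.character(fake_list)

# lengths() gives the number of tags per row for a real list,
# but always 1 per row for a character vector
real_lengths <- lengths(real_list)
fake_lengths <- lengths(fake_list)
real_lengths
fake_lengths
```

If every row comes back with length 1 and the column is character, unnest() has nothing to unpack, which would explain getting counts per row rather than per tag.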

Edit

If your list is actually a character vector, you'll need to do some string parsing.

# "list" but are actually strings
x <- c(
  "[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]",
  "[0]",
  "[0]",
  "[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]",
  "[0]",
  "[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]",
  "[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]",
  "[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]",
  "[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]",
  "[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]",
  "[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]",
  "[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]",
  "[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]",
  "[['Blue Jays', 'Ashes', 'friends']]",
  "[['Robin Williams (Celebrity)', 'Peter Gzowski']]"
)

# assign to a data.frame
df <- data.frame(x = x)


df |> 
  dplyr::mutate(
    # remove square brackets at beginning or end
    x = gsub("^\\[{1,2}|\\]{1,2}$", "", x),
    # separate the strings into an actual list
    x = strsplit(x, "',\\s|,\\s'")
  ) |> 
  # unnest the list column so the elements appear as individual rows
  tidyr::unnest(x) |> 
  # some extra cleaning to strip out the leading/trailing '
  dplyr::mutate(x = gsub("^'|'$", "", x)) |> 
  # count the individual elements
  dplyr::count(x, sort = TRUE)
#> # A tibble: 47 × 2
#>    x                                                  n
#>    <chr>                                          <int>
#>  1 0                                                  3
#>  2 CBC Television (TV Network)                        3
#>  3 Canadian Broadcasting Corporation (TV Network)     2
#>  4 Documentary (TV Genre)                             2
#>  5 Rob Ford (Politician)                              2
#>  6 the fifth estate                                   2
#>  7 Amanda Todd                                        1
#>  8 Ashes                                              1
#>  9 B.C.                                               1
#> 10 Bill Blair                                         1
#> # … with 37 more rows



# same result just working with the vector
x |> 
  gsub("^\\[{1,2}|\\]{1,2}$", "", x = _) |> 
  strsplit("',\\s|,\\s'") |> 
  unlist() |> 
  gsub("^'|'$", "", x = _) |> 
  vctrs::vec_count() # or table()
#>                                               key count
#> 1                     CBC Television (TV Network)     3
#> 2                                               0     3
#> 3                           Rob Ford (Politician)     2
#> 4                                the fifth estate     2
#> 5                          Documentary (TV Genre)     2
#> 6  Canadian Broadcasting Corporation (TV Network)     2
#> 7                            City Hall (Building)     1
#> 8              United States Of America (Country)     1
#> 9                                     Mark Kelley     1
#> 10                               Israel (Country)     1
#> 11                                     Bill Blair     1
#> 12                           Interview (TV Genre)     1
#> 13                                      Blue Jays     1
#> 14                                 Hockey (Sport)     1
#> 15                                        friends     1
#> 16                                  Peter Gzowski     1
#> 17                 Suicide Of Amanda Todd (Event)     1
#> 18                                 greenhouse gas     1
#> 19                       Dog Walking (Profession)     1
#> 20                          Flood (Disaster Type)     1
#> 21    National Hockey League (Sports Association)     1
#> 22                                    Amanda Todd     1
#> 23                                          Chair     1
#> 24                                     dog walker     1
#> 25                                    bomb threat     1
#> 26                                           dogs     1
#> 27              Climate Change (Website Category)     1
#> 28                     Robin Williams (Celebrity)     1
#> 29                      Margaret Chan (Physician)     1
#> 30                                  cyberbullying     1
#> 31                                          Ashes     1
#> 32                    Ontario (City/Town/Village)     1
#> 33                                 Iraq (Country)     1
#> 34                                            WHO     1
#> 35                                        cappers     1
#> 36                                Gillian Findlay     1
#> 37                          Military (Film Genre)     1
#> 38                       CBC News (Website Owner)     1
#> 39                                           B.C.     1
#> 40           Ebola (Disease Or Medical Condition)     1
#> 41                    Toronto (City/Town/Village)     1
#> 42                                          death     1
#> 43                                      emissions     1
#> 44                      Rocket (Product Category)     1
#> 45                   Gaza War (Military Conflict)     1
#> 46                                           dead     1
#> 47                 Burlington (City/Town/Village)     1

Created on 2022-10-08 with reprex v2.0.2
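
Note that the "[0]" placeholders are counted as a tag "0" in the output above. If you would rather drop them, one option is to extract only the single-quoted pieces instead of splitting on commas. This is a sketch on a small made-up vector (reusing the name `x`), and it assumes no tag contains an apostrophe; Python would normally wrap such a tag in double quotes, which this pattern does not handle.

```r
# toy strings in the same shape as the scraped data (illustrative only)
x <- c(
  "[['Dog', 'Cat', 'Mouse', 'Fish']]",
  "[0]",
  "[['Cat', 'Fish']]",
  "[['Cat', 'Fish']]"
)

# extract every 'single-quoted' run from each string;
# "[0]" contains none, so it contributes no tags at all
quoted <- regmatches(x, gregexpr("'[^']*'", x))

# flatten to one character vector and strip the surrounding quotes
tags <- gsub("^'|'$", "", unlist(quoted))

# tabulate, most frequent first
tag_counts <- sort(table(tags), decreasing = TRUE)
tag_counts
```

On this toy input, Cat and Fish are each counted 3 times and Dog and Mouse once, with no "0" row. `as.data.frame(tag_counts)` turns the result into a two-column keyword/count table if you need one.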

Jordan
  • Unfortunately, both of your suggestions returned only counts of row objects, not list objects. Check my edit above for an idea of the data I am working with. – BeaverFever Oct 07 '22 at 16:55
  • @BeaverFever, I think there's some terminology confusion. The strings in the character vector you included in your update look like Python lists, but in R they are just elements of an ordinary character vector. A list in R will behave differently. My edit shows how to parse those strings into a list and then into a character vector, which I think is what you want to do. – Jordan Oct 08 '22 at 15:53
  • Holy smokes that worked! You are right: I did scrape the data using Python. I think I may have actually caused it to use double brackets in the scraping process, as I assigned "0" as null values and I think this just added a second set of brackets, or whatever. Thanks – BeaverFever Oct 08 '22 at 16:13

It looks like you need unnest_longer():

library(dplyr)
library(tidyr)

df <- tibble(
  Tags = list(
    list('Dog', 'Cat', 'Mouse', 'Fish'),
    list('Cat', 'Fish'),
    list('Cat', 'Fish')
  )
)

df %>% 
  tidyr::unnest_longer(Tags) %>% 
  count(Tags) %>% 
  arrange(desc(n))
#> # A tibble: 4 × 2
#>   Tags      n
#>   <chr> <int>
#> 1 Cat       3
#> 2 Fish      3
#> 3 Dog       1
#> 4 Mouse     1
Matt
  • I tried this, but same results as Jordan's suggestions above. Check my edit in OP for an idea of the data I am working with. Maybe an issue with double list, but that's how YT data API churned it out – BeaverFever Oct 07 '22 at 16:56
  • If you have your data in RStudio, can you type `dput(data)` and copy/paste that? – Matt Oct 07 '22 at 17:01
  • For a number of reasons, I don't want to do that. I provided a relevant sample above. The issue appears to be how to work this with double brackets – BeaverFever Oct 07 '22 at 17:07
  • https://stackoverflow.com/questions/1169456/the-difference-between-bracket-and-double-bracket-for-accessing-the-el Here is an interesting article on how double brackets work with tidyverse (check out the pepper visuals lol). Unfortunately, I think the example only explains how to pluck instances of specific objects within the lists, but I want them all, counted by unique value. – BeaverFever Oct 07 '22 at 17:22