
I am currently analyzing this dataset using R: https://www.kaggle.com/datasets/arnabchaki/goodreads-best-books-ever

I want to find out which genres occur most often in this dataset. The problem is that genres is a string variable that holds multiple genres per book, in this format: ['Religion', 'Nonfiction', 'Philosophy', 'Spirituality', 'Psychology', 'Theology', 'Cults', 'Self Help', 'Horror', 'Pseudoscience']. How can I transform this variable so that I can count how many times each individual genre occurs? Thanks in advance!

I tried:

df %>%
  count(genres) %>%
  arrange(desc(n))

But this only counts how many times each exact combination (and order) of genres occurs, so it would only work if the genres variable contained a single genre per book.

  • That column is in a JSON-like list format. You need to unpack it somehow - you could convert your data to a long format where you get a row for each genre for each book, or a wide format where you create a column for each genre. If you search the R tag for "JSON column" you'll find plenty of examples, e.g., [this one for unpacking wider](https://stackoverflow.com/q/66095361/903061) or [this one which looks to be for a long format](https://stackoverflow.com/q/59133043/903061). – Gregor Thomas May 31 '23 at 14:04
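
To make the "wide format" idea from the comment above concrete, here is a minimal base R sketch. The rows are made up for illustration (they are not taken from the dataset), `parsed` and `wide` are hypothetical names, and the parsing assumes that no genre name contains an apostrophe.

```r
# Toy strings in the same shape as the genres column (made-up rows)
genres <- c(
  "['Religion', 'Self Help', 'Horror']",
  "['Religion', 'Nonfiction']",
  "[]"
)

# Extract every single-quoted item, then drop the quotes themselves
parsed <- regmatches(genres, gregexpr("'[^']*'", genres))
parsed <- lapply(parsed, function(p) gsub("'", "", p))

# One logical column per genre, one row per book
all_genres <- sort(unique(unlist(parsed)))
wide <- t(vapply(parsed, function(p) all_genres %in% p,
                 logical(length(all_genres))))
colnames(wide) <- all_genres

# Column sums give the number of books tagged with each genre
colSums(wide)
```

The wide layout is convenient when you want to filter or model on individual genres; for plain counting, a long format (one row per book and genre) is usually simpler.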

3 Answers

0

Not knowing exactly what your data looks like, here's a somewhat tentative answer:

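library(dplyr)
library(tidyr)
df %>%
  # strip the brackets and quotes so only comma-separated genre names remain:
  mutate(strings = gsub("\\[|\\]|'", "", strings)) %>%
  # separate the genres each into their own row, splitting on commas only
  # (the default `sep` would also split "Self Help" into "Self" and "Help"):
  separate_rows(strings, sep = ",\\s*") %>%
  # remove empty values (generated for "[]"):
  filter(strings != "") %>%
  # for each genre...
  group_by(strings) %>%
  #... count occurrences:
  summarise(Freq = n())
# A tibble: 10 × 2
   strings        Freq
   <chr>         <int>
 1 Cults             1
 2 Horror            2
 3 Nonfiction        1
 4 Philosophy        2
 5 Pseudoscience     2
 6 Psychology        1
 7 Religion          2
 8 Self Help         2
 9 Spirituality      2
10 Theology          2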

Data:

df <- data.frame(
  strings = c("['Religion', 'Philosophy', 'Spirituality', 'Theology', 'Cults', 'Self Help', 'Horror', 'Pseudoscience']",
    "['Religion', 'Nonfiction', 'Philosophy', 'Spirituality', 'Psychology', 'Theology', 'Self Help', 'Horror', 'Pseudoscience']")
)
Chris Ruehlemann
0

I found a solution that works:

First I removed the [ and ] symbols and the single quotes from the genres data:

data$genres <- gsub("\\[|\\]|'", "", data$genres)

Then I created a new row for each single genre by splitting on the commas, trimmed the space left after each comma, and counted the number of times each genre occurred:

library(dplyr)
library(tidyr)
data %>%
  separate_longer_delim(genres, delim = ",") %>%
  mutate(genres = trimws(genres)) %>%
  filter(genres != "") %>%
  count(genres) %>%
  arrange(desc(n))

I found the solution by trying to understand the separate_rows function as Chris Ruehlemann suggested. Thanks!
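
One thing worth checking with any plain comma split is whether quotes and leading spaces are left on the pieces, because they make the same genre look like different values depending on its position in the list. The base R sketch below illustrates this on two made-up strings; `raw_parts` and `clean_parts` are hypothetical names.

```r
# Two toy strings listing the same two genres in a different order
x <- c("['Religion', 'Nonfiction']", "['Nonfiction', 'Religion']")

# Remove only the brackets, then split on commas
no_brackets <- gsub("\\[|\\]", "", x)
raw_parts <- unlist(strsplit(no_brackets, ",", fixed = TRUE))

# Quotes and leading spaces survive, so there are four distinct values
table(raw_parts)

# Dropping the quotes and trimming whitespace merges them into two genres
clean_parts <- trimws(gsub("'", "", raw_parts))
table(clean_parts)
```

If the counts look lower than expected or a genre shows up twice in the result, leftover whitespace or quoting is a likely cause.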

-2

The variable contains a list of strings separated by commas (","). My recommendation is therefore to first unlist this variable row by row, with the code given here, and then use the code below, which groups by genre and counts each occurrence. After generating the long list, I recommend saving it to a temporary file so that it does not use up a lot of your computational resources.

Try:

df %>%
  # assumes genres has already been split so that each row holds a single genre
  mutate(genres = as.factor(genres)) %>% # convert to a categorical variable
  group_by(genres) %>%
  summarize(n = n()) %>%
  arrange(desc(n))
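
The "unlist row by row" step that this answer relies on is not shown, so here is one possible sketch of it in base R (chosen so that it runs without extra packages). The data frame is a made-up stand-in, and `cleaned`, `genre_list` and `long` are hypothetical names.

```r
# Hypothetical stand-in for the real data frame
df <- data.frame(
  title  = c("Book A", "Book B"),
  genres = c("['Religion', 'Self Help']", "['Religion', 'Horror']"),
  stringsAsFactors = FALSE
)

# Strip brackets and quotes, then split each string on commas
cleaned <- gsub("\\[|\\]|'", "", df$genres)
genre_list <- strsplit(cleaned, ",\\s*")

# Long format: one row per book and genre
long <- data.frame(
  title  = rep(df$title, lengths(genre_list)),
  genres = unlist(genre_list),
  stringsAsFactors = FALSE
)

# Count each genre, most frequent first
counts <- sort(table(long$genres), decreasing = TRUE)
counts
```

Once the data is in this long shape, grouping by genres and counting gives per-genre totals rather than per-combination totals.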
 
Jin_soo
  • This does the same as the code that I tried. I think I probably have to transform the genres variable for that to work, but I don't know how. – Luca4815162342 May 31 '23 at 13:51
  • Maybe create a temp file based on [here](https://stackoverflow.com/questions/24249351/split-a-column-by-group) and then run the code. – Jin_soo May 31 '23 at 14:02
  • Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, **can you [edit] your answer to include an explanation of what you're doing** and why you believe it is the best approach? – Jeremy Caney Jun 01 '23 at 00:12