1

I have this data frame glimpse(df)

Observations: 2,211
Variables: 3
$ city       <chr> "Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas", "Las Veg...
$ categories <chr> "c(\"Korean\", \"Sushi Bars\")", "c(\"Japanese\", \"Sushi Bars\")", "Tha...
$ is_open    <chr> "0", "0", "1", "0", "1", "1", "0", "1", "0", "1", "1", "1", "0", "1", "1...

Here is a small dput()

structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas", 
"Phoenix", "Las Vegas"), categories = c("c(\"Korean\", \"Sushi Bars\")", 
"c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")", 
"Korean"), is_open = c("0", "0", "1", "0", "1")), .Names = c("city", 
"categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")

The data consists different cities city with different cuisines categories. I want to make a contingency table to visualize which cuisines are associated with closings (is_opem = 0) or openings (is_open = 1).

I want to do this with a contingency table. To do so I tried this one but I got this error:

xtabs(is_open ~., data = df)

Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument

When I convert the variables as.factor() I get lots of tables, not one. Is there a way to this so ít´s looking like below?

Categorie/City          Las Vegas     Pittsburgh
           Korean       50/50         30/70
           Sushi Bars   40/60         40/60

The numbers in the columns are the frequencies of the closings (is_opem = 0) and openings (is_open = 1) for each category per city (e.g. for Korean in Las Vegas the distribution for closings(0) and openings(1) is 50/50).

Manuel Bickel
  • 2,156
  • 2
  • 11
  • 22
Banjo
  • 1,191
  • 1
  • 11
  • 28
  • please post the output of `dput(df)` instead of `glimpse(df)` – Roman Dec 13 '17 at 14:00
  • Could you explain your output again, I do not understand how these numbers are generated. To my understanding you either divide each row/col by total counts, rowsum or colsum in contingeny tables (or leave the counts). In addition, I think your data structure should not inlcude two categories in one cell, instead one row should be generated per category, this would at least resemble a correct "long format" or the data. – Manuel Bickel Dec 13 '17 at 14:23

1 Answers1

1

Here a solution using data.table to cast your data using a counting function based on stri_count from stringi package. The latter might also be achieved via table or sum(grepl()) with ifelse constructs (depending on the required flexibility with respect to the data structure, speed requirements, etc.). Please note that I have also reformatted your data into more clean "long format" with the aid of this answer. This reformatting might be skipped if you format your data this way right from the start. I hope this is what you are looking for.

#your data
df <- structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas")
                       ,categories = c("c(\"Korean\", \"Sushi Bars\")", 
                                     "c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")", 
                                     "Korean")
                       ,is_open = c("0", "0", "1", "0", "1"))
                       ,.Names = c("city",  "categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")

library(data.table)
library(stringi)                                  

#format data to correct "long format"
DT <- as.data.table(df)
DT[, categories := gsub("c\\(\"|\"|\"\\)", "", categories)]
DT <- DT[, .(categories = unlist(strsplit(as.character(categories), ", ", fixed = TRUE))), 
         by = .(city, is_open)]
#           city is_open categories
# 1:  Las Vegas       0     Korean
# 2:  Las Vegas       0 Sushi Bars
# 3: Pittsburgh       0   Japanese
# 4: Pittsburgh       0 Sushi Bars
# 5:  Las Vegas       1       Thai
# 6:  Las Vegas       1     Korean
# 7:    Phoenix       0 Sushi Bars
# 8:    Phoenix       0   Japanese

#specify all_unique_count_items to also cover items that are not present in x
calc_count_distr <-  function(x, all_unique_count_items) {

    count_distribution <- sapply(all_unique_count_items, function(y) {
                                     100*round(sum(stri_count_fixed(x, y))/length(x), d =2)
                                })
    paste(count_distribution, collapse = "/")
}

dcast.data.table(DT, categories ~ city, value.var = "is_open"
                 ,fun.aggregate = function(x) calc_count_distr(x, all_unique_count_items = unique(DT$is_open))
                 ,fill = NA)
#   categories Las Vegas Phoenix Pittsburgh
#1:   Japanese        NA   100/0      100/0
#2:     Korean     50/50      NA         NA
#3: Sushi Bars     100/0   100/0      100/0
#4:       Thai     0/100      NA         NA
Manuel Bickel
  • 2,156
  • 2
  • 11
  • 22
  • 1
    you probably want to perform all of the cleaning _after_ converting to `data.table` – Ricardo Saporta Dec 13 '17 at 20:07
  • Thanks for the hint, changed it. (I know I could have piped the `data.table` commands during cleaning/reformatting, however, I wanted to maintain easy readability) – Manuel Bickel Dec 13 '17 at 20:24