I have a large data set of product complaint reports submitted to the FDA for foods, dietary supplements, and cosmetics. Each row describes one report: the person, the product, and the symptoms they reported. After cleaning the data, I build an indicator matrix of 0s and 1s with one column per symptom:

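# Split each report's comma-separated symptom string into a character vector
syms <- strsplit(dat$symptoms, ", ")
all_syms <- unique(unlist(syms))

# One row per report, one column per distinct symptom, initialised to 0
tm <- matrix(0, nrow = nrow(dat), ncol = length(all_syms))
colnames(tm) <- all_syms

# Mark the symptoms present in each report (seq_along is safe for empty input)
for (i in seq_along(syms)) {
  tm[i, syms[[i]]] <- 1
}
dat$symptoms <- NULL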

'dat' contains the complaint data, one row per report:

received id ... product outcome
9/30/2022 2022-CFS-014640 ... centrum silver men's 50+ other outcome
9/30/2022 2022-CFS-014637 ... liquid collagen shot life threatening

and 'tm' holds the symptom matrix:

diarrhoea vomiting cancer
0 1 0
1 0 0
... ... 1

I need to find the list of products a person should avoid if they don't want to get cancer. I tried this:

# Find rows in tm matrix where the "cancer" symptom is present
cancer_rows <- which(tm[, "cancer"] == 1)

# Create a vector of product names associated with "cancer" symptoms
products_to_avoid <- unique(dat$product[cancer_rows])

but this doesn't work for me. Does anyone have an idea how to write it properly?

cinnamond
Welcome to Stack Overflow. What we need is a question (already done), a minimal example data frame (missing here; you can create one with dput(df), for example), and ideally the desired output. This matters in two ways: first, you will learn much faster by doing so, and second, you will make it easier for us to help. See "How to make a great R reproducible example". – TarJae Mar 09 '23 at 06:48

1 Answer

You can filter by symptoms using regex without making a variable for each symptom (note that this only works before you set dat$symptoms to NULL):

unique(dat$product[grepl("cancer", dat$symptoms)])
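As a minimal sketch of what this filter does, here it is on a made-up data frame. The product names and symptom strings are hypothetical; only the column names `product` and `symptoms` come from the question. It also shows one thing to watch for: `grepl` matches substrings, so a term such as "breast cancer" is caught as well. If only the exact term should count, splitting the string and testing membership is one alternative.

```r
# Hypothetical toy data; column names follow the question, values are invented.
dat <- data.frame(
  product  = c("product A", "product B", "product C", "product A"),
  symptoms = c("vomiting", "diarrhoea, cancer", "breast cancer, vomiting", "cancer"),
  stringsAsFactors = FALSE
)

# Substring match: any report whose symptom string contains "cancer".
products_substring <- unique(dat$product[grepl("cancer", dat$symptoms, fixed = TRUE)])

# Exact match: split into individual symptom terms and test membership.
syms <- strsplit(dat$symptoms, ", ", fixed = TRUE)
has_cancer <- vapply(syms, function(s) "cancer" %in% s, logical(1))
products_exact <- unique(dat$product[has_cancer])

print(products_substring)
print(products_exact)
```

Which of the two is appropriate depends on how the symptom terms are coded in the real data, which the question does not show.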

For extracting symptoms, you could also use a tidyverse approach to easily keep it within the same data frame. For example:

library(dplyr)
library(tidyr)
library(tibble)

dat_syms <-
  dat %>%
  mutate(
    syms = symptoms %>%
      strsplit(", ") %>%
      lapply(table) %>%
      lapply(as.data.frame)
  ) %>%
  unnest(syms) %>%
  spread(Var1, Freq, fill = 0)

unique(dat_syms$product[dat_syms$cancer == 1])
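The same reshaping can also be sketched with `separate_rows()` and `pivot_wider()` from tidyr, since `spread()` is marked as superseded in current tidyr releases. This is only an alternative sketch on hypothetical toy data: the `id` values, product names, and the helper columns `symptom` and `present` are assumptions for illustration, not part of the original data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; real reports would carry the FDA id and other columns.
dat <- data.frame(
  id       = c("r1", "r2", "r3"),
  product  = c("product A", "product B", "product A"),
  symptoms = c("vomiting", "diarrhoea, cancer", "cancer, vomiting"),
  stringsAsFactors = FALSE
)

dat_syms <- dat %>%
  mutate(symptom = symptoms) %>%           # keep the original string, split a copy
  separate_rows(symptom, sep = ", ") %>%   # one row per report and symptom
  mutate(present = 1) %>%
  distinct() %>%                           # guard against a symptom listed twice
  pivot_wider(names_from = symptom, values_from = present, values_fill = 0)

products <- unique(dat_syms$product[dat_syms$cancer == 1])
print(products)
```

The result keeps one row per report, with a 0/1 column for each symptom next to the original columns.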

However, it is important to note that while this lists products for which customers complained about cancer, it is likely not very informative about whether those products should be avoided. To be informative, you would have to make very strong assumptions about the data, e.g. that customers who complain actually know that it was indeed that product which caused their cancer, which is obviously not true.
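To put the raw list in some context, one option is to count, per product, how many reports mention cancer and how many reports there are in total. The sketch below uses hypothetical toy data and base R only. The resulting share is purely descriptive and does not address the causal problem described above.

```r
# Hypothetical toy data; values are invented for illustration.
dat <- data.frame(
  product  = c("A", "A", "A", "B", "B"),
  symptoms = c("cancer", "vomiting", "vomiting", "cancer", "cancer, diarrhoea"),
  stringsAsFactors = FALSE
)

# TRUE for every report whose symptom string mentions "cancer"
has_cancer <- grepl("cancer", dat$symptoms, fixed = TRUE)

# Per-product totals; tapply orders the groups alphabetically by product
n_reports <- tapply(has_cancer, dat$product, length)
n_cancer  <- tapply(has_cancer, dat$product, sum)

per_product <- data.frame(
  product   = names(n_reports),
  n_reports = as.vector(n_reports),
  n_cancer  = as.vector(n_cancer)
)
per_product$share_cancer <- per_product$n_cancer / per_product$n_reports

print(per_product)
```

A product with a single report that mentions cancer gets a share of 1, so small counts need particular care when reading such a table.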