2

I do have 2 tables where in the table data frame consist data with comma separated values.

Need to find out which category present in lookup data frame i.e. in Option 1 out of "Cat A, Cat B, Cat C" which category is present. if found, need to show the score against it.

incase multiple instance found, we need to show the score of first occurrence hence in first case Cat C and Cat Z both comes under Option 1 however we need to show the score for Cat C only which is 2.

Table

Option           | Cat 
-----             ------   
Option 1         | Cat A, Cat B, Cat C, Cat Z
Option 2         | Cat X, Cat Y, Cat Z, Cat B
Option 3         | Cat P, Cat Q, Cat R, Cat S, Cat T 
Option 1         | Cat T 
Option 3         | Cat E, Cat F 

Lookup

Cat   |   Option    |   Score

Cat A |   Option 2   |   8
Cat B |   Option 3   |   3
Cat C |   Option 1   |   2
Cat X |   Option 3   |   9
Cat Y |   Option 1   |   1
Cat Z |   Option 2   |   1
Cat P |   Option 3   |   2
Cat Q |   Option 2   |   9
Cat R |   Option 1   |   4
Cat S |   Option 4   |   0
Cat T |   Option 1   |   5
Cat E |   Option 4   |   1
Cat F |   Option 3   |   10

Output

 Option      |     Cat                                |     Found_cat    |     Score
 Option 1    |     Cat A, Cat B, Cat C                |     Cat C        |     2
 Option 2    |     Cat X, Cat Y, Cat Z, Cat B          |     Cat Z        |     1
 Option 3    |     Cat P, Cat Q, Cat R, Cat S, Cat T  |     Cat P        |     2
 Option 1    |     Cat T                              |     Cat T        |     5
 Option 3    |     Cat E, Cat F                       |     Cat F        |     10

R Code for creating dataframe

Table <- data.frame(Option = c("Option 1", "Option 2", "Option 3", "Option 1", "Option 3"),
                 Cat = c("Cat A, Cat B, Cat C", "Cat X, Cat Y,Cat Z, Cat B", "Cat P, Cat Q, Cat R, Cat S, Cat T", "Cat T", "Cat E, Cat F"))

lookup <-data.frame(Cat = c("Cat A", "Cat B", "Cat C", "Cat X", "Cat Y", "Cat Z", "Cat P", "Cat Q", "Cat R", "Cat S", "Cat T", "Cat E", "Cat F"),
                    Options = c("Option 2", "Option 3", "Option 1", "Option 3", "Option 1", "Option 2", "Option 3", "Option 2", "Option 1", "Option 4", "Option 1", "Option 4", "Option 3"),
                    Score = c(8, 3, 2, 9, 1, 1, 2, 9, 4, 0, 5, 1, 10)) 



output <- data.frame(Option = c("Option 1", "Option 2", "Option 3", "Option 1", "Option 3"),
                     Cat = c("Cat A, Cat B, Cat C", "Cat X, Cat Y, Cat Z, Cat B,", "Cat P, Cat Q, Cat R, Cat S, Cat T", "Cat T", "Cat E, Cat F"),
                     Found_cat = c("Cat C", "Cat Z", "Cat P", "Cat T", "Cat F"),
                     Score = c(2, 1, 2, 5, 10))
ayush varshney
  • 517
  • 7
  • 20
  • Thanks @GGamba for your edit, Would you be able to help me on this? – ayush varshney Jun 07 '17 at 18:23
  • Hi Ayush, what have you tried so far? Or have you just set up the data? Hopefully you'll share with us what you've attempted so we have a jumping off point. – Joy Jun 07 '17 at 18:37
  • Hi Joy, Thanks for your comment, as of now i am not able to land on starting point. I just identified the approach. though i am also working on this using for loop however this might not be a idle approach because in my requirement there are more than 25k+ rows. :( – ayush varshney Jun 07 '17 at 18:38
  • Since i was in big hurry and exhausted now by working more than 25 hours now hence your help would be really appreciable. – ayush varshney Jun 07 '17 at 18:41
  • 1
    I think a first step for you is to un-aggregate the comma separated data into separate rows. See https://stackoverflow.com/questions/13773770/split-comma-separated-column-into-separate-rows/31514711#31514711 for several approaches. Then you can do a merge to combine your cleaned up data. – Joy Jun 07 '17 at 18:45

2 Answers2

2

I just give a quick try using loop:

Table <- data.frame(Option = c("Option 1", "Option 2", "Option 3", "Option 1", "Option 3"),
                Cat = c("Cat A, Cat B, Cat C", "Cat X, Cat Y, Cat Z, Cat B", "Cat P, Cat Q, Cat R, Cat S, Cat T", "Cat T", "Cat E, Cat F"), 
                stringsAsFactors = F)

lookup <-data.frame(Cat = c("Cat A", "Cat B", "Cat C", "Cat X", "Cat Y", "Cat Z", "Cat P", "Cat Q", "Cat R", "Cat S", "Cat T", "Cat E", "Cat F"),
                Options = c("Option 2", "Option 3", "Option 1", "Option 3", "Option 1", "Option 2", "Option 3", "Option 2", "Option 1", "Option 4", "Option 1", "Option 4", "Option 3"),
                Score = c(8, 3, 2, 9, 1, 1, 2, 9, 4, 0, 5, 1, 10), 
                stringsAsFactors = F)
app = matrix(nrow = nrow(Table), ncol = 2)
for (i in 1:nrow(Table)) {
lookup.cats = lookup$Cat[lookup$Options == Table$Option[i]]
Table.cats = unlist(strsplit(Table$Cat[i], split = ', '))
found.cat = intersect(lookup.cats, Table.cats)
score = lookup$Score[which(lookup$Cat == found.cat & 
                               lookup$Options == Table$Option[i])]
app[i, 1] = found.cat
app[i, 2] = score
}
app = as.data.frame(app)
names(app) = c('Found_cat', 'Score')
cbind(Table, app)
Sean Lin
  • 805
  • 4
  • 12
  • Awesome Sean, You are really a savior. A BIG THANK YOU! – ayush varshney Jun 07 '17 at 18:51
  • @ayushvarshney note that there is a missing space before 'Cat Z' in your data frame Table, I couldn't edit your question now, but I fixed it in answer. – Sean Lin Jun 07 '17 at 18:55
  • Thanks for pointing this out, i corrected it in my question as well, just a suggestion to check sotos code as well which seems to be really fast for larger data sets.... – ayush varshney Jun 07 '17 at 19:02
  • @ayushvarshney yes, I was trying a short version after trying loop, and his code pops up, really awesome one. – Sean Lin Jun 07 '17 at 19:10
  • No problem, You spend time for my problem which is really nice. Cannot share my gratitudes with words. Cheers! – ayush varshney Jun 07 '17 at 19:13
1

Here is a tidyverse option,

library(tidyverse)

Table %>% 
  unnest(Cat = strsplit(as.character(Cat), ', ')) %>% 
  inner_join(lookup, by = c('Option', 'Cat')) %>%
  select(Cat, Score) %>% 
  rename(Cat_Found = Cat) %>% 
  bind_cols(Table, .)

#    Option                               Cat Cat_Found Score
#1 Option 1               Cat A, Cat B, Cat C     Cat C     2
#2 Option 2        Cat X, Cat Y, Cat Z, Cat B     Cat Z     1
#3 Option 3 Cat P, Cat Q, Cat R, Cat S, Cat T     Cat P     2
#4 Option 1                             Cat T     Cat T     5
#5 Option 3                      Cat E, Cat F     Cat F    10
ayush varshney
  • 517
  • 7
  • 20
Sotos
  • 51,121
  • 6
  • 32
  • 66
  • 1
    Really nice approach Sotos, thanks a lot for pointing out the inconsistencies however i really like the way you handled the inconsistencies with your code. i have amended the changes in my question as well for the suggested second point however i am unable to get your first point. – ayush varshney Jun 07 '17 at 19:05
  • Thanks. Minor note here: If you get any warnings while running it it's due to some of your variables being factors. Nothing to worry about :) – Sotos Jun 07 '17 at 19:06
  • anyways, thanks a lot for your time and help here, much appreciated. – ayush varshney Jun 07 '17 at 19:06
  • My first comment was that in your reproducible data, the delimeter between CatY and CatZ in the second row string of Table, was missing a comma...just a tiny typo – Sotos Jun 07 '17 at 19:07
  • could you please help me once again, so lets say table consist Option 1 | Cat A, Cat B, Cat C, Cat T in this case i wanted to show the score for first occurrence i.e. Cat C score, how can i handle this situation. – ayush varshney Jun 07 '17 at 20:10