-1

Reproduceable dataset

   data1 <- data.frame(ID = c(1,2), Description = c("Chiquita","Chiquita mazamorra"), Max = c(200,125))
   data2 <- data.frame(ID = c(1,2,3,4,5,6,7), Description = c("Chiquita mini", "Chiquita Oriville","Chiquita 24h","Manzano Chiquita 5j...","Chiquita mazamorra 1,2h..","Chiquita mazamorra Buro","Chiquita AM 2F"), Max = c(24,110,80,90,134,123,210))

I have a dataset , data1, as shown below

  Id     Description            Max
  1      Chiquita               200
  2      Chiquita mazamorra     125

I have another dataset, data2, as shown below

  Id     Description                   Actual
  1      Chiquita mini                 24
  2      Chiquita Oriville             110
  3      Chiquita 24h                  80
  4      Manzano Chiquita 5j...        90
  5      Chiquita mazamorra 1,2h...    134
  6      Chiquita mazamorra Buro       123
  7      Chiquita AM 2F                210
  8      Chiquita.....                 124
  9      Chiquita(P)                   213
  10     Chiquita, mazamorra, S        188                   

If statement should check if the Data2 Description contains this character in data2$Description Chiquita mazamorra, if yes then check if Data2$Actual > Data1$Max. If Yes then Results == Good, else Small. Note there can be other characters after Chiquita mazamorra like for example Chiquita mazamorra 1,2h.. this is okay, but not Chiquita mazamorra Buro

Similarly another ifelse should check if the Data2 Description contains Chiquita if yes then check if Data2$Actual > Data1$Max. If Yes then Results == Good, else Small. There can be other characters after Chiquita like for example Chiquita 24h or Chiquita AM 2F these are okay but not Chiquita mini or Chiquita Oriville

This is the final desired output (data2)

  Id     Description                   Actual      Result
  1      Chiquita mini                 24          NA
  2      Chiquita Oriville             110         NA
  3      Chiquita 24h                  80          Small
  4      Manzano Chiquita 5j...        90          NA
  5      Chiquita mazamorra 1,2h...    134         Good         
  7      Chiquita mazamorra Buro       123         NA
  6      Chiquita AM 2F                210         Good
  8      Chiquita.....                 124         Small
  9      Chiquita(P)                   213         NA
  10     Chiquita, mazamorra, S        188         Good

I have know this can be done using a combination of grepl and ifelse statements I am not very confident ? Maybe there is a better way of doing this, I dont know, I am getting very confused. Need help.

  • Please provide reproducible example if you really need help: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Metrics Feb 23 '15 at 00:39

1 Answers1

0

Here's a general outline of a solution

data1 <- read.csv(text=
"Id,Description,Max
1,Chiquita,200
2,Chiquita mazamorra,125")

data2 <- read.csv(text=
"Id,Description,Actual
1,Chiquita mini,24
2,Chiquita Oriville,110
3,Chiquita 24h,80
4,Manzano Chiquita 5j,90
5,Chiquita mazamorra 12h,134
6,Chiquita mazamorra Buro,123
7,Chiquita AM 2F,210")


# start by trimming the description to the first few words 
# that don't start with a number
data2$Description_trimmed <- gsub('\\s+\\d.*$','',data2$Description)

# initialize the output field
data2$Results <- NA

# loop while there are missing values in data$Results
while(any(is.na(data2$Results))){

    # identify records that still need to be calculated
    indx <- is.na(data2$Results)

    # calculate the result based on the current trimmed description
    data2[indx,'Results']  <-  ifelse(
                data2[indx,'Actual']  < 
                    data1[match(data2[indx,'Description_trimmed'],
                                data1[    ,'Description']),
                          "Max"],
                'Good',
                'Small')

    # trim the last word from Description_trimmed
    data2$Description_trimmed <- gsub('(^| +)[^ ]*$','',data2$Description_trimmed)

    # stop if the remaining trimmed descriptions are empty
    if(all(grepl('^\\s*$',data2$Description_trimmed)))
        break
}

data2
#>   Id             Description Actual Description_trimmed Results
#> 1  1           Chiquita mini     24                        Good
#> 2  2       Chiquita Oriville    110                        Good
#> 3  3            Chiquita 24h     80                        Good
#> 4  4     Manzano Chiquita 5j     90                        <NA>
#> 5  5  Chiquita mazamorra 12h    134                       Small
#> 6  6 Chiquita mazamorra Buro    123                        Good
#> 7  7          Chiquita AM 2F    210                       Small

(BTY, this solution calculates is.na(data$Results) twice per loop, whereas you only really need to calculate it once - I was going for ease of reading not efficiency in this aspect...)

Jthorpe
  • 9,756
  • 2
  • 49
  • 64
  • tried your solution the code is going into an infinite loop, also I am confused with some variables you referenced in the proposed solution for example data$Results ?? I am assuming this is data2$Results :) – Bridgeport Tucker Jr Feb 23 '15 at 03:33
  • :) that fix worked, but the logic dosent pick up cases where Description has trailing ...... for instance row 8, Data2, since this observation belongs to rule 1 in data1 (data1, row1), I was hoping the results would show small or good but right now it shows NA :) , similarly when there is a , in between which is row 10 data2. I apologize this is a very wild dataset and I am very thankful for help. – Bridgeport Tucker Jr Feb 23 '15 at 04:31
  • You will have to tailor the regular expressions to your dataset. For example, substituting `'\\s+\\d.*$'` for `'\\s+[0-9.].*$'` might help with the trailing `'...'` problem. I suggest trying out https://regex101.com/ and testing some expressions with the strings that are giving you fits. :) – Jthorpe Feb 23 '15 at 05:42