2

I am trying to compare multiple columns in two different dataframes in R. This has been addressed previously on the forum (Compare group of two columns and return index matches R) but this is a different scenario: I am trying to compare if a column in dataframe 1 is between the range of 2 columns in dataframe 2. Functions like match, merge, join, intersect won't work here. I have been trying to use purr::pluck but didn't get far. The dataframes are of different sizes.

Below is an example:

temp1.df <- mtcars

temp2.df <- data.frame(
  Cyl = sample (4:8, 100, replace = TRUE),
  Start = sample (1:22, 100, replace = TRUE),
  End = sample (1:22, 100, replace = TRUE)
)

temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)

My attempt:

temp1.df <- temp1.df %>% mutate (new_mpg = case_when (
  temp1.df$cyl %in% temp2.df$Cyl & temp2.df$Start <= temp1.df$mpg & temp2.df$End >= temp1.df$mpg ~ 1
))

Error:

Error in mutate_impl(.data, dots) : 
  Column `new_mpg` must be length 32 (the number of rows) or one, not 100

Expected Result:

  1. Compare temp1.df$cyl and temp2.df$Cyl. If they are match then -->
  2. Check if temp1.df$mpg is between temp2.df$Start and temp2.df$End -->
  3. if it is, then create a new variable new_mpg with value of 1.

It's hard to show the exact expected output here.

I realize I could loop this so for each row of temp1.df but the original temp2.df has over 250,000 rows. An efficient solution would be much appreciated.

Thanks

KP1
  • 129
  • 2
  • 8

1 Answers1

1
temp1.df$new_mpg<-apply(temp1.df, 1, function(x) {
  temp<-temp2.df[temp2.df$Cyl==x[2],] 
  ifelse(any(apply(temp, 1, function(y) {
    dplyr::between(as.numeric(x[1]),as.numeric(y[2]),as.numeric(y[3]))
  })),1,0)
})

Note that this makes some assumptions about the organization of your actual data (in particular, I can't call on the column names within apply, so I'm using indexes - which may very well change, so you might want to rearrange your data between receiving it and calling apply, or maybe changing the organization of it within apply, e.g., by apply(temp1.df[,c("mpg","cyl")]....

At any rate, this breaks your data set into lines, and each line is compared to the a subset of the second dataset with the same Cyl count. Within this subset, it checks if any of the mpg for this line falls between (from dplyr) Start and End, and returns 1 if yes (or 0 if no). All these ones and zeros are then returned as a (named) vector, which can be placed into temp1.df$new_mpg.

I'm guessing there's a way to do this with rowwise, but I could never get it to work properly...

iod
  • 7,412
  • 2
  • 17
  • 36
  • The solution works when `cyl` and `Cyl` are `numeric`. I overlooked that my `cyl` and `Cyl` in the original dataset are characters. If so the above solution does not work. Gives ` Error in dplyr::between(x[1], y[2], y[3]) : Not compatible with requested type: [type=character; target=double]. ` – KP1 Nov 16 '18 at 16:12
  • Here's the code with column names. `temp1.df$cyl <- as.character(temp1.df$cyl) temp2.df$Cyl <- as.character(temp2.df$Cyl) temp1.df$mpg <- as.numeric(temp1.df$mpg) temp2.df$Start <- as.numeric(temp2.df$Start) temp2.df$End <- as.numeric(temp2.df$End) temp1.df$new_mpg<-apply(temp1.df [,c("mpg", "cyl")], 1, function(x) { temp<-temp2.df[temp2.df$Cyl==x["cyl"],] print (temp) ifelse(any(apply(temp, 1, function(y) { dplyr::between(x["mpg"],y["Start"],y["End"]) })),1,0) })` – KP1 Nov 16 '18 at 16:16
  • Just add `as.numeric()` to each of the arguments in `between`. I've adjusted the code above. – iod Nov 16 '18 at 16:23
  • For a datasets with 1101 and 256833 observations it takes a while but it works great. Thanks. – KP1 Nov 16 '18 at 17:02
  • yeah, it's an apply within an apply - not surprised it's taking a while... Glad it works for you nevertheless. – iod Nov 16 '18 at 17:16