0

I have two unequal lengths of data frame.

The first df is 100K+rows containing Hotel Names with cities or without cities, example-

structure(list(Hotel Name = c("ancdr Wyndam NY vbhejn", "rifhwe Wynham SFO fgrnhie", "efuaschiw Marriott DC fgyweuinh", "hfeiwefj Marriott elsn"), Col2 = c("x", "x", "x", "x"), Col3 = c("x", "x", "x", "x" ), Col 4 = c("x", "x", "x", "x")), row.names = c(NA, -4L ), class = c("tbl_df", "tbl", "data.frame"))

The second df is ~100 rows containing city names, example -

Cities

NY

SFO

DC

My desired output should only be those rows from first data frame which have city names from second data frame:

Hilton NY
Marriott NY
Wyndham NY

I tried using %in% but it's returning an error saying unequal length of rows

  • 1
    Welcome to SO! You are most likely to get good help if you provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). For example you can just pull the first few rows of each dataframe using `dput(head(df1))` and `dput(head(df2))` and edit your question to include those so the community can help you. – Dan Adams Jan 25 '22 at 20:43
  • It sounds like an inner join would probably solve your problem depending on how things are formatted. See [this](https://stackoverflow.com/a/1300618/13210554) answer for details. E.g. `merge(x = df1, y = df2, by = "city")[,1:ncol(df1)]` – Dan Adams Jan 25 '22 at 20:54
  • This is still not in a format that is easy to copy-paste so we can test for you. Please try using the `dput()` function as suggested. – Dan Adams Jan 25 '22 at 21:39
  • Please see the updated Q – Sargam Gupta Jan 25 '22 at 21:50
  • You'll probably need to provide some additional info on how to sort the random text from the name of the hotel. – Dan Adams Jan 25 '22 at 22:09
  • the random text should be kept – Sargam Gupta Jan 25 '22 at 22:20
  • Ok, that's not the desired output you showed. Then my answer below seems like it get you what you want. If so you can [accept the answer](https://stackoverflow.com/help/someone-answers), otherwise please add more clarification to your question. – Dan Adams Jan 25 '22 at 23:24

2 Answers2

1
library(dplyr)
df <- tibble(hotels = c("Hilton", "Hilton", "Hilton", "Hilton", "Hilton", "Hilton"), cities1 = c("Washington", "NY", "San Francisco", "Warsaw", NA, "Wrocław"))
df2 <- tibble(cities2 = c('NY', 'Warsaw', 'Wrocław'))

df |>
  filter(cities1 %in% df2$cities2)
#> # A tibble: 3 × 2
#>   hotels cities1
#>   <chr>  <chr>  
#> 1 Hilton NY     
#> 2 Hilton Warsaw 
#> 3 Hilton Wrocław

Created on 2022-01-25 by the reprex package (v2.0.1)

Grzegorz Sapijaszko
  • 1,913
  • 1
  • 5
  • 12
  • Thanks but I cannot create df and df2, they are existing in the excel already with 1000s of rows – Sargam Gupta Jan 25 '22 at 21:04
  • If you want to operate on these data in `r` then it sounds like you need to first import the data from excel into `r`. I suggest using the [{readxl}](https://readxl.tidyverse.org/) package. – Dan Adams Jan 25 '22 at 21:05
  • Yep done that as the first step... now comparing two columns, one's value being contained in others – Sargam Gupta Jan 25 '22 at 21:10
  • From what you've described (without actualy sharing data to reproduce your issue) it sounds like the solutions offered should work. If you need more help you have to actually share some data. – Dan Adams Jan 25 '22 at 21:17
  • Just substitute df and df2 with your data. Ideally a small example will help, example: `dput(head(df))` – Grzegorz Sapijaszko Jan 25 '22 at 21:18
  • updated the question, thanks! – Sargam Gupta Jan 25 '22 at 21:41
1

To just filter df1 based on membership of df2$city in df1$Hotel_Name, you can use stringr::str_detect(). If you want to provide multiple options for the pattern to match you can separate them with |. Therefore I provided paste0(df2$city, collapse = "|") as the pattern to match. You can run just that line of code on it's own to see what it looks like.

library(tidyverse)

df1 <- data.frame(
    Hotel_Name = c(
      "ancdr Wyndam NY vbhejn",
      "rifhwe Wynham SFO fgrnhie",
      "efuaschiw Marriott DC fgyweuinh",
      "hfeiwefj Marriott elsn"
    ),
    Col2 = c("x", "x", "x", "x"),
    Col3 = c("x", "x", "x", "x"),
    Col4 = c("x", "x", "x", "x")
)

df2 <- data.frame(city = c("NY", "SFO", "DC"))

df1 %>% 
  filter(str_detect(Hotel_Name, paste0(df2$city, collapse = "|")))
#>                        Hotel_Name Col2 Col3 Col4
#> 1          ancdr Wyndam NY vbhejn    x    x    x
#> 2       rifhwe Wynham SFO fgrnhie    x    x    x
#> 3 efuaschiw Marriott DC fgyweuinh    x    x    x

Created on 2022-01-25 by the reprex package (v2.0.1)

Dan Adams
  • 4,971
  • 9
  • 28
  • Thanks Dan, I'm getting this error- Error: Problem with `filter()` input `..1`. i Input `..1` – Sargam Gupta Jan 26 '22 at 07:42
  • Could be a namespace collision. Try making it explicit with `dplyr::filter()` – Dan Adams Jan 26 '22 at 12:02
  • You can sniff this out proactively with [`conflicted::conflict_scout()`](https://conflicted.r-lib.org/reference/conflict_scout.html) and use the same package to resolve those issues in your code. – Dan Adams Jan 26 '22 at 12:13