Only keeping string text present in another dataframe in R

Question

I am relatively new to R.

I have two data frames, `final` and `cv`, each with a single variable.

final looks like:

V1
humans, aged, female, stroke
infant, male, echocardiography
aneurysm, adolescent, female, diabetes
pregnant, diabetes, female
cardiovascular diseases, complications

and cv looks like

V2
stroke
pregnant
echocardiography
aneurysm
diabetes
cardiovascular diseases

I want to manipulate final so that it only includes the text present in cv. This is what I want the resulting dataframe of final to look like:

V1
stroke
echocardiography
aneurysm, diabetes
pregnant, diabetes
cardiovascular diseases

Please advise. Thanks!

Do not copy/paste your data here. Please read [How to make a great reproducible example in R?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — M--, Aug 11 '17 at 21:09

www · Accepted Answer · 2017-08-15T17:20:16.353

We can use functions from dplyr and stringr. In addition, the `or1` function from rebus is useful for building the regular expression: it combines all the phrases in `df2$V2` into a single alternation pattern. `str_extract_all` extracts every match in each string. When a row contains more than one phrase, the result for that row is a character vector, which turns into text such as `c("aneurysm", "diabetes")` once it is treated as a string. I used several `str_replace` calls with `fixed` to remove `c(`, `)`, and `"`. This part could be done more efficiently with a regex, but I am not familiar with regex. `df_final` is the final output.
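To see where the `c(` and `"` characters come from, here is a minimal sketch that uses only stringr. The names `phrases`, `pattern`, `rows`, and `matches` are hypothetical, and the hand-built alternation is only assumed to behave like the pattern `or1` produces for these plain phrases.

```r
library(stringr)

# Hypothetical subset of the phrases in cv$V2
phrases <- c("stroke", "aneurysm", "diabetes")

# Hand-built alternation; assumes the phrases contain no regex metacharacters
pattern <- paste(phrases, collapse = "|")

rows <- c("humans, aged, female, stroke",
          "aneurysm, adolescent, female, diabetes")

# str_extract_all returns a list with one character vector per input string
matches <- str_extract_all(rows, pattern)
lengths(matches)

# Coercing the list to character deparses entries that hold more than one
# match, which is what introduces the c( ... ) wrapper and the quotes
as.character(matches)
```

The second element of `as.character(matches)` is the string `c("aneurysm", "diabetes")`, which is the text the `str_replace` calls then clean up.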

# Load packages
library(dplyr)
library(stringr)
library(rebus)

# Create example data frames (tibble() replaces the deprecated data_frame())
df1 <- tibble(V1 = c("humans, aged, female, stroke", "infant, male, echocardiography",
                     "aneurysm, adolescent, female, diabetes", "pregnant, diabetes, female",
                     "cardiovascular diseases, complications"))
df2 <- tibble(V2 = c("stroke", "pregnant", "echocardiography", "aneurysm",
                     "diabetes", "cardiovascular diseases"))

# Process the data
df_final <- df1 %>%
  mutate(V1 = str_extract_all(V1, or1(df2$V2))) %>%
  mutate(V1 = str_replace(V1, fixed("c("), "")) %>%
  mutate(V1 = str_replace(V1, fixed(")"), "")) %>%
  mutate(V1 = str_replace_all(V1, fixed('"'), ""))
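As a possible alternative to the cleanup steps, the matches for each row can be collapsed directly with `paste`, so the `c(...)` text never appears. This is a sketch rather than the answer's own method. It uses the question's `final` and `cv` names with base data frames, the helper names `pattern` and `matches` are hypothetical, and it assumes the phrases in `cv$V2` contain no regex metacharacters.

```r
library(stringr)

# Data from the question, rebuilt as base data frames
final <- data.frame(
  V1 = c("humans, aged, female, stroke", "infant, male, echocardiography",
         "aneurysm, adolescent, female, diabetes", "pregnant, diabetes, female",
         "cardiovascular diseases, complications"),
  stringsAsFactors = FALSE
)
cv <- data.frame(
  V2 = c("stroke", "pregnant", "echocardiography", "aneurysm",
         "diabetes", "cardiovascular diseases"),
  stringsAsFactors = FALSE
)

# Alternation of all phrases (assumes they are plain text, not regex syntax)
pattern <- paste(cv$V2, collapse = "|")

# One character vector of matches per row
matches <- str_extract_all(final$V1, pattern)

# Join each row's matches into a single comma-separated string
final$V1 <- vapply(matches, paste, character(1), collapse = ", ")
final
```

Note that this matches substrings rather than whole comma-separated terms, so a phrase in `cv$V2` that also occurs inside a longer word would still be picked up. A row with no matches becomes an empty string.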

edited Aug 15 '17 at 17:20

answered Aug 11 '17 at 21:15


thanks, but your code only extracted the first phrase that shows up in each row. for instance, in the third row, I want both `aneurysm, diabetes` but your code only outputs `aneurysm` – sweetmusicality Aug 11 '17 at 23:58
@sweetmusicality Please see my updates. I believe `df_final` matches your expected output now. – www Aug 12 '17 at 00:32
thanks - I will try this. could you update your variables however so that both dataframes have different variables (`V1` and `V2`) because I'm afraid that I will use the wrong variables otherwise. thanks! – sweetmusicality Aug 15 '17 at 17:14
Based on the update to your original post, I have updated my code accordingly. – www Aug 15 '17 at 17:21
how would I go about string matching if `final$V1` doesn't have phrases separated by commas (no separation at all - and still need exact match from `cv$V2`? – sweetmusicality Oct 06 '17 at 18:27
@sweetmusicality You may want to ask a new question. – www Oct 06 '17 at 18:32
I have, but no one is answering it, yikes: https://stackoverflow.com/questions/46592850/finding-all-string-matches-from-another-dataframe-in-r – sweetmusicality Oct 06 '17 at 18:55
