I am relatively new to R.

I have two data frames, final and cv, each with a single variable.

final looks like:

V1
humans, aged, female, stroke
infant, male, echocardiography
aneurysm, adolescent, female, diabetes
pregnant, diabetes, female
cardiovascular diseases, complications

and cv looks like:

V2
stroke
pregnant
echocardiography
aneurysm
diabetes
cardiovascular diseases

I want to manipulate final so that it only includes the text present in cv. This is what I want the resulting dataframe of final to look like:

V1
stroke
echocardiography
aneurysm, diabetes
pregnant, diabetes
cardiovascular diseases

Please advise. Thanks!

sweetmusicality
  • Do not copy/paste your data here. Please read [How to make a great reproducible example in R?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – M-- Aug 11 '17 at 21:09

1 Answer


We can use functions from `dplyr` and `stringr`. In addition, the `or1` function from `rebus` is very useful for building a regular expression that matches any one of several phrases. `str_extract_all` extracts every matched string. If a row contains more than one matching phrase, the output for that row looks like `c("aneurysm", "diabetes")`. I used several `str_replace` calls with `fixed` to remove `c(`, `)`, and `"`. This part could be done more efficiently with regex, but I am not familiar with regex. `df_final` is the final output.

# Load packages
library(dplyr)
library(stringr)
library(rebus)

# Create example data frames (tibble() replaces the deprecated data_frame())
df1 <- tibble(V1 = c("humans, aged, female, stroke", "infant, male, echocardiography",
                     "aneurysm, adolescent, female, diabetes", "pregnant, diabetes, female",
                     "cardiovascular diseases, complications"))
df2 <- tibble(V2 = c("stroke", "pregnant", "echocardiography", "aneurysm",
                     "diabetes", "cardiovascular diseases"))

# Process the data
df_final <- df1 %>%
  mutate(V1 = str_extract_all(V1, or1(df2$V2))) %>%
  mutate(V1 = str_replace(V1, fixed("c("), "")) %>%
  mutate(V1 = str_replace(V1, fixed(")"), "")) %>%
  mutate(V1 = str_replace_all(V1, fixed('"'), ""))
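
As a possible alternative to stripping `c(`, `)`, and `"` afterwards, the matches for each row can be pasted together directly. The following is a minimal base R sketch, not the approach used above. It assumes the terms in `cv$V2` contain no regex metacharacters, and the names `pattern` and `matches` are illustrative only.

```r
# Example data mirroring the question (columns V1 and V2 as in the post)
final <- data.frame(V1 = c("humans, aged, female, stroke",
                           "infant, male, echocardiography",
                           "aneurysm, adolescent, female, diabetes",
                           "pregnant, diabetes, female",
                           "cardiovascular diseases, complications"),
                    stringsAsFactors = FALSE)
cv <- data.frame(V2 = c("stroke", "pregnant", "echocardiography", "aneurysm",
                        "diabetes", "cardiovascular diseases"),
                 stringsAsFactors = FALSE)

# Build a single alternation pattern: "stroke|pregnant|..."
# (assumes the terms contain no regex metacharacters)
pattern <- paste(cv$V2, collapse = "|")

# gregexpr() locates every match in each row;
# regmatches() returns them as a list of character vectors
matches <- regmatches(final$V1, gregexpr(pattern, final$V1))

# Collapse each row's matches back into one comma-separated string
final$V1 <- vapply(matches, paste, character(1), collapse = ", ")

print(final)
```

Because the pattern is a plain alternation, it also matches inside longer words, so a term such as `male` would match within `female`. Wrapping the alternation in word boundaries, for example `paste0("\\b(", pattern, ")\\b")`, is one way to guard against that.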
www
  • thanks, but your code only extracted the first phrase that shows up in each row. for instance, in the third row, I want both `aneurysm, diabetes` but your code only outputs `aneurysm` – sweetmusicality Aug 11 '17 at 23:58
  • @sweetmusicality Please see my updates. I believe `df_final` matches your expected output now. – www Aug 12 '17 at 00:32
  • thanks - I will try this. could you update your variables however so that both dataframes have different variables (`V1` and `V2`) because I'm afraid that I will use the wrong variables otherwise. thanks! – sweetmusicality Aug 15 '17 at 17:14
  • Based on the update to your question, I have updated my code accordingly. – www Aug 15 '17 at 17:21
  • how would I go about string matching if `final$V1` doesn't have phrases separated by commas (no separation at all) and I still need an exact match from `cv$V2`? – sweetmusicality Oct 06 '17 at 18:27
  • @sweetmusicality You may want to ask a new question. – www Oct 06 '17 at 18:32
  • I have, but no one is answering it, yikes: https://stackoverflow.com/questions/46592850/finding-all-string-matches-from-another-dataframe-in-r – sweetmusicality Oct 06 '17 at 18:55