Extract data from multiple columns in R data frame, then searching another

Question

I have a central data frame of information (df3) that I'm trying to subset and add columns to based on data extracted from several columns of another (df2), that itself comes from a subset of a third (df1). I've managed to get so far by searching help and playing around with various functions, but I have reached an impasse. I do hope you can help.

To begin with, the 3dfs are structured as follows:

#df1 - my initial search database
id <- c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8")
yesno <- c("Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No")
city <- c("London", "London", "Paris", "London", "Paris", "New York", "London", "London")
df1 <- cbind(id, yesno, city)
df1 <- as.data.frame(df1)
df1

#df2 - containing the data needed to search df3, but situated across columns
id <- c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8")
twitter <- c("@one","", "@three", "@four", "", "", "@seven", "")
email <- c("", "", "", "add4", "add5","", "add7", "")
mail <- c("", "postcode2", "", "","","","","postcode8")
df2 <- cbind(id, twitter, email, mail)
df2 <- as.data.frame(df2)
df2

#df3 - the central df containing the data I wish to extract
comms <- c("@one", "postcode2", "@three", "@four", "add4", "add5", "six" "@seven", "add7", "postcode2")
target <- c("text1", "text2", "text3", "text4.1", "text4.2", "text5", "text6", "text7.1","text7.2", "text8")
df3 <- cbind(comms,target)
df3 <- as.data.frame(df3)
df3

The commonality between df1 and df2 is found in the id columns. I've so far been able to filter df1 and extract the ids, which I've then used to subset df2.

   df_search <- df1 %>%
   filter(yesno == "Yes", city == "London")

   df_search_ids <- df_search$id

   df2_search <- df2 %>%
   filter(id %in% df_search_ids)
   df2_search

       id twitter email      mail
     1 id1    @one                
     2 id2               postcode2
     3 id4   @four  add4          
     4 id7  @seven  add7

My problems are: the common data between df2 and df3 are spread across three different columns of df2 (twitter, email and mail); these columns contain blank cells and other extraneous info (e.g. 'I am not on Twitter'); and finally that some of the entries in df2 (such as id4 and id7 above) have more than one entry in df3.

The solution I am trying to reach is that I would like to extract all instances from the columns twitter, email and mail of df2 based on a match with the ids extracting from df1, so that the extracted info can then be applied to subset df3 and eventually results in a new df(target_res) that looks like this:

    id_res <- c("id1", "id2", "id4", "id4", "id7", "id7")
    comms_res <- c("@one", "postcode2", "@four", "add4", "@seven", "add7")
    target_res <- c("text1", "text2", "text4.1", "text4.2", "text7.1", "text7.2")
    result_df <- cbind(id_res, comms_res, target_res)
    result_df <- as.data.frame(result_df)
    result_df

      id_res comms_res target_res
    1    id1      @one      text1
    2    id2  postcode2      text2
    3    id4     @four    text4.1
    4    id4      add4    text4.2
    5    id7    @seven    text7.1
    6    id7      add7    text7.2

This is an action I will be performing a number of times (based on different explorations of df1), so ideally would be replicable.

I hope this is a clear explanation of the issue.

What about duplicates in lookup of df3? your df3 has two rows with `postcode2`. Do you want both, the first? — aichao, Aug 04 '16 at 17:30
Thanks for the response. I would like all instances from df3 where there is a match in the comms column with something from the twitter, email, mail columns in df2. There are numerous duplicates in the comms column, but the instances in target are unique, so I would like all of them. — Craig Hamilton, Aug 04 '16 at 17:46
Am playing around with str_match, but can't quite seem to make it work just yet. — Craig Hamilton, Aug 04 '16 at 17:47

score 2 · Accepted Answer · edited May 23 '17 at 11:44

The key is to use tidyr::gather to gather the twitter:mail columns (from your filtered df2_search) as rows under a new column comms and then filter again to remove the empty "" rows. Your second pipe can then be:

library(dplyr)

result <- df2 %>% filter(id %in% df_search_ids) %>% 
                  gather("source","comms",twitter:mail) %>% 
                  filter(comms != "") %>%
                  inner_join(df3, by="comms") %>% 
                  select(id_res=id,comms_res=comms,target_res=target) %>%
                  arrange(id_res)

The look up for df3 is then an inner_join by comms, which keeps only the rows matched in both data frames. The rest is formatting the output result.

With this you should get with your input:

print(result)
##  id_res comms_res target_res
##1    id1      @one      text1
##2    id2 postcode2      text2
##3    id2 postcode2      text8
##4    id4     @four    text4.1
##5    id4      add4    text4.2
##6    id7    @seven    text7.1
##7    id7      add7    text7.2
##Warning messages:
##1: attributes are not identical across measure variables; they will be dropped 
##2: In inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) :
##  joining character vector and factor, coercing into character vector

Edit to get rid of warnings

As evident above, there are two warnings from the processing:

The first is from gather, and the explanation for this is found here.
The second is from the inner_join.

A trivial solution to get rid of both of these warnings is to convert the relevant data columns from factors to character vectors. For the warning from gather, the columns twitter, email, and mail from df2 need to be converted, and from the inner_join, the column comms from df3 needs to be converted. This can be done using:

df2[,2:4] <- sapply(df2[,2:4], as.character)
df3$comms <- as.character(df3$comms)

before processing.

Note that the result$comms_res column is now a character vector instead of a factor with levels from the original df3$comms (actually, even if we did not convert to characters, the result will be a character vector because inner_join does it for us as the warning says). This is OK if we don't care to preserve the factor in the result. However, if we actually do care about the set of possible levels from df3$comms that we want to preserve in result$comms_res, then we need to first save these from df3$comms before converting to characters:

## save these levels before converting to characters
df3.comms.levels <- levels(df3$comms)
df3$comms <- as.character(df3$comms)

and then convert both df3$comms and result$comms_res back to a factor with these levels after processing:

df3$comms <- factor(df3$comms, levels=df3.comms.levels)
result$comms_res <- factor(result$comms_res, levels=df3.comms.levels)

Hope this helps.

Thank you so much, @aichao - that works, and your explanation makes a lot of sense. — Craig Hamilton, Aug 04 '16 at 18:14
You're welcome. I'll be editing this further to get rid of those warnings, so stay tuned. — aichao, Aug 04 '16 at 18:15
This is great, @aichao - thanks so much for the comprehensive response. I also took a closer look at tidyr and can see how a sensible use of that would resolve this issue at source, by making df2 better organised. — Craig Hamilton, Aug 05 '16 at 07:01

Extract data from multiple columns in R data frame, then searching another

1 Answers1