-1

I have a question that is somewhat similar to this, but this solution did not work for me: MATCH function in r

I have two data frames that look like this:

df1

 query         page
 apple         site.com
 apple         site.com/apples
 banana        site.com/bananas
 bad apples    site.com/apples'

df2

page                sessions
site.com             20
site.com/apples      30
site.com/bananas     50'

I need a new column in df1 that shows sessions. When I tried using match() per the discussion above, the entire column returned as "NA." This was my code:

df1$sessions <- df2$sessions[match(df1$page, df2$page)]

I also tried merge:

df_merged = merge(df1, df2, by="page", all.x = TRUE)

It seems like a left join makes sense here, perhaps using sqldf + function(), but I can't seem to formulate this correctly and I'm not entirely sure if function is even needed or if sqldf can do this on its own. It would be fine to show "NA" in df1 wherever there is not a match but I'm getting "NA" even when there should be a match. I tried this, but to no avail:

df_merged <- left_join(df1, df2 by='page')
Jeff Swanson
  • 45
  • 1
  • 8
  • Hi. "I need a new column in df1 that shows sessions." is not clear. Please use enough words, phrases & sentences to say what you mean. Please give a [mcve] to help explain what you want. You do not say what output(s) you expect or would accept for that input. Find out what inner join & left join & other functions do, don't wonder. – philipxy May 04 '18 at 10:38

1 Answers1

1

Well the raw SQL query you would want is this:

SELECT df1.query, df1.page, df2.sessions
FROM df1
LEFT JOIN df2
    ON df1.page = df2.page

We can easily enough run this query using the sqldf package:

library(sqldf)
sql <- "SELECT df1.query, df1.page, df2.session FROM df1 LEFT JOIN df2 ON df1.page = df2.page"
result <- sqldf(sql)

I don't know what values you will get under the session column for those df1 rows which do not match. I would expect NA in this case.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • Appreciate the response Tim. I tried this out and I am still getting 'NA' for the entire column. I'm not sure if this is more of a lookup type of problem but it seems like it has something to do with the fact that the same value in df2 needs to be repeated numerous times in df1. – Jeff Swanson Apr 30 '18 at 17:28
  • No, I think the problem is your data, not my code. Does the `page` column have any whitespace in it, in either of the two data frames? – Tim Biegeleisen Apr 30 '18 at 17:31
  • Ugh. I can't believe I missed this. One of the data sets includes https and one of them does not. So I just need to remove that so they can match. Thanks again. – Jeff Swanson Apr 30 '18 at 17:36
  • Size doesn't matter (to reapply that age old adage). If a row in `df1` has a page which can match to a page in `df2`, you should be pulling in the corresponding session. – Tim Biegeleisen Apr 30 '18 at 17:36