I'm using sparklyr for the first time and I'm having trouble matching strings across two vectors to create a new variable at scale. My problem has the following general structure:
I have one large dataset of URLs:
df_1 <- data.frame(
  col1 = c(1,2,3,4,5,6,7,8,9,10),
  col2 = c("john.com/abcd", "ringo.com/defg", "paul.com/hijk", "george.com/lmno", "rob.com/pqrs", "sam.com/tuvw",
           "matt.com/xyza", "lenny.com/bcde", "bob.com/fghi", "tom.com/jklm"))
col1 col2
1 john.com/abcd
2 ringo.com/defg
3 paul.com/hijk
4 george.com/lmno
5 rob.com/pqrs
6 sam.com/tuvw
7 matt.com/xyza
8 lenny.com/bcde
9 bob.com/fghi
10 tom.com/jklm
And another smaller dataset of general domains:
df_2 <- data.frame(
  col1 = c(1,2,3,4,5,6,7),
  col2 = c("john.com", "jake.com", "tim.com", "paul.com", "rob.com", "harry.com", "chris.com"))
col1 col2
1 john.com
2 jake.com
3 tim.com
4 paul.com
5 rob.com
6 harry.com
7 chris.com
I want to use the vector of domains in df_2 (df_2$col2) to create a dummy variable for df_1 that indicates whether any of those domains occurs within the URLs in df_1 (df_1$col2). The resulting dataframe should look like df_3.
df_3 <- data.frame(
  col1 = c(1,2,3,4,5,6,7,8,9,10),
  col2 = c("john.com/abcd", "ringo.com/defg", "paul.com/hijk", "george.com/lmno", "rob.com/pqrs", "sam.com/tuvw",
           "matt.com/xyza", "lenny.com/bcde", "bob.com/fghi", "tom.com/jklm"),
  col3 = c(1,0,1,0,1,0,0,0,0,0))
col1 col2 col3
1 john.com/abcd 1
2 ringo.com/defg 0
3 paul.com/hijk 1
4 george.com/lmno 0
5 rob.com/pqrs 1
6 sam.com/tuvw 0
7 matt.com/xyza 0
8 lenny.com/bcde 0
9 bob.com/fghi 0
10 tom.com/jklm 0
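To make the target logic concrete, here is a minimal sketch in plain base R on the toy data above. It is not sparklyr code and would not scale to a Spark table; it only shows what I mean by col3. The helper name `hits` is just for illustration, and I am assuming a literal substring match is what I want (so a URL like "xjohn.com/abcd" would also count as a match for "john.com").

```r
# Local base-R sketch of the intended col3 on the toy data (not sparklyr code).
df_1 <- data.frame(
  col1 = c(1,2,3,4,5,6,7,8,9,10),
  col2 = c("john.com/abcd", "ringo.com/defg", "paul.com/hijk", "george.com/lmno",
           "rob.com/pqrs", "sam.com/tuvw", "matt.com/xyza", "lenny.com/bcde",
           "bob.com/fghi", "tom.com/jklm"),
  stringsAsFactors = FALSE)

df_2 <- data.frame(
  col1 = c(1,2,3,4,5,6,7),
  col2 = c("john.com", "jake.com", "tim.com", "paul.com", "rob.com",
           "harry.com", "chris.com"),
  stringsAsFactors = FALSE)

# For each domain, test every URL for a literal (non-regex) substring match.
# The result is a logical matrix: one row per URL, one column per domain.
hits <- vapply(
  df_2$col2,
  function(d) grepl(d, df_1$col2, fixed = TRUE),
  logical(nrow(df_1)))

# A URL gets 1 if at least one domain matched it, otherwise 0.
df_3 <- df_1
df_3$col3 <- as.numeric(rowSums(hits) > 0)
```

On the ten example URLs this gives 1 for the john.com, paul.com and rob.com rows and 0 everywhere else, which is the col3 shown in df_3.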
I have read this post: How to filter on partial match using sparklyr
Based on it, I tried coding the match separately for each individual observation of df_2, with something like:
df_3 <- df_1 %>%
  mutate(col3 =
    ifelse(like(df_1$col2, "john.com") | like(df_1$col2, "jake.com") | etc., 1, 0))
But so far I have either hit stack limits or found that R does not recognize the like function. There must be an easier way to do this. Thank you for any help.