1

I have this dataframe:

df <- data.frame(id_1=c('888046309', '888046309', '888046309', '888046309', '003046309', '465798132', '465798132', '465798132', '465798132', '465798132', '465798132', '465798132', '465798132'), 
                 id_2=c('0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309'))

and I would like to create a column that would indicate where a part of id_2 is present in id_1.

I tried this from this suggestion

i <- sapply(seq_along(df$id_2), function(i) grepl(df$id_2[i], df$id_1[i]))

df$flag <- c("No", "Yes")[i + 1L]

and flag had all NO values even though you can see that 46309 is present in both id_1 & id_2.

I then tried this from this suggestion

df$flag_2 <- str_detect(df$id_1, df$id_2)

And I got FALSEs for flag_2

Finally I tried this, and like flag_2, I got all FALSEs for flag_3.

df <- df %>% 
    mutate(flag_3 = c('No', 'Yes')[1+str_detect(id_1, as.character(id_2))])

Can these suggestions be edited so that it can indicate if some part of id_2 is present in id_1?

jay.sf
  • 60,139
  • 8
  • 53
  • 110
tangerine7199
  • 443
  • 2
  • 8
  • 24
  • Perhaps [this](https://stackoverflow.com/q/28261825/13513328) helps – Waldi May 18 '21 at 16:43
  • the function listed at the end of this suggestion only allows for situations where the leading numbers are the same. so thanks but it didn't work. – tangerine7199 May 18 '21 at 16:50
  • So, just to be clear, 1 character overlap would still need to be marked as "Yes"? – slamballais May 18 '21 at 17:06
  • good question. How about four or more? – tangerine7199 May 18 '21 at 17:06
  • This answer seems to work for your question, albeit you need to add `nchar( ) >= 4`: https://stackoverflow.com/a/48702045/5805670 – slamballais May 18 '21 at 17:12
  • i'm really sorry @slamballais but I dont know how one would go about applying this to a dataframe – tangerine7199 May 18 '21 at 17:16
  • `apply(df, 1, function(x) nchar(larsub(x)) > 4)` . Edit: Do note that this solution takes ANY 4 characters, they don't have to be subsequent. If this is not what you want, maybe try and put all the requirements that you can think of into the opening post. – slamballais May 18 '21 at 17:30

2 Answers2

1

I am not sure if this is what you have in mind, I tried it and I got some encouraging results, but there is still room for improvement. For this I erased 2 0 from the id_2 maybe it wasn't a good call but it facilitate fuzzy matching here.

df$flag <- sapply(1:nrow(df), function(x) agrepl(substr(df$id_2[x], 2, nchar(df$id_2[x])), 
                                                 df$id_1[x], max.distance = 4))


df

        id_1       id_2  flag
1  888046309 0003046309  TRUE
2  888046309 0003046309  TRUE
3  888046309 0003046309  TRUE
4  888046309 0003046309  TRUE
5  003046309 0003046309  TRUE
6  465798132 0003046309 FALSE
7  465798132 0003046309 FALSE
8  465798132 0003046309 FALSE
9  465798132 0003046309 FALSE
10 465798132 0003046309 FALSE
11 465798132 0003046309 FALSE
12 465798132 0003046309 FALSE
13 465798132 0003046309 FALSE
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
  • It doesn't generalize well to new data: `agrepl("8888", "8887", max.distance = 4)` returns `TRUE` (for any `max.distance` except `0`). – slamballais May 18 '21 at 17:28
  • It should return `TRUE` for your example because they partial match. But if you try it with `grepl` the result is `FALSE`. I know this is not a very good solution but still have some `TRUE`s than all `FALSE`. – Anoushiravan R May 18 '21 at 17:33
  • Yeah, sorry, I was referencing to OPs requirement for >3 matching characters (see comments), but it seems they are happy with the answer. – slamballais May 18 '21 at 17:35
  • 1
    No it's ok there are certainly better solutions than this as it only meets bare minimum of OP's requirements. But I thought it still better than no match. It should after all be tested on a larger data set. – Anoushiravan R May 18 '21 at 17:43
  • 1
    @slamballais `agrepl` function has proven very useful to me but I don't know why it is not used very often by users. – Anoushiravan R May 18 '21 at 17:44
  • 1
    Thank you for this! What if i wanted to do four or more? I tried max.distance >= 4)) and an error was produced: " Error in .amatch_bounds(max.distance) : object 'max.distance' not found " – tangerine7199 May 18 '21 at 18:38
  • You're welcome. Glad it could help somehow. The value you supply to `max.distance` should be an interger or a fraction of the pattern length. It cannot be a logical expression. – Anoushiravan R May 18 '21 at 18:42
1

You could use adist which basically calculates the cost to convert one string into another. Define a threshold how much replacement you want to allow and set a respective flag.

df$d <- t(apply(df[c('id_1', 'id_2')], 1, adist))[,2]
df
#         id_1       id_2 d
# 1  888046309 0003046309 4
# 2  888046309 0003046309 4
# 3  888046309 0003046309 4
# 4  888046309 0003046309 4
# 5  003046309 0003046309 1
# 6  465798132 0003046309 9
# 7  465798132 0003046309 9
# 8  465798132 0003046309 9
# 9  465798132 0003046309 9
# 10 465798132 0003046309 9
# 11 465798132 0003046309 9
# 12 465798132 0003046309 9
# 13 465798132 0003046309 9

th <- nchar(df$id_2)[1] - 2
df$flag <- with(df, ifelse(d > th, 1, 0))
df
#         id_1       id_2 d flag
# 1  888046309 0003046309 4    0
# 2  888046309 0003046309 4    0
# 3  888046309 0003046309 4    0
# 4  888046309 0003046309 4    0
# 5  003046309 0003046309 1    0
# 6  465798132 0003046309 9    1
# 7  465798132 0003046309 9    1
# 8  465798132 0003046309 9    1
# 9  465798132 0003046309 9    1
# 10 465798132 0003046309 9    1
# 11 465798132 0003046309 9    1
# 12 465798132 0003046309 9    1
# 13 465798132 0003046309 9    1

Combined

We can of course combine the two steps then it looks like this:

th <- 8  ## arbitrary
df <- within(df, flag <- t(apply(df[c('id_1', 'id_2')], 1, adist))[,2] < th)
df
#         id_1       id_2  flag
# 1  888046309 0003046309  TRUE
# 2  888046309 0003046309  TRUE
# 3  888046309 0003046309  TRUE
# 4  888046309 0003046309  TRUE
# 5  003046309 0003046309  TRUE
# 6  465798132 0003046309 FALSE
# 7  465798132 0003046309 FALSE
# 8  465798132 0003046309 FALSE
# 9  465798132 0003046309 FALSE
# 10 465798132 0003046309 FALSE
# 11 465798132 0003046309 FALSE
# 12 465798132 0003046309 FALSE
# 13 465798132 0003046309 FALSE
jay.sf
  • 60,139
  • 8
  • 53
  • 110