Suppose I have two dataframes which look like this:

df1
ID  Chr
1   a
2   a
3   a
4   a
5   a
6   a
7   b
8   b
9   b
10  b
11  c
12  c
13  a
14  a
15  a
16  a
17  c
18  c
19  c
20  a

df2
ID  Chr
1   a
2   a
3   b
4   b
5   b
6   b
7   b
8   b
9   b
10  b
11  c
12  c
13  a
14  a
15  c
16  c
17  c
18  a
19  a
20  a

If you compare the two data frames, you can see that they are quite similar; when they look like this, I consider them part of the same set. The problem is that they are not well aligned. In this small sample it might not seem like a big deal, but in the actual data, which has more than 1000 rows, the misalignment is a big problem.

My matching algorithm is pretty basic: it compares each row of df1 to the corresponding row of df2 and gives a score of 1 if there is a match and 0 for a mismatch. What complicates things is that I'm not matching all the rows of the data frames at once. Due to the circumstances I have to do partial matches. For example, with the above data I would match in blocks of 5 rows: the first five rows of df1 against the first five rows of df2, and so on. The smaller I make the blocks, the worse the issue becomes.
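A minimal sketch of the block-wise scoring described above, in base R. The vectors chr1 and chr2 hold the Chr columns of df1 and df2 from the sample, and block_scores is a hypothetical helper name used only for illustration, not my actual code:

```r
# Chr columns of df1 and df2 from the sample data above.
chr1 <- c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b",
          "c", "c", "a", "a", "a", "a", "c", "c", "c", "a")
chr2 <- c("a", "a", "b", "b", "b", "b", "b", "b", "b", "b",
          "c", "c", "a", "a", "c", "c", "c", "a", "a", "a")

# Hypothetical helper: score 1 for each row where x and y agree, 0 otherwise,
# then sum the scores within consecutive blocks of `size` rows.
block_scores <- function(x, y, size = 5) {
  stopifnot(length(x) == length(y))
  flag  <- as.integer(x == y)                  # 1 = match, 0 = mismatch
  block <- ceiling(seq_along(flag) / size)     # block index for every row
  tapply(flag, block, sum)                     # matches per block
}

scores <- block_scores(chr1, chr2, size = 5)
print(scores)
```

On the sample data this gives 2, 4, 4 and 2 matches for the four blocks, so 12 of 20 rows match overall even though the two sequences contain largely the same runs.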

So the question is: can I do something about the alignment without having to resort to matching the entire data frames at once?

Cettt
Nix
    Are you searching for something like [these](https://stackoverflow.com/questions/9065536/text-comparison-algorithm)? – Rui Barradas Jul 05 '19 at 14:47
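One option in the spirit of text-comparison algorithms is to score each block by edit distance instead of position-by-position equality, so that a small shift inside a block costs less. The following is only a sketch under assumptions: it uses adist() from base R's utils package (generalized Levenshtein distance), it assumes every Chr label is a single character so a block can be collapsed into one string, and block_edit_distance is a hypothetical helper name:

```r
# Chr columns of df1 and df2 from the question.
chr1 <- c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b",
          "c", "c", "a", "a", "a", "a", "c", "c", "c", "a")
chr2 <- c("a", "a", "b", "b", "b", "b", "b", "b", "b", "b",
          "c", "c", "a", "a", "c", "c", "c", "a", "a", "a")

# Hypothetical helper: collapse each block of `size` rows into one string
# (assumes single-character labels) and compute the edit distance between
# block i of x and block i of y. Lower values mean more similar blocks.
block_edit_distance <- function(x, y, size = 5) {
  stopifnot(length(x) == length(y))
  block <- ceiling(seq_along(x) / size)
  s1 <- vapply(split(x, block), paste, character(1), collapse = "")
  s2 <- vapply(split(y, block), paste, character(1), collapse = "")
  # adist() returns all pairwise distances; the diagonal pairs block i with block i.
  diag(utils::adist(s1, s2))
}

dists <- block_edit_distance(chr1, chr2, size = 5)
print(dists)
```

Edit distance still compares a block only with its counterpart, so it tolerates shifts within a block but not runs that drift across block boundaries; whether that is enough depends on how far the real data is out of alignment.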

1 Answer

I am not sure if I understood you correctly. If you only want to compare the chr columns, you could join the two tables on id and then compare the chr columns row by row.

This is very easy if you use the dplyr package. First I create some toy data:

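# Toy data; stringsAsFactors = FALSE keeps chr as character (the default since R 4.0.0)
df1 <- data.frame(id = 1:5, chr = c("a", "a", "a", "b", "b"), stringsAsFactors = FALSE)
df2 <- data.frame(id = 1:5, chr = c("a", "b", "b", "b", "b"), stringsAsFactors = FALSE)

library(dplyr)

# Join on id, then flag rows where the two chr values agree (1) or differ (0)
left_join(df1, df2, by = "id", suffix = c("_1", "_2")) %>%
  mutate(flag = if_else(chr_1 == chr_2, 1, 0))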

  id chr_1 chr_2 flag
1  1     a     a    1
2  2     a     b    0
3  3     a     b    0
4  4     b     b    1
5  5     b     b    1
Cettt