I have a data frame with peptide sequences in the column "id". The sequences are grouped into many groups of roughly 2-10 rows each. Within a group, some peptides align almost perfectly (up to 4 differing characters) while others are completely different. I want to subset my data frame and create a new one with only the "unique" values, meaning only the sequences in each group that differ from one another. When several sequences align, I want to keep only the longest one (I have created a column, "charecter_count", holding the sequence length). I thought about using if/else together with the function adist() and a cutoff of < 6 (among sequences with fewer than 6 differences, only the one with max(charecter_count) is kept), but I have no idea how to start. Any ideas will be appreciated!
index | id | Description | charecter_count |
---|---|---|---|
3 | AAGKGPLATGGIAA | vlad12 | 14 |
4 | AAGKGPLATGGIAASGKK | vlad12 | 18 |
5 | AAKAQYRAAALLGAAVPG | bla872 | 18 |
6 | AAKPKVAKAKKVVVKKK | plm123 | 17 |
7 | AAPAPAAAPAPAPAAAPEP | bbaala | 19 |
8 | AAPAPAAAPAPAPAAAPEPE | bbaala | 20 |
9 | AAPAPAAAPAAAPAPAPEPER | bbaala | 21 |
443 | ILVRYTQPAPQVSTPT | cvacba | 16 |
444 | ILVRYTQPAPQVSTPTL | cvacba | 17 |
736 | NPSLPPPERPAAEAMC | cvacba | 16 |
Here, for example, I would want a new data frame with rows 4 (3 is basically the same but shorter), 5, 6, 9, 444 and 736 (444 and 736 have the same description but different sequences).
Using:
adist(all_peptides$id[3], all_peptides$id[4])
> I get 4, which is below my desired cutoff, so I would like only row 4 to be selected.
However, adist(all_peptides$id[444], all_peptides$id[736])
> is 16, so I would like both to be included in the new data frame. However, I don't know how to implement this on a larger scale (comparing all sequences within the same group, etc.).
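
To make the goal concrete, here is a rough base-R sketch of the kind of per-group filtering described above. It is not a definitive solution and rests on several assumptions: the groups are defined by the Description column, the cutoff is 6 (a distance below 6 counts as "the same peptide"), and a greedy rule is acceptable, where sequences are visited longest first and one is kept only if it is at least 6 edits away from every sequence already kept in its group. The helper name keep_longest_distinct is made up for illustration, and the data frame is rebuilt from the example rows only.

```r
# Toy copy of the example table; the real data frame is all_peptides.
all_peptides <- data.frame(
  index = c(3, 4, 5, 6, 7, 8, 9, 443, 444, 736),
  id = c("AAGKGPLATGGIAA", "AAGKGPLATGGIAASGKK", "AAKAQYRAAALLGAAVPG",
         "AAKPKVAKAKKVVVKKK", "AAPAPAAAPAPAPAAAPEP", "AAPAPAAAPAPAPAAAPEPE",
         "AAPAPAAAPAAAPAPAPEPER", "ILVRYTQPAPQVSTPT", "ILVRYTQPAPQVSTPTL",
         "NPSLPPPERPAAEAMC"),
  Description = c("vlad12", "vlad12", "bla872", "plm123", "bbaala", "bbaala",
                  "bbaala", "cvacba", "cvacba", "cvacba"),
  stringsAsFactors = FALSE
)
all_peptides$charecter_count <- nchar(all_peptides$id)

# Hypothetical helper: within one group, keep the longest sequence of each
# set of near-identical sequences (edit distance below `cutoff`).
keep_longest_distinct <- function(grp, cutoff = 6) {
  # Longest sequences first, so the longest member of a similar set wins.
  grp <- grp[order(-grp$charecter_count), , drop = FALSE]
  d <- adist(grp$id)  # pairwise edit distances within the group
  kept <- integer(0)
  for (i in seq_len(nrow(grp))) {
    # Keep row i only if it is at least `cutoff` edits from every kept row.
    if (all(d[i, kept] >= cutoff)) kept <- c(kept, i)
  }
  grp[kept, , drop = FALSE]
}

# Assumption: groups are defined by the Description column.
groups <- split(all_peptides, all_peptides$Description)
unique_peptides <- do.call(rbind, lapply(groups, keep_longest_distinct))
unique_peptides <- unique_peptides[order(unique_peptides$index), ]
rownames(unique_peptides) <- NULL
unique_peptides
```

On the example rows this keeps 4, 5, 6, 9, 444 and 736. Because the rule is greedy, a chain of sequences that are each close to the next but far from the first may be split differently than with a clustering approach such as hclust() followed by cutree() on the adist() matrix, which could be an alternative if that matters.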