0

I have a data frame with sequences of peptides in the row "ID". I have the sequences grouped into many groups with around 2-10 rows per group. The groups contain some peptides that align almost perfectly (up to 4 differences in characters) and others that are completely different. I want to subset my data frame and create a new one with only the "unique" values: meaning, only the values from each group that are different from one another. If there are few aligning sequences, I want the one remaining to be the longest one (I have created a column for "character_number"). I thought about using elseif and the function adist() with a cutoff of <6 (less than 6 differences - only the max(charecter_count) will be taken), but I have no idea how to start. any ideas will be appreciated!

index id Description charecter_count
3 AAGKGPLATGGIAA vlad12 14
4 AAGKGPLATGGIAASGKK vlad12 18
5 AAKAQYRAAALLGAAVPG bla872 18
6 AAKPKVAKAKKVVVKKK plm123 17
7 AAPAPAAAPAPAPAAAPEP bbaala 19
8 AAPAPAAAPAPAPAAAPEPE bbaala 20
9 AAPAPAAAPAAAPAPAPEPER bbaala 21
443 ILVRYTQPAPQVSTPT cvacba 16
444 ILVRYTQPAPQVSTPTL cvacba 17
736 NPSLPPPERPAAEAMC cvacba 16

here for example, I would want a new data frame with rows: 4 (3 is basically the same but shorter), 5,6,9,444,736 (here they both have the same description but different sequences)

using: adist(all_peptides$id[3],all_peptides$id[4]> I get 4, which is below by desired cutoff so I would like it to select only 4. however, adist(all_peptides$id[444],all_peptides$id[736])> is 16, so I would like both to b included in the new data frame. however, I don't know how to implement this on a larger scale (compare all sequences from the same group etc).

Nina
  • 49
  • 3
  • 1
    Please provide some data so we can better understand your question. Also see: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – langtang Mar 14 '23 at 14:29
  • 1
    I think an example of your input and desired output would help us think about your problem. You can use `dput(name_of_dataframe)` on your data and post the result here, or at least the top 10-20 rows. – TimTeaFan Mar 14 '23 at 14:29
  • 3
    Also, for those of us less familiar with the alignment of peptide sequences than others, some examples of how to calculate alignment would be useful. – Limey Mar 14 '23 at 14:47
  • Hi, I added some info, hope that helps – Nina Mar 15 '23 at 09:58

1 Answers1

2

Assuming that Description denotes what you call "group", you can group_by(Description) to filter conditionally within each group. For example:

library(dplyr)
library(purrr)

df <- tribble(
  ~index, ~id,                     ~Description, ~character_count,
  3,      "AAGKGPLATGGIAA",        "vlad12",     14,
  4,      "AAGKGPLATGGIAASGKK",    "vlad12",     18,
  5,      "AAKAQYRAAALLGAAVPG",    "bla872",     18,
  6,      "AAKPKVAKAKKVVVKKK",     "plm123",     17,
  7,      "AAPAPAAAPAPAPAAAPEP",   "bbaala",     19,
  8,      "AAPAPAAAPAPAPAAAPEPE",  "bbaala",     20,
  9,      "AAPAPAAAPAAAPAPAPEPER", "bbaala",     21,
  443,    "ILVRYTQPAPQVSTPT",      "cvacba",     16,
  444,    "ILVRYTQPAPQVSTPTL",     "cvacba",     17,
  736,    "NPSLPPPERPAAEAMC",      "cvacba",     16
)

df %>%
  group_by(Description) %>%
  filter(
    ifelse(
      row_number() == 1 & n() == 1, FALSE, # to not use map if only one id in group
      map_lgl(id, \(x) min(adist(x, id[id != x])) >= 6)
    ) %>%
      ifelse(., ., nchar(id) == max(nchar(id[!.])))
  ) %>%
  ungroup()

# A tibble: 6 × 4
  index id                    Description character_count
  <dbl> <chr>                 <chr>                 <dbl>
1     4 AAGKGPLATGGIAASGKK    vlad12                   18
2     5 AAKAQYRAAALLGAAVPG    bla872                   18
3     6 AAKPKVAKAKKVVVKKK     plm123                   17
4     9 AAPAPAAAPAAAPAPAPEPER bbaala                   21
5   444 ILVRYTQPAPQVSTPTL     cvacba                   17
6   736 NPSLPPPERPAAEAMC      cvacba                   16

filter() keeps values that are TRUE. Here, map_lgl(id, \(x) min(adist(x, id[id != x])) >= 6) first set TRUE for ids that are at least 6 characters different from the other ids in the group (i.e., above your cutoff); then ifelse(., ., nchar(id) == max(nchar(id[!.]))) set TRUE for the reminder ids (those below the cutoff) if they are the longest one. Note that this will keep identical/duplicate ids within a group, since you didn't specify how to deal with that.

EDIT: A slightly different code where you first explicitly define a variable for whether they are above the cutoff and then filter, which perhaps is easier to follow:

df %>%
  group_by(Description) %>%
  mutate(
    above_cutoff = ifelse(
      row_number() == 1 & n() == 1, FALSE, # so to not use adist if only one id in group
      map_lgl(id, \(x) min(adist(x, id[id != x])) >= 6)
    )
  ) %>%
  filter(above_cutoff | nchar(id) == max(nchar(id[!above_cutoff]))) %>%
  ungroup()