Return a single row out of multiple rows with partially matching entries

Question

I am reposting this question with more clarity, since my previous post did not get any solutions. Please help me with this.

Below is what I want to do:

I have a dataset named Proteome. It has 14 columns and thousands of rows. Column 5 of the first four rows contains:

Row 1, column 5: GHFCLKPGCNFHAESTRGYR
Row 2, column 5: FCLKPGCNFHAESTRGYR
Row 3, column 5: GHFCLKPGCNFHAESTR
Row 4, column 5: GCNFHAESTR

Please click on this link to see a screenshot of part of the original data frame: i67.tinypic.com/2wd0ap3.png

So, in row 2, the first two letters of row 1 are missing; in row 3, the last three letters of row 1 are missing; in row 4, the first seven and the last three letters of row 1 are missing.

Rows 2, 3, and 4 reflect the artifacts of the scientific method I have been using to generate the data, and therefore I want to remove these entries.

I want R to return only one of the four rows, ideally row 1, and remove the rest. One way to do this would be to first find all rows that share a matching string of letters and then eliminate all but one of them. For example, in the above data set, GCNFHAESTR occurs in all four rows, so I want R to return only one row, ideally the top one. But I don't know how to do this.
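To make the intended rule concrete, here is a minimal sketch in base R, assuming that "partially matching" means "the shorter sequence occurs in full inside a longer one". The helper name is_fragment is hypothetical and introduced only for illustration; the four strings are the ones listed above.

```r
# The four example sequences from column 5, rows 1 to 4.
seqs <- c("GHFCLKPGCNFHAESTRGYR",
          "FCLKPGCNFHAESTRGYR",
          "GHFCLKPGCNFHAESTR",
          "GCNFHAESTR")

# is_fragment() is a hypothetical helper: it returns TRUE when seqs[i]
# occurs inside some other, strictly longer element of seqs.
is_fragment <- function(i, seqs) {
  longer <- seqs[nchar(seqs) > nchar(seqs[i])]
  # fixed = TRUE treats the sequence as a literal string, not a regex
  any(grepl(seqs[i], longer, fixed = TRUE))
}

fragment <- vapply(seq_along(seqs), is_fragment, logical(1), seqs = seqs)

# Keep only the sequences that are not contained in a longer one.
seqs[!fragment]
```

Under this assumption only the full-length sequence from row 1 is kept. Note that rows 2 and 3 are not contained in each other, so if row 1 were absent from the data, both would be kept; exact duplicates are also left alone and would need a separate duplicated() pass.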

Hope this makes better sense this time. I look forward to hearing from the experts.

Thanks!

please provide a sample of your dataset. Read https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for instructions on how to ask a good question here. — Julian_Hn, Mar 28 '19 at 13:59

score 0 · Answer 1 · answered Mar 28 '19 at 18:47

In response to Julian_Hn's suggestion, here is the dput of my dataset:

    dput(Proteome)
    structure(list(Protein.name = structure(c(1L, 1L, 1L, 1L, 2L, 
    3L), .Label = c("HCTF", "IFT", "ROSF"), class = "factor"), X..Proteins = c(5L, 
    5L, 5L, 5L, 3L, 7L), X..PSMs = c(3L, 1L, 6L, 2L, 2L, 4L), Previous.5.amino.acids = structure(c(4L, 
    5L, 4L, 2L, 3L, 1L), .Label = c("CWYAT", "FCLKP", "MGCPT", "NCTMY", 
    "TMYFC"), class = "factor"), Sequence = structure(c(5L, 1L, 4L, 
    2L, 3L, 6L), .Label = c("FCLKPGCNFHAESTRGYR", "GCNFHAESTR", "GFGFNWPHAVR", 
    "GHFCLKPGCNFHAESTR", "GHFCLKPGCNFHAESTRGYR", "GNFSVKLMNR"), class = "factor")), .Names = c("Protein.name", 
    "X..Proteins", "X..PSMs", "Previous.5.amino.acids", "Sequence"
    ), class = "data.frame", row.names = c(NA, -6L))
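As a rough illustration of what the desired result looks like on this sample, the sketch below rebuilds only the Protein.name and Sequence columns as plain character vectors (an assumption made for brevity; the posted columns are factors and would need as.character()) and drops every row whose sequence is fully contained in a longer sequence. The names hits, longer, and fragment are hypothetical.

```r
# Reduced copy of the posted sample: two of the five columns, as characters.
Proteome <- data.frame(
  Protein.name = c("HCTF", "HCTF", "HCTF", "HCTF", "IFT", "ROSF"),
  Sequence = c("GHFCLKPGCNFHAESTRGYR", "FCLKPGCNFHAESTRGYR",
               "GHFCLKPGCNFHAESTR", "GCNFHAESTR",
               "GFGFNWPHAVR", "GNFSVKLMNR"),
  stringsAsFactors = FALSE
)

s <- Proteome$Sequence

# hits[i, j] is TRUE when s[i] contains s[j] as a literal substring.
hits <- sapply(s, function(p) grepl(p, s, fixed = TRUE), USE.NAMES = FALSE)

# longer[i, j] is TRUE when s[i] is strictly longer than s[j].
longer <- outer(nchar(s), nchar(s), ">")

# A row is a fragment if some strictly longer sequence contains it.
fragment <- colSums(hits & longer) > 0

Proteome[!fragment, ]
```

On this sample the rows kept are 1, 5, and 6. The comparison is all-against-all, so the two matrices grow with the square of the number of rows; for thousands of rows it may be worth comparing only within each Protein.name group, which is a design choice not made here.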
