
I have a data set that contains a column with strings made up of 4 letters (A,T,C,G); these strings range from 2-1991 characters long. I would like to subset all rows where the strings match a particular pattern. For example, I would like to create a new dataframe that subsets all rows where there are 0-10 consecutive Ts in column 17.

Please let me know if you require additional information and thank you for your time!

cbaudo
  • Please provide a [reproducible example](https://stackoverflow.com/questions/5963269/) (other refs: [how-to-ask](https://stackoverflow.com/help/how-to-ask) and [minimal, verifiable examples](https://stackoverflow.com/help/mcve)). (My suggestion, of the many offered in those examples, is to use `dput` or build the data statically.) – r2evans Aug 10 '18 at 18:18
  • By *column 17* do you mean starting at position 17 in the string? – Rui Barradas Aug 10 '18 at 18:20
  • Furthermore, *all rows where there are 0-10 consecutive Ts* is, well, all rows. If it has 0 Ts, it will return it, if it has 1 T it will return it, making 2, 3, etc Ts irrelevant. I think the problem is not well defined. – Rui Barradas Aug 10 '18 at 18:27
  • Are you familiar with regular expressions? What you're looking for is something like `vectorName[grep(".{17}TTTTTTTTTT", vectorName)]`. Look up regex if you're not familiar. – iod Aug 10 '18 at 18:33
  • Hi Rui: I have 17 columns in my dataframe, the strings are in column 17. In this example, AGCTCA there is 1 T and it would subset this row. In AGCAGAAAACGGGAGTTTTTT, it would also subset this row. But it would not subset in this row: TTTTTTTTTTTAGCAG (because it has 11 Ts). – cbaudo Aug 10 '18 at 18:37
  • OK, that makes it clear. – Rui Barradas Aug 10 '18 at 19:05

1 Answer


You could filter out all rows that contain 11 consecutive Ts. That pattern matches rows with exactly 11 consecutive Ts as well as rows with longer runs, so only rows with at most 10 consecutive Ts remain.

    ## Example vector
    v = c("TTTTTTTTTTACAGATAT","TTTACACAC","TTTTTTTTTTTTTACAGAT","TTTTTTTTTTTACAG")
    v[!grepl("T{11}",v)]
    [1] "TTTTTTTTTTACAGATAT" "TTTACACAC"
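
Since the question asks about rows of a data frame rather than a plain vector, the same logical test can be used as a row index. The following is only a sketch under stated assumptions: `df`, `id`, `sequence` and `seq_col` are hypothetical names, and the toy frame has two columns, so the sequence column is 2 here rather than 17.

```r
# Hypothetical toy data; strrep() makes the number of Ts explicit
v <- c(paste0(strrep("T", 10), "ACAGATAT"),
       "TTTACACAC",
       paste0(strrep("T", 13), "ACAGAT"),
       paste0(strrep("T", 11), "ACAG"))
df <- data.frame(id = seq_along(v), sequence = v, stringsAsFactors = FALSE)

# Assumption: the sequences are in column 2 of this toy frame;
# for the data described in the question this would be 17
seq_col <- 2

# TRUE for rows whose sequence contains a run of 11 or more consecutive Ts
has_long_run <- grepl("T{11}", df[[seq_col]])

# Keep only rows with at most 10 consecutive Ts
df_short <- df[!has_long_run, , drop = FALSE]
df_short$id
# [1] 1 2
```

Using `df[[seq_col]]` extracts the column as a character vector, which is what `grepl` expects.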

Edit to also include cases where you want to look for 11-20 consecutive Ts

If you want to select rows that have between 11 and 20 consecutive Ts, you could use a negative lookbehind and a negative lookahead to search for a stretch of 11 to 20 Ts that is neither preceded nor followed by a T. Lookarounds are only supported by the Perl-compatible engine, so `grepl` needs `perl=TRUE`.

    ## Second example vector:
    v2 = c("TTTTTTTTTTACAGATAT","TTTACACAC","TTTTTTTTTTTTTACAGAT","TTTTTTTTTTTACAG","ACTTTTTTTTTTTTTTTTTTTTTGCGCA")

    v2[grepl("(?<!T)T{11,20}(?!T)",v2,perl=TRUE)]
    [1] "TTTTTTTTTTTTTACAGAT" "TTTTTTTTTTTACAG"
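
If several bins are needed (0-10, 11-20, and so on), writing one lookaround pattern per bin can get tedious. A different technique, not the one used above, is to measure the longest run of Ts in each string once and then bin that number with `cut`. This is a minimal sketch: `longest_t_run`, the toy vector and the bin labels are all assumptions made for illustration.

```r
# Length of the longest run of consecutive Ts in each string (0 if none)
longest_t_run <- function(x) {
  hits <- gregexpr("T+", x)
  # match.length is -1 when there is no match, so max(0L, ...) yields 0
  vapply(hits, function(h) max(0L, attr(h, "match.length")), integer(1))
}

# Hypothetical toy data; strrep() makes the number of Ts explicit
v2 <- c(paste0(strrep("T", 10), "ACAGATAT"),
        "TTTACACAC",
        paste0(strrep("T", 13), "ACAGAT"),
        paste0(strrep("T", 11), "ACAG"),
        paste0("AC", strrep("T", 21), "GCGCA"))

runs <- longest_t_run(v2)
runs
# [1] 10  3 13 11 21

# Right-closed intervals: (-1,10], (10,20], (20,Inf]
bins <- cut(runs, breaks = c(-1, 10, 20, Inf),
            labels = c("0-10", "11-20", "21+"))

# One element per bin; split() works the same way on a data frame
subsets <- split(v2, bins)
subsets[["11-20"]]
```

Note the difference in meaning: binning by the longest run puts each row in exactly one bin, whereas the lookaround pattern selects any row that contains some run of 11 to 20 Ts, even if the same string also has a longer run elsewhere.
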
Lamia
  • Beat me to it by some seconds, but the regex should be `"T{11,}"` because it can be 11 *or more* Ts. Upvote. – Rui Barradas Aug 10 '18 at 19:12
  • Actually, if there are more than 11 consecutive Ts, then there will necessarily be 11 consecutive Ts, as mentioned in my answer. – Lamia Aug 10 '18 at 19:14
  • Thanks for the suggestion! My end goal is to have multiple subsets, including an 11-20 consecutive T dataset for example. This strategy would include <11 consecutive Ts and be quite tedious when I get into the higher bin sets. – cbaudo Aug 10 '18 at 19:17
  • Thanks a million, Lamia! – cbaudo Aug 10 '18 at 19:33