String subset using a pattern and a range of string lengths in R

Question

I have a data set that contains a column of strings made up of 4 letters (A, T, C, G); the strings range from 2 to 1991 characters long. I would like to subset all rows where the strings match a particular pattern. For example, I would like to create a new dataframe containing all rows where there are 0-10 consecutive Ts in column 17.

Please let me know if you require additional information and thank you for your time!

Please provide a [reproducible example](https://stackoverflow.com/questions/5963269/) (other refs: [how-to-ask](https://stackoverflow.com/help/how-to-ask) and [minimal, verifiable examples](https://stackoverflow.com/help/mcve)). My suggestion, of the many offered in those examples, is to use `dput` or build the data statically. — r2evans, Aug 10 '18 at 18:18
By *column 17* do you mean starting at position 17 in the string? — Rui Barradas, Aug 10 '18 at 18:20
Furthermore, *all rows where there are 0-10 consecutive Ts* is, well, all rows. If it has 0 Ts, it will return it, if it has 1 T it will return it, making 2, 3, etc Ts irrelevant. I think the problem is not well defined. — Rui Barradas, Aug 10 '18 at 18:27
Are you familiar with regular expressions? What you're looking for is something like 'vectorName[grep(".{17}TTTTTTTTTT", vectorName)]'. Look up regex if you're not familiar. — iod, Aug 10 '18 at 18:33
Hi Rui: I have 17 columns in my dataframe, and the strings are in column 17. For example, AGCTCA has 1 T, so this row would be included. AGCAGAAAACGGGAGTTTTTT would also be included. But TTTTTTTTTTTAGCAG would not be, because it has 11 Ts. — cbaudo, Aug 10 '18 at 18:37

Lamia · Accepted Answer · 2018-08-10T19:25:24.733

You could drop every row that contains 11 consecutive Ts. Any run longer than 11 Ts necessarily contains 11 consecutive Ts, so the pattern `T{11}` catches those rows too, and what remains are the rows whose runs of Ts are at most 10 long.

## Example vector
v <- c("TTTTTTTTTTACAGATAT", "TTTACACAC", "TTTTTTTTTTTTTACAGAT", "TTTTTTTTTTTACAG")
## Keep the strings that do not contain 11 consecutive Ts
v[!grepl("T{11}", v)]
## [1] "TTTTTTTTTTACAGATAT" "TTTACACAC"
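
Applied to a data frame, the same test can index rows directly. The following is a minimal sketch, not part of the original answer: `df` and its columns are hypothetical stand-ins for the asker's data (which has 17 columns, with the sequences in column 17), and the strings are built with `strrep` so that the run lengths are explicit.

```r
# Hypothetical stand-in for the asker's data frame; the sequences sit in the
# last column, just as they sit in column 17 of the real data.
df <- data.frame(
  id  = 1:4,
  seq = c(paste0(strrep("T", 10), "ACAGATAT"),  # longest run: 10 Ts
          "TTTACACAC",                          # longest run: 3 Ts
          paste0(strrep("T", 13), "ACAGAT"),    # longest run: 13 Ts
          paste0(strrep("T", 11), "ACAG")),     # longest run: 11 Ts
  stringsAsFactors = FALSE
)
seq_col <- ncol(df)  # would be 17 for the data described in the question

# Keep the rows whose sequence contains no run of 11 or more consecutive Ts
short_runs <- df[!grepl("T{11}", df[[seq_col]]), , drop = FALSE]
short_runs$id
## [1] 1 2
```

Using `df[[seq_col]]` extracts the column as a plain character vector, which is what `grepl` expects.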

Edit to also include cases where you want to look for 11-20 consecutive Ts

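If you want to select rows that contain a run of between 11 and 20 consecutive Ts, you could use a negative lookbehind and a negative lookahead to search for a stretch of 11 to 20 Ts that is neither preceded nor followed by a T. Lookarounds require PCRE, which is why `grepl` is called with the `perl` argument switched on.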

## Second example vector:
v2 <- c("TTTTTTTTTTACAGATAT", "TTTACACAC", "TTTTTTTTTTTTTACAGAT", "TTTTTTTTTTTACAG", "ACTTTTTTTTTTTTTTTTTTTTTGCGCA")

## Keep the strings with a run of 11 to 20 Ts that is not part of a longer run
v2[grepl("(?<!T)T{11,20}(?!T)", v2, perl = TRUE)]
## [1] "TTTTTTTTTTTTTACAGAT" "TTTTTTTTTTTACAG"
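
For several bins (0-10, 11-20, 21-30, and so on), one alternative to writing a separate lookaround pattern per bin is to measure the longest run of Ts in each string once and then bin that number. This is only a sketch under stated assumptions, not part of the original answer: `longest_t_run` is a hypothetical helper, the bin edges are illustrative, and the example vector is rebuilt with `strrep` so that the run lengths are explicit.

```r
# Hypothetical helper: length of the longest run of consecutive Ts in each
# string (0 when a string contains no T at all).
longest_t_run <- function(x) {
  runs <- regmatches(x, gregexpr("T+", x))  # every maximal run of Ts, per string
  vapply(runs, function(r) if (length(r)) max(nchar(r)) else 0L, integer(1))
}

seqs <- c(paste0(strrep("T", 10), "ACAGATAT"),
          "TTTACACAC",
          paste0(strrep("T", 13), "ACAGAT"),
          paste0(strrep("T", 11), "ACAG"),
          paste0("AC", strrep("T", 21), "GCGCA"))

run_len <- longest_t_run(seqs)
run_len
## [1] 10  3 13 11 21

# Assumed bin edges; cut() uses right-closed intervals, so 10 falls in "0-10"
bins <- cut(run_len,
            breaks = c(-Inf, 10, 20, 30, Inf),
            labels = c("0-10", "11-20", "21-30", "31+"))

# One element per bin; each can be used to subset the rows of a data frame
subsets <- split(seqs, bins)
subsets[["11-20"]]
## [1] "TTTTTTTTTTTTTACAGAT" "TTTTTTTTTTTACAG"
```

Binning by the longest run is slightly different from the lookaround pattern: a string that contains both a 15-T run and a 25-T run matches the 11-20 pattern, but lands in the 21-30 bin here.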

edited Aug 10 '18 at 19:25

answered Aug 10 '18 at 19:10

Lamia

Beat me to it by some seconds, but the regex should be `"T{11,}"` because it can be 11 *or more* Ts. Upvote. – Rui Barradas Aug 10 '18 at 19:12
Actually, if there are more than 11 consecutive Ts, then there will necessarily be 11 consecutive Ts, as mentioned in my answer. – Lamia Aug 10 '18 at 19:14
Thanks for the suggestion! My end goal is to have multiple subsets including a 11-20 consecutive T dataset for example. This strategy would include <11 consecutive Ts and be quite tedious when I get into the higher bin sets. – cbaudo Aug 10 '18 at 19:17
Thanks a million, Lamia! – cbaudo Aug 10 '18 at 19:33
