How to find a sequence in R in the middle of a text?

Question

Say there is a string made up of t and f. How might one use the grep function to find a pattern that starts with f, stays in f for some time and then switches to t? I also want to count how long it stays in t.

a <- "fffftttfff"
b <- "fttttttfff"
c <- "tttttttttt"
d <- "fffffffftf"
path_ <- c(a,b,c,d)
ID <- 1:4

library(data.table)

tf_dt <- data.table("ID" = ID,"path" = path_)
tf_dt

   ID       path
1:  1 fffftttfff
2:  2 fttttttfff
3:  3 tttttttttt
4:  4 fffffffftf

dt_raw <- tf_dt[,-1]
s <- paste0(as.vector(t(dt_raw)), collapse = "")
v <- substring(s,seq(1,nchar(s)-9,10), seq(10,nchar(s),10))
idx <- grep("^f*f.+t",v)
dt_final <- data.frame("ID" = tf_dt$ID, count = FALSE, time = NA)
dt_final$count[idx] <- TRUE
dt_final$time[idx] <- ???

What I reckon I should do is remove the leading run of f and everything that comes after the first run of t. However, I am not sure how to do that. Any help is appreciated.

My attempt:

nchar(gsub("^f*f","",gsub("something that relates to the end of the string","",v)))

More attempts:

#If I do gsub("^f*f+t*","",v) it gives me the last string that I want to remove
#But I cant do something like
nchar(gsub("^f*f","",gsub("gsub("^f*f+t*","",v)$",""v)))

Expected Output:

tf_count <- c(TRUE,TRUE,FALSE,TRUE)
tf_time <- c(3,6,NA,1)
output <- data.table("ID" = ID, "count" = tf_count,"time_taken" = tf_time)

#     ID count time_taken
# 1:  1  TRUE          3
# 2:  2  TRUE          6
# 3:  3 FALSE         NA
# 4:  4  TRUE          1

Also, as a side note, is there somewhere I can look at a lot of examples of how grep() and stringr work? (From what I have seen, I think this falls under stringr?) I tried reading up on this, but nothing really came of it, and I am still just as confused as before. Thanks.

What would be your expected output for input like `"fffffffftftt"` ? Should it be 1 or 3? — Ronak Shah, Dec 09 '19 at 07:06

score 3 · Accepted Answer · answered Dec 09 '19 at 08:15

A solution in base R using grepl and gsub, as you have already tried in the question.

tf_count <- grepl("^f+t+", tf_dt$path)
tf_time <- nchar(gsub("^f+(t+).*","\\1",tf_dt$path))
tf_time[!tf_count]  <- NA
output <- data.frame("ID" = ID, "count" = tf_count,"time_taken" = tf_time)
output
#  ID count time_taken
#1  1  TRUE          3
#2  2  TRUE          6
#3  3 FALSE         NA
#4  4  TRUE          1

answered Dec 09 '19 at 08:15

GKi


Let me try to understand this as much as I can, because I am unsure about parts of the code. In the first line, we look for strings that start with at least one `f` followed by at least one `t`. In the second line, we use `gsub` to substitute the aforementioned pattern, and then the `.` refers to any character that is not `t`(?), replaced with `""`. Then it gets to the part I don't quite understand: what does `"\\1"` do? I understand everything after this; it's just this part that is escaping me. Thanks – Kazusa12345 Dec 09 '19 at 22:00

`\\1` writes what was found in `(t+)`. – GKi Dec 10 '19 at 07:23
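To make the role of `\\1` concrete, here is a small sketch in base R using the paths from the question (the variable names `x` and `captured` are only for illustration). In the pattern, `.` matches any character, so `.*` swallows the rest of the string after the first run of `t`; the whole match is then replaced by the text captured in the parentheses. A string that does not match the pattern is returned unchanged, which is presumably why the answer sets those entries to `NA` afterwards.

```r
# Paths taken from the question
x <- c("fffftttfff", "fttttttfff", "tttttttttt", "fffffffftf")

# "^f+"  : one or more f at the start of the string
# "(t+)" : the first run of t, captured as group 1
# ".*"   : everything that follows (any characters)
# "\\1"  : replace the whole match with the contents of group 1
captured <- gsub("^f+(t+).*", "\\1", x)
captured
# [1] "ttt"        "tttttt"     "tttttttttt" "t"

# The third string does not start with f, so it is left untouched
# and its length (10) is not a meaningful run length.
nchar(captured)
# [1]  3  6 10  1
```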

Ronak Shah · Answer 2 · 2019-12-09T06:59:02.400

2

One way would be to count the t's that follow the leading run of f's, which can be achieved by:

library(data.table)

tf_dt[, time_taken:= NA_integer_]
tf_dt[grep('^f', path), time_taken := nchar(sub('^f*(t{1,}).*', '\\1',path))]
tf_dt

#   ID       path time_taken
#1:  1 fffftttfff          3
#2:  2 fttttttfff          6
#3:  3 tttttttttt         NA
#4:  4 fffffffftf          1

edited Dec 09 '19 at 06:59

answered Dec 09 '19 at 06:32

Ronak Shah


Zhiqiang Wang · Answer 3 · 2019-12-10T01:05:03.497

2

If you are interested in a stringr & tidyverse solution, try the following code. I borrowed a piece of code, "^f*(t{1,})", from Ronak Shah's excellent answer:

library(dplyr)
library(stringr)

tf_dt %>% 
  mutate(count = str_detect(path, "ft"),
         time_taken = ifelse(count, str_count(str_extract(path, "^f*(t{1,})"), "t"), NA))

edited Dec 10 '19 at 01:05

answered Dec 09 '19 at 06:37

Zhiqiang Wang


One issue with this solution is that it would fail for strings like `"fffffffftftt"`. If I have understood the OP correctly, they would need the answer to be 1 in such a case. But I may be wrong. – Ronak Shah Dec 09 '19 at 06:54
@RonakShah That's true. I am not clear from the OP that this should be `1` or `3`. The code will return `3` for this case. The code can be fixed if `1` is expected. – Zhiqiang Wang Dec 09 '19 at 07:00
@ZhiqiangWang how might one do so in `stringr()` if I expect 1 in such a case? I am not too familiar with `tidyverse` & `stringr` – Kazusa12345 Dec 09 '19 at 21:50
`str_count` of `stringr` works with a regular expression pattern. I borrowed a piece of code from @RonakShah, and edited my answer. – Zhiqiang Wang Dec 10 '19 at 00:56
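The difference between the two readings of `"fffffffftftt"` discussed in these comments can be sketched in base R as follows. The earlier version of the answer is not shown in this thread, so the "count every t" variant below is only an assumption about what returned `3`; the names `first_run` and `all_t` are illustrative.

```r
x <- "fffffffftftt"

# Reading 1: length of the first run of t that follows the leading f's.
# The capture group keeps only that first run; ".*" drops the rest.
first_run <- nchar(sub("^f+(t+).*", "\\1", x))

# Reading 2 (assumed): total number of t's anywhere in the string,
# obtained by deleting every character that is not t.
all_t <- nchar(gsub("[^t]", "", x))

first_run
# [1] 1
all_t
# [1] 3
```

Anchoring the pattern at the start of the string and capturing only the first `t+` is what restricts the count to the first run, which matches the expected output in the question.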
