I am trying to count the number of sequences in a data set that go from A immediately to C and then, after some time in C, go to L. I want to count how many times this occurs and the average time it takes, measured in time periods, which are labelled time_1, time_2, ... etc.

Say that in R I have a data frame with columns ID, t_1, t_2, t_3, ..., where each time column can take the values A, C and L, and that I have a huge amount of data. How can I find the number of sequences that start with A, are immediately followed by C, and then, after any amount of time (moving along the columns for an individual), arrive at state L?

Here is what I have so far.

Let's say the data is `path`, which describes the path that each person (identified by ID) goes through at each time point.

My attempt at solving the problem is shown in the example below. It is extremely inefficient, as I need to enumerate every case across all the time points. How can one achieve this efficiently in R? Thank you! :)

For example:

ID <- c("i_1", "i_2", "i_3", "i_4")
t_1 <- c("A","C","A","C")
t_2 <- c("C","A","C","L")
t_3 <- c("L","C","L","L")
t_4 <- c("C","L","L","L")

path <-data.frame("ID" = ID, "t_1" = t_1, "t_2"=t_2, "t_3" = t_3, "t_4" = t_4)
path

diff_path_01 <- path[path$t_1 =="A" & path$t_2 == "C" &path$t_3 == "L",]
diff_path_01
diff_path_02 <- path[path$t_1 =="A" & path$t_2 == "C" &path$t_3 == "C" & path$t_4 == "L",]
diff_path_02
diff_path_03 <- path[path$t_2 =="A" & path$t_3 == "C" &path$t_4 == "L",]
diff_path_03
row(diff_path_03)

count <- nrow(diff_path_01)+nrow(diff_path_02) +nrow(diff_path_03)
count

So `count` is the number of A > C > L sequences. However, I am not sure how to compute the average time it takes. I know that I should be counting the C elements between A and L, but I don't know how to implement that.

Hope someone can help, thank you!

Kazusa12345
  • Please check your sample data. Using `<-` inside `data.frame` is not doing what you think it's doing. To assign a vector to a column the syntax is `data.frame(name_of_column = column_vector)` (note the `=` instead of `<-`). I'm also not quite clear on what you're trying to do. Please include your expected output for the sample data you give. – Maurits Evers Nov 29 '19 at 04:53
  • I dont have an expected output that i could do from R cause I dont know how to do it in R. What I am attempting here is count the number of paths from A immediately to C at next time point than after some time to L. What I am doing is count the number of these transitions and the average time it takes to go through this process. And also yeah the data.frame part is mb – Kazusa12345 Nov 29 '19 at 05:16
  • Also in case you need more information this is the data of say 5 people, going through different states in each state_i so it forms a path for the 10 time periods – Kazusa12345 Nov 29 '19 at 05:18
  • Please don't include critical information in comments (comments are transient); instead [edit](https://stackoverflow.com/posts/59098302/edit) your post. Even if you *"dont know how to do it in R"* you still need to include your **final & expected output**. You can manually construct the final `data.frame` or `list` (or whatever the expected format is). This will help us understand what you're trying to do. I also suggest reviewing [how to make a great reproducible example in R](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Maurits Evers Nov 29 '19 at 05:25
  • I already specified the final output that i would expect in the original edit .... Also there is no critical information, I just resaid the same thing i said above in a different way – Kazusa12345 Nov 29 '19 at 05:29
  • So is `diff_path` your expected output? If so, this is not clear *at all* from your post. – Maurits Evers Nov 29 '19 at 05:36
  • My expected output is 5 for the count and 13/5 for the time it takes for the sequence. diff_path is my attempt at trying to do this problem, however it is very inefficient for large sample sets as it has to much combinations. – Kazusa12345 Nov 29 '19 at 05:38
  • Since if you look at my diff_path function right, it shows to only show the data in which at time 1 = A, time 2 = C than at time 10 =L. I can do this manually to specify all the combinations that gives the sequence A -> C -> L, but it would be tedious, and i assume there is a better way in R to do this – Kazusa12345 Nov 29 '19 at 05:41
  • Ok I've got no idea what you're after and you don't seem to (want to) understand what I'm trying to tell you. So I will leave it at that. Please read the post on what we expect from you in terms of providing a minimal & reproducible example plus a clear problem statement. **Provide your expected output; not in prose, but as the expected R object!** Make it as easy for others to help! – Maurits Evers Nov 29 '19 at 05:44
  • What do you mean by expected R object? I gave you the expected output of the given dataset already, that is the expected result that you should get. It is true that my problem statement aint that clear, thats mb. But what else am i supposed to include? – Kazusa12345 Nov 29 '19 at 05:50
  • Do you need `library(stringr);apply(path[-1], 1, function(x) str_count(paste(rle(x)$values, collapse=""), "ACL"))` – akrun Nov 29 '19 at 05:52
  • I have provided a mock data.frame already though? It is path, I already created that data frame when you first ask me to? – Kazusa12345 Nov 29 '19 at 05:57
  • ahhh i get what you mean now, you want me to attempt the question in hand through R, and have the output in R and not me manually doing the output by hand – Kazusa12345 Nov 29 '19 at 05:57
  • Is this what you mean by expected output? – Kazusa12345 Nov 29 '19 at 06:16

2 Answers

One way to do it is to create a single string for each row containing the complete sequence. You can then use `str_extract_all()` to extract all occurrences of the specified sequence from this string.

As an example, I used a different vector from yours to show more occurrences:

library(stringr)

x <- c("A", "C", "L", "A", "L", "A", "C", "C", "L", "A", "C", "A", "L", "A", "C", "C", "C", "L")
ACL <- unlist(str_extract_all(paste0(x, collapse = ""), "AC+?L"))

ACL
#[1] "ACL"   "ACCL"  "ACCCL"

The pattern `AC+?L` is a regular expression that matches a sequence beginning with A and ending with L, with at least one C in between.

You can then easily extend this to your whole data set and calculate the number of occurrences and the average time it takes.

apply(path[, -1], 1, function(x) {
  ACL <- unlist(str_extract_all(paste0(x, collapse = ""), "AC+?L"))
  c(count = length(ACL), average_time = mean(nchar(ACL)))
})

#              [,1] [,2] [,3] [,4]
# count           1    1    1    0
# average_time    3    3    3  NaN
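
The per-row results above can also be pooled into a single overall count and average, which seems closer to what the question asks for. The following is only a minimal base R sketch under stated assumptions: the names `seqs`, `hits`, `total_count` and `mean_steps` are made up for illustration, and "time" is assumed to mean the number of steps from A to L (string length minus one), which may not be the definition you need.

```r
# Sample data from the question, kept as character columns.
path <- data.frame(
  ID  = c("i_1", "i_2", "i_3", "i_4"),
  t_1 = c("A", "C", "A", "C"),
  t_2 = c("C", "A", "C", "L"),
  t_3 = c("L", "C", "L", "L"),
  t_4 = c("C", "L", "L", "L"),
  stringsAsFactors = FALSE
)

# One string per individual, e.g. "ACLC" for the first row.
seqs <- apply(path[, -1], 1, paste0, collapse = "")

# All runs of A, one or more C, then L, pooled across every row.
hits <- unlist(regmatches(seqs, gregexpr("AC+L", seqs)))

total_count <- length(hits)

# Assumption: time = number of steps from A to L, i.e. nchar(match) - 1.
mean_steps <- if (total_count > 0) mean(nchar(hits) - 1) else NA_real_

total_count
mean_steps
```

If the time should instead be the number of periods spent in C, `nchar(hits) - 2` would be the corresponding quantity.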
Gilean0709

Not sure if this is what you want. Here `count` is TRUE if the path satisfies your rule, and `avgtime` counts the number of Cs between A and L.

# concatenate all alphabets in the data frame as a long string
s <- paste0(as.vector(t(path[-1])),collapse = "")

# split the long string `s` into a vector of sub-strings `v` (one element of 4 characters per row of the data frame)
v <- substring(s,seq(1,nchar(s)-3,4),seq(4,nchar(s),4))

# find the index where the sub-strings in `v` match the pattern `AC...L`
idx <- grep("AC.*?L",v)

# create data frame `df`
df <- data.frame(ID = path$ID,count = FALSE,avgtime = NA)

# assign the index in `count` to `TRUE` according to the matched search
df$count[idx] <- TRUE

# count the number of `C` by `nchar()` where the sub-string is in the form of `A...L`.
df$avgtime[idx] = nchar(gsub("[AL]","",gsub(".*?(A.*?L).*","\\1",v[idx])))
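
The splitting step with `substring()` and `seq()` can be looked at in isolation. This is a small illustrative sketch, not part of the code above: the string `s` is written out by hand from the question's sample data, and `width` and `starts` are hypothetical helper names. It assumes every row has the same number of time points.

```r
# Four rows of four time points, concatenated row by row.
s <- "ACLCCACLACLLCLLL"
width <- 4  # assumed number of time columns per row

# Start position of each row inside the long string: 1, 5, 9, 13.
starts <- seq(1, nchar(s), by = width)

# substring() is vectorised over start and end positions,
# so this returns one sub-string per row.
v <- substring(s, starts, starts + width - 1)
v
```

Deriving `width` from `ncol(path) - 1` rather than hard-coding 4 would make the same idea work for any number of time columns.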

Using different data (different from the one you posted) as an example:

path <- structure(list(ID = structure(1:4, .Label = c("i_1", "i_2", "i_3", 
"i_4"), class = "factor"), t_1 = structure(c(1L, 1L, 1L, 2L), .Label = c("A", 
"C"), class = "factor"), t_2 = structure(c(2L, 1L, 2L, 1L), .Label = c("A", 
"C"), class = "factor"), t_3 = structure(c(2L, 1L, 1L, 2L), .Label = c("C", 
"L"), class = "factor"), t_4 = structure(c(1L, 2L, 2L, 2L), .Label = c("C", 
"L"), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))

> path
   ID t_1 t_2 t_3 t_4
1 i_1   A   C   L   C
2 i_2   A   A   C   L
3 i_3   A   C   C   L
4 i_4   C   A   L   L

we get the result below:

> df
   ID count avgtime
1 i_1  TRUE       1
2 i_2  TRUE       1
3 i_3  TRUE       2
4 i_4 FALSE      NA
ThomasIsCoding
  • May I ask what does the `v <- substring` mean? I get that for s, it expands the data.frame in to a (horizontal?) vector, but I am not quite sure what the `substring` means. If you may can you explain how the `seq` and `nchar(s)` work. Thank you Secondly, what does the `.*?` mean, I understand that `*` represents zero or more times, but why is `?` once or none there and what does the `.` do? Lastly, say if C wasnt the only other option between A and L, say if there was another option, is it possible to do the same thing and add | in the middle? Thank you – Kazusa12345 Dec 02 '19 at 04:22
  • Also can this be expanded to including phrases instead of just letters? – Kazusa12345 Dec 02 '19 at 04:48
  • @Kazusa12345 I added comments in my code so you can read it easily. Regarding the regex, `.*?` indicates the shortest search that matches the pattern. For example, with `AC.*?L`, the search will be done when it sees the first `L`, e.g., `ACLLLLL` with `AC.*?L` gives you `ACL`. If with `AC.*L`, it searches the longest one, thus giving you `ACLLLLL`. – ThomasIsCoding Dec 02 '19 at 06:26
  • @Kazusa12345 I guess it might work for phrases, but depending on your exact case – ThomasIsCoding Dec 02 '19 at 06:26
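
The difference between the lazy `.*?` and the greedy `.*` described in the comments can be checked directly. This is a small sketch using base R's `regexpr()` and `regmatches()`; the variable names `x`, `lazy` and `greedy` are only for illustration, and `perl = TRUE` is passed here as a cautious choice of regex engine.

```r
x <- "ACLLLLL"

# Lazy: stop at the first L after "AC".
lazy <- regmatches(x, regexpr("AC.*?L", x, perl = TRUE))

# Greedy: run on to the last L in the string.
greedy <- regmatches(x, regexpr("AC.*L", x, perl = TRUE))

lazy    # "ACL"
greedy  # "ACLLLLL"
```

In these patterns `.` matches any single character, `*` repeats it zero or more times, and the trailing `?` makes that repetition as short as possible.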