
I need to write a search function that finds the start and end positions of certain elements in a large dataset using R.

My sample dataset looks like this:

C1   C2  Index
aa   J    1   
aa   J    2
aa   J    3
ab   O    4
aa   O    5
aa   J    6
aa   J    7
aa   J    8
aa   J    9
aa   K    10
ac   K    11
aa   J    12
aa   J    13
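
For anyone who wants to reproduce this, the sample can be typed in as a data frame. This is only one way to build it; the name `test` is chosen to match the code in the question, and `stringsAsFactors = FALSE` is an assumption so that the columns stay plain character vectors.

```r
# Sample data from the table above. The name `test` matches the
# data frame name used in the question's code.
test <- data.frame(
  C1 = c("aa", "aa", "aa", "ab", "aa", "aa", "aa", "aa", "aa", "aa", "ac", "aa", "aa"),
  C2 = c("J", "J", "J", "O", "O", "J", "J", "J", "J", "K", "K", "J", "J"),
  Index = 1:13,
  stringsAsFactors = FALSE  # assumption: keep C1 and C2 as character, not factor
)
```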

I want to write a search function like search("aa","J") (where "aa" is a value from column C1 and "J" is a value from column C2). The function should first subset the dataset to the rows where C1 is "aa", then report indices relative to that subset.

The result should be a matrix holding the start and end index of every run found, like below:

        [,1]   [,2]
[1,]     1      3
[2,]     5      8
[3,]     10     11

Thank you very much.

I tried to modify the provided code, but there is an error. Could you please take a look?

get_inds <- function(test, C1, C2) {
   test <- subset(test, test$C1 == C1)
   inds <- rle(test$C1 == C1 & test$C2 == C2)
   end = cumsum(inds$lengths)
   start = c(1, head(end, -1) + 1)
   data.frame(start, end)[inds$values, ]
}

get_inds(test, 'aa', 'J')
user247704
  • In `search`, 1 is what you're looking for? – NelsonGon May 13 '19 at 10:05
  • Take a look at [Find start and end positions/indices of runs/consecutive values](https://stackoverflow.com/questions/43875716/find-start-and-end-positions-indices-of-runs-consecutive-values) – markus May 13 '19 at 10:06
  • Nope, 1 is value in C2 – user247704 May 13 '19 at 10:06
  • Sorry, I don't get it. Could you explain what `search("aa",1)` is doing? It seemed from the output that it was finding aa==1 in C2, no? – NelsonGon May 13 '19 at 10:18
  • Sorry for the confusion. I have edited the dataset to make it easier to understand. The search function takes two values as input, one from column C1 and one from column C2, and the output I expect is the index positions of all matches for that input. Hope that helps clarify. Thanks. – user247704 May 13 '19 at 10:26

1 Answer


The link provided by @markus solves your problem; you just need to adapt it to your requirements.

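get_inds <- function(test, a, b) {
   # Keep only the rows whose C1 matches. The arguments are named a and b
   # rather than C1 and C2 so that subset() does not confuse them with columns.
   test <- subset(test, C1 == a)
   # Run-length encode the match condition on the subset
   inds <- rle(test$C1 == a & test$C2 == b)
   # Cumulative run lengths give each run's end; each start follows the previous end
   end = cumsum(inds$lengths)
   start = c(1, head(end, -1) + 1)
   # Keep only the runs where the condition is TRUE
   df = data.frame(start, end)[inds$values, ]
   # Renumber the rows as 1, 2, 3, ...
   row.names(df) <- NULL
   df
}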

get_inds(test, 'aa', 'J')

#  start end
#1     1   3
#2     5   8
#3    10  11

You need to change the condition passed to rle and then drop the rows of the result where the condition is not satisfied.
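
To see what rle contributes here, it may help to run the steps by hand on a small vector. The sketch below assumes the C2 values of the "aa" rows from the sample data, typed out manually; `c2_sub`, `runs` and `res` are illustrative names and are not part of the answer's function.

```r
# C2 values of the rows where C1 == "aa" in the sample data (typed out by hand)
c2_sub <- c("J", "J", "J", "O", "J", "J", "J", "J", "K", "J", "J")

# Run-length encoding of the condition: consecutive TRUE/FALSE stretches
runs <- rle(c2_sub == "J")
runs$lengths   # 3 1 4 1 2
runs$values    # TRUE FALSE TRUE FALSE TRUE

# The cumulative lengths are the end positions; each start is the previous end + 1
end   <- cumsum(runs$lengths)
start <- c(1, head(end, -1) + 1)

# Keep only the runs where the condition holds; this gives a matrix
# with rows (1, 3), (5, 8) and (10, 11)
res <- cbind(start, end)[runs$values, , drop = FALSE]
res
```

Using cbind instead of data.frame returns a matrix, which is closer to the output format shown in the question.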

Ronak Shah
  • @Ronak Shah Thank you very much for your solution. I'm trying to remove the row numbers of the result data frame (#1, #3, #5) or change them to #1, #2, #3. Do you have any idea how to do it? Thank you very much for your help. – user247704 May 13 '19 at 11:21
  • @user247704 Sure, I have updated the answer. Please have a look. – Ronak Shah May 14 '19 at 00:40
  • @Ronak Shah Thank you very much for helping me through this. I know it is only a few lines of code, but I just cannot get it to work :( I'm learning from your code. I have edited my question because I realized that the dataset needs to be subset by the condition on column C1 (which is "aa") first, before searching for the start and end positions, and the result should be the indices within that subset. So the desired result should be #1 (1, 3); #2 (5, 8); #3 (10, 11) instead. In your code, I added a subset call, but it does not work. Could you please help? – user247704 May 14 '19 at 02:12
  • @user247704 I have updated the answer accordingly. Can you check? – Ronak Shah May 14 '19 at 02:20
  • Finally got it! Thank you for teaching me a good lesson about naming variables. Just changing the names makes it work perfectly :) – user247704 May 14 '19 at 02:45
  • Done and thanks once again! – user247704 May 14 '19 at 03:24
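
The naming issue mentioned in the comments can be reproduced in a few lines. This is a minimal sketch on a made-up three-row data frame; `df`, `shadowed` and `renamed` are hypothetical names used only for illustration. Inside subset(), names are looked up in the data frame's columns first, so an argument called C1 is hidden by the column C1.

```r
# Tiny made-up data frame, for illustration only
df <- data.frame(C1 = c("aa", "ab", "aa"), C2 = c("J", "J", "K"))

# Argument named like the column: inside subset(), C1 refers to the column,
# so the condition compares the column with itself and is always TRUE
shadowed <- function(d, C1) subset(d, d$C1 == C1)

# Argument with a different name: a is not a column, so it is found
# in the function's own environment
renamed <- function(d, a) subset(d, C1 == a)

nrow(shadowed(df, "aa"))  # 3: nothing was filtered out
nrow(renamed(df, "aa"))   # 2: only the "aa" rows remain
```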