Search for the nearest string

Question

A table with many rows, but for simplicity of the problem...

> df <- data.frame(V1=c("imp: abc","apple","","imp: xyz","","lemon","ball","bell"),
                 V2=c("1","2","3","4","5","6","7","8"),
                 V3=c("NA","7","NA","2,8","NA","NA","NA","NA"))

      V1               V2              V3
1  imp: abc             1              NA
2  apple                2              7
3                       3              NA
4  imp: xyz             4              2,8
5                       5              NA
6  lemon                6              NA
7  ball                 7              NA
8  bell                 8              NA

So what I am trying to achieve is search for the nearest string "imp:" in V1 based on V3 and V2. For example, V3 is 7, then go to row 7 and from there search for the nearest "imp:" in V1 which is nothing but imp: xyz present in 4 th row.

If there are multiple numbers in V3, for example V3 is 2, 8 then first go to row 2 and from there search for the nearest "imp:" in V1 which is nothing but imp: abc present in 1st row and again go to row 8 and from there search for the nearest "imp:" in V1 which is nothing but imp: xyz present in 4 th row.

Here's what output looks like:

      V1               V2              V3          V4
1  imp: abc             1              NA
2  apple                2              7           imp: xyz 
3                       3              NA
4  imp: xyz             4              2,8         imp: abc, imp: xyz
5                       5              NA
6  lemon                6              NA
7  ball                 7              NA
8  bell                 8              NA

NOTE: Search has to be upwards.

Luke C · Accepted Answer · 2017-06-27T13:47:50.550

1

Here's one solution, although it's not all that fast- if you have many rows it might not be the best way to go. Using your df as above, first make sure your NA values are not strings, and add an NA-filled column V4:

df$V3[df$V3 == "NA"] <- NA
df$V4 <- NA

Parse the index values from V3 into individual numerical values:

keys <- strsplit(as.character(df$V3), ",")

keysn <- lapply(keysn, as.numeric)

Edit: Forgot these two lines to assign the rownames from the original df to the list of lists. Add these in and it should work- if not, let me know again!

names(keysn) <- rownames(df)

keysn <- keysn[!is.na(keysn)]

Then, use lapply to iterate over the list of lists of numerical indices, and take the V1 value of the max row number from row 1 up to the current index. For multi values (eg, 2,8), paste those values together before returning.

lx<-lapply(keysn, function(x){
  xout <-lapply(x, function(y){
    rn <- unlist(as.numeric(row.names(df[1:y,])))
    matchvals <- rn[which(grepl("imp", as.character(df$V1)))]
    matchvals <- matchvals[which(!is.na(matchvals))]
    maxmatch <- max(matchvals)
    out_text <- as.character(df$V1[maxmatch])
  })
  xout <- paste(xout, collapse = ", ")
  return(unlist(as.character(xout)))
})

> lx
$`2`
[1] "imp: xyz"

$`4`
[1] "imp: abc, imp: xyz"

Finally, use the names of the list items to update the V4 column with the appropriate values.

df$V4[as.numeric(names(lx))] = lx

> df
        V1 V2   V3                 V4
1 imp: abc  1 <NA>                 NA
2    apple  2    7           imp: xyz
3           3 <NA>                 NA
4 imp: xyz  4  2,8 imp: abc, imp: xyz
5           5 <NA>                 NA
6    lemon  6 <NA>                 NA
7     ball  7 <NA>                 NA
8     bell  8 <NA>                 NA

edited Jun 27 '17 at 13:47

answered Jun 26 '17 at 20:48

Luke C

10,081
1
14
21

it gives me this error " Error in 1:y : NA/NaN argument " – ETA123 Jun 27 '17 at 12:52
With your full data set or with the example one? – Luke C Jun 27 '17 at 12:54
example one same as above – ETA123 Jun 27 '17 at 13:08
Sorry, I dropped a few lines somehow. Edited answer to include those, please let me know if it still doesn't work! – Luke C Jun 27 '17 at 13:46
If I want the grep function to search for strings exactly, what can I do? for example, string = c("apple", "apples", "applez") grep("apple", string) would return all three, but just wanted one with exactly "apple". I know this is possible by using "word boundary \b", it works in the **string ** example but, in our case, what we can do ? – ETA123 Jun 27 '17 at 18:44
Were you looking at [this answer](https://stackoverflow.com/questions/26813667/how-to-use-grep-to-find-exact-match) for the "apples" "applez" example? The anchors given there by Avinash Raj should work: for example, change `"imp"` in the answer above to `"^imp: abc&"` to only search for exact matches to "imp: abc", down to the spaces. What was the string you were looking to match? – Luke C Jun 27 '17 at 19:05
lapply(V4, function(x) df$V2[grep(x,df$V1, fixed = TRUE)])) I have a line as shown above, So I cannot really use "^x&", right? In such cases how can I achieve the same? – ETA123 Jun 27 '17 at 19:10
I don't know that I understand- I don't know what V4 is in this case. But, if you are iterating over a list of strings, you could use the anchor by using `paste` as in the other answer ([linked here](https://stackoverflow.com/questions/26813667/how-to-use-grep-to-find-exact-match)). You may want to ask a new question or edit your existing one for clarity (and it's much easier to answer to that). – Luke C Jun 27 '17 at 19:26

Search for the nearest string

1 Answers1