I have a list ("x") of 1000 smaller vectors of 1 line each. These sub-vectors contain strings and numbers. One of the lines contains an "id: XXXX" field embedded within a string. I tried the following piece of code in R to combine successive vectors within the list when only the first 2 vectors are considered (i.e. x[[i]] and x[[i+1]]).


first_vec<-c("Page 1 of 1000", "Report of vectors within a list", "id: 1234     height: 164 cms", "health: good")

second_vec<-c("Page 2 of 1000", "Report of vectors within a list", "id: 1235     height: 180 cms", "health: moderate")

third_vec<-c("Page 3 of 1000", "Report of vectors within a list", "id: 1235     weight: 200 pounds", "health: moderate")

x<-list(first_vec, second_vec, third_vec)
X <- for (i in i:unique(length(x))) {
  t1 <- unlist(stringr::str_extract_all(x[[i]][!is.na(sample)], "(id: [0-9]+)"))
  t2 <- unlist(stringr::str_extract_all(x[[i + 1]][!is.na(sample)], "(id: [0-9]+)"))
  if (t1 == t2) {
    c(x[[i]], x[[i + 1]])
  }
}

The desired result is:

 x <- list(first_vec, c(second_vec, third_vec))

This works for me when I have just two subvectors. However, I have a list of 1000 vectors. How can I loop the above piece of code across all the vectors within the list x?

At the moment I get the following error message: Warning in is.na(sample) : is.na() applied to non-(list or vector) of type 'closure' Error in x[[i + 1]] : subscript out of bounds

I have included an example of a typical input that I am applying the code to. In the example above, I would like to combine pages 2 and 3, since the ids match.

2 Answers

Without knowing your data, you can 1) extract the ids and 2) look for successive matching ids like this:

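library(stringr)
xx <- unique(x)
# loop over the xx list and extract one id per vector
# (each vector is collapsed to a single string so that exactly one id comes back)
ids <- sapply(xx, function(s) str_extract(paste(s, collapse = " "), "id: [0-9]+"))

# filter for successive values; dplyr::lag() shifts the vector by one position,
# whereas stats::lag() leaves a plain vector unchanged
suc_ids <- ids[which(ids == dplyr::lag(ids))]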
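
Going one step further than locating the repeated ids, the matching pages can be merged while keeping all of their lines. The following is only a sketch on the example data from the question, not part of the original answer. It assumes every vector holds at most one id; the names `new_group`, `group` and `merged` are made up for illustration.

```r
library(stringr)

# Example data copied from the question
first_vec  <- c("Page 1 of 1000", "Report of vectors within a list", "id: 1234     height: 164 cms", "health: good")
second_vec <- c("Page 2 of 1000", "Report of vectors within a list", "id: 1235     height: 180 cms", "health: moderate")
third_vec  <- c("Page 3 of 1000", "Report of vectors within a list", "id: 1235     weight: 200 pounds", "health: moderate")
x <- list(first_vec, second_vec, third_vec)

# One id per list element: collapse each vector to a single string
# and take the first match (NA if the vector has no id)
ids <- vapply(
  x,
  function(s) str_extract(paste(s, collapse = " "), "id: [0-9]+"),
  character(1)
)

# A new group starts whenever the id differs from the previous element's id
new_group <- c(TRUE, ids[-1] != ids[-length(ids)])
# Assumption: an element without an id is never merged with its neighbour
new_group[is.na(new_group)] <- TRUE
group <- cumsum(new_group)

# Concatenate the vectors that fall into the same group
merged <- unname(lapply(split(x, group), unlist))
```

On the example data, `merged` should be identical to `list(first_vec, c(second_vec, third_vec))`. Because the grouping relies on `cumsum()` over the changes in id, only neighbouring pages are merged; pages that share an id but are not adjacent stay separate.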
David

Here's my understanding of your problem and a solution to it: You have a list of single-string vectors and want to concatenate those substrings that match a pattern. If that's correct then this should work:

Data:

a <- "id: 20"
b <- "something id: 333some more"
c <- "some other stuff without id"
d <- "some stuff id: 346999 and more stuff"
x <- list(a,b,c,d)

unlist(stringr::str_extract(x, "id: [0-9]+"))
[1] "id: 20"     "id: 333"    NA           "id: 346999"

or (perhaps):

paste0(unlist(stringr::str_extract(x, "id: [0-9]+")), collapse = ", ")
"id: 20, id: 333, NA, id: 346999"

Based on OP's updated data:

paste0(unlist(stringr::str_extract_all(x, "Page \\d+")), " ", unlist(stringr::str_extract_all(x, "id: [0-9]+")), collapse = ", ")
[1] "Page 1 id: 1234, Page 2 id: 1235, Page 3 id: 1235"
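
If the surrounding strings have to stay intact, a loop closer to the one in the question may be easier to follow. In the question's loop, `sample` is the base R function of that name (hence the 'closure' warning), `x[[i + 1]]` runs past the end of the list on the last iteration, and a `for` loop itself returns NULL, so nothing is stored in `X`. The sketch below avoids these by growing a result list instead of indexing `i + 1`. It is a rough illustration on the question's example data; `get_id` and `merged` are hypothetical names.

```r
# Example data copied from the question
first_vec  <- c("Page 1 of 1000", "Report of vectors within a list", "id: 1234     height: 164 cms", "health: good")
second_vec <- c("Page 2 of 1000", "Report of vectors within a list", "id: 1235     height: 180 cms", "health: moderate")
third_vec  <- c("Page 3 of 1000", "Report of vectors within a list", "id: 1235     weight: 200 pounds", "health: moderate")
x <- list(first_vec, second_vec, third_vec)

# Hypothetical helper: first "id: <digits>" found anywhere in a vector, or NA
get_id <- function(v) {
  stringr::str_extract(paste(v, collapse = " "), "id: [0-9]+")
}

merged <- list()
for (v in x) {
  n  <- length(merged)
  id <- get_id(v)
  if (n > 0 && !is.na(id) && identical(get_id(merged[[n]]), id)) {
    # Same id as the last stored element: append this page to it
    merged[[n]] <- c(merged[[n]], v)
  } else {
    # Different id (or no id at all): start a new element
    merged[[n + 1]] <- v
  }
}
```

Here `merged` keeps every line of every page, and the loop works for any list length because it only ever looks back at the last stored element.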
Chris Ruehlemann
  • Brilliant, Thanks. Could I edit the paste0 command above in a manner, that the remaining string around the "id" variable remains intact? This would mean that x[[1]] in my example would be page 1 and x[[i+1]] would be a combination of pages 2 and 3. Many Thanks – biostats_guy Mar 01 '21 at 16:12
  • How is your data really structured? For me to answer this question you'd have to post a snippet of your data; try using `dput(head())` – Chris Ruehlemann Mar 01 '21 at 16:20
  • Apologies, I could not share via dput as I have confidentiality issues hindering me. – biostats_guy Mar 01 '21 at 16:55
  • then just make up some data that resemble yours! – Chris Ruehlemann Mar 01 '21 at 17:02
  • Thanks, I have edited the question. Apologies for the inconvenience – biostats_guy Mar 01 '21 at 17:41
  • Have edited the answer. This what you need? – Chris Ruehlemann Mar 01 '21 at 18:00
  • Thanks a lot. What I am trying to achieve is to end up with a similar list to the "x" list that I have generated above, except that element 3 or x[[3]] would now be appended to x[[2]]. So I would end up with an x which would be a list of (first vector and a combination of (second and third vectors)). – biostats_guy Mar 01 '21 at 18:14
  • I mean why describe at length what you want to have and thereby risking being misunderstood or not understood at all rather than just, as is customary on SO, post the desired result of your updated data? – Chris Ruehlemann Mar 01 '21 at 18:25
  • Apologies for the misunderstanding, I have not been posting questions for a while and have clearly been out of the loop with the usual etiquette. I have edited the question. The final output needs to look like: x<-list(first_vec, c(second_vec, third_vec) – biostats_guy Mar 01 '21 at 18:37
  • `x<-list(first_vec, c(second_vec, third_vec)` is not a result but an operation. – Chris Ruehlemann Mar 01 '21 at 18:40