I have a number of strings (each about 1-4 words long) stored in a Large Character Object (38506 elements in total) and a set of 10 texts (about 100 words each) stored in a chr object. Each text might contain one or more of the strings from the Large Character Object.
Now I would like to extract the possible matches from the set of texts for every string.
I already tried the following, with "a4" being the set of texts (chr object) and "t" being the Large Character Object:
i <- 1
while(i < 38506){
  matches <- str_extract(a4, t[i])
  i <- i + 1
}
However, after the operation, the object "matches" contains only 10 "NA"-elements, although there are definitely a few matching strings in some of the texts.
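One thing that may explain part of this: inside the loop, "matches" is overwritten on every pass, so afterwards it only holds the result for the last string that was tried (and with i < 38506 the final element is never tried at all). Below is a minimal sketch of keeping one result per string instead. The vectors "texts" and "patterns" are made-up placeholders rather than my real data, and base R's grepl with fixed = TRUE is used as a stand-in for str_extract, assuming literal matching is what is wanted.

```r
# Toy stand-ins for a4 (texts) and t (patterns); not the real data.
texts <- c("Hello, Hi There and Hey-ho!",
           "Can you please help me clean this mess?",
           "All the best!")
patterns <- c("please help", "Hey-ho!", "very nice")

# Overwriting version: only the result for the last pattern survives.
for (p in patterns) {
  matches <- ifelse(grepl(p, texts, fixed = TRUE), p, NA_character_)
}
print(matches)  # reflects "very nice" only, which occurs in none of the texts

# Keeping everything: one row per text, one column per pattern.
hits <- vapply(patterns,
               function(p) grepl(p, texts, fixed = TRUE),
               logical(length(texts)))
print(hits)
```

With the hit matrix, which(hits, arr.ind = TRUE) would list every (text, string) pair that matched.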
|| EDIT2:
Here's a reproducible example of what I am trying to do, with x representing the Large Character Object and z representing the set of texts.
Please note that the while loop currently does not produce the outcome displayed below; the example illustrates what the result should look like.
The actual resulting object contains only 8 NA elements, so there must be some error in the loop, the str_extract call or the pmax function:
> x
[1] "Hey-ho!" "This is" "Just some random"
[4] "text" "I am trying to match" "please help"
[7] "very nice" "Thanks"
> z
[1] "My name is Thomas. This is my first project"
[2] "R is a cool tool"
[3] "Hello, Hi There and Hey-ho!"
[4] "Can you please help me clean this mess?"
[5] "All the best!"
[6] "Is there a way to get to London by train?"
i <- 1
while(i < length(x)){
extraction <- str_extract(z, x[i])
resulting <- pmax(resulting, extraction)
i <- i + 1
}
> resulting
[1] "This is" NA "Hey-ho!" "please help" NA NA
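For reference, the result I am after can be produced on the toy data with a sketch like the following. It assumes literal (non-regex) matching is wanted, starts from an explicitly initialised all-NA result, and only fills positions that are still empty, so the first matching string wins. It sticks to base R: because a literal match is always identical to the search string, ifelse(grepl(..., fixed = TRUE), ...) should behave like str_extract(z, fixed(x[i])) from stringr here.

```r
x <- c("Hey-ho!", "This is", "Just some random", "text",
       "I am trying to match", "please help", "very nice", "Thanks")
z <- c("My name is Thomas. This is my first project",
       "R is a cool tool",
       "Hello, Hi There and Hey-ho!",
       "Can you please help me clean this mess?",
       "All the best!",
       "Is there a way to get to London by train?")

# Start with "no match" for every text, keeping the order of z.
resulting <- rep(NA_character_, length(z))

# seq_along(x) visits every element, including the last one.
for (i in seq_along(x)) {
  # Literal match: the extracted value equals the search string itself.
  extraction <- ifelse(grepl(x[i], z, fixed = TRUE), x[i], NA_character_)
  # Fill only the positions that have no match yet.
  resulting <- ifelse(is.na(resulting), extraction, resulting)
}
print(resulting)
```

For the toy data this gives "This is", NA, "Hey-ho!", "please help", NA, NA.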
If someone wants to try exactly what I am doing, I have uploaded my actual data to a Dropbox folder: https://www.dropbox.com/sh/2y7ogjxk1glddh1/AADrDveQguzChaaXXIeLfmIfa?dl=0
I read the files into R like this:
a4 <- readLines(file.path(".","a4.txt"))
t <- readLines(file.path(".","LargeCharacterObject.txt"))
Due to some formatting issues, the following replacements should be applied before trying to match the strings:
a4 <- gsub('Ãœ', 'Ü', a4)
a4 <- gsub('Ã„', 'Ä', a4)
a4 <- gsub('ÃŸ', 'ß', a4)
a4 <- gsub('Ã¤', 'ä', a4)
a4 <- gsub('Ã¼', 'ü', a4)
a4 <- gsub('Ã¶', 'ö', a4)
a4 <- gsub('Ã–', 'Ö', a4)
t <- gsub('Ãœ', 'Ü', t)
t <- gsub('Ã„', 'Ä', t)
t <- gsub('ÃŸ', 'ß', t)
t <- gsub('Ã¤', 'ä', t)
t <- gsub('Ã¼', 'ü', t)
t <- gsub('Ã¶', 'ö', t)
t <- gsub('Ã–', 'Ö', t)
t <- gsub('\\', '', t, fixed = TRUE)
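Since the same replacements are applied to both objects, they could also be kept in one lookup table and applied through a small helper. This is only a sketch: "fix_umlauts" and "umlaut_fixes" are names I made up, and the table assumes the garbling comes from UTF-8 text that was decoded as Windows-1252. If that assumption holds, reading the files with readLines(..., encoding = "UTF-8") might avoid the problem altogether, but I have not verified that for these files.

```r
# Garbled sequence -> intended character
# (assumption: UTF-8 bytes that were decoded as Windows-1252).
umlaut_fixes <- c("Ãœ" = "Ü", "Ã„" = "Ä", "ÃŸ" = "ß", "Ã¤" = "ä",
                  "Ã¼" = "ü", "Ã¶" = "ö", "Ã–" = "Ö")

# Apply every literal replacement in the lookup table to a character vector.
fix_umlauts <- function(v, fixes = umlaut_fixes) {
  for (bad in names(fixes)) {
    v <- gsub(bad, fixes[[bad]], v, fixed = TRUE)
  }
  v
}

print(fix_umlauts(c("GrÃ¶ÃŸe", "Ãœbung", "plain text")))
```

The helper would then be called once per object, e.g. a4 <- fix_umlauts(a4) and t <- fix_umlauts(t).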
EDIT2 END ||
Do I somehow need to wrap t[i] in a regex pattern? Is this even feasible?
Or am I using the wrong type of objects / the wrong extraction method?
I am grateful for any hints or ideas.
Thanks
EDIT
I forgot to mention earlier that the elements of the array should stay in the same order and should also include the elements without matches, so the result should look something like:
[1] "NA" "NA" "a" "NA" "b" "NA"
I already tried this:
i <- 1
while(i < 38506){
  matches <- str_extract(a4, t[i])
  result <- pmax(matches, result)
  i <- i + 1
}
But somehow "result" also contains only 10 "NA" elements after execution.
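A possible explanation that I have not ruled out: pmax returns NA for a position as soon as any of its inputs is NA there, unless na.rm = TRUE is set. In addition, "result" has to exist before the first pass through the loop. A tiny sketch with made-up values:

```r
# Made-up values standing in for one loop iteration.
matches <- c("a", NA, NA)
result  <- c(NA, NA, "b")

# Default behaviour: an NA in either input makes that position NA.
default_max <- pmax(matches, result)
print(default_max)

# With na.rm = TRUE, NA only remains where both inputs are NA.
narm_max <- pmax(matches, result, na.rm = TRUE)
print(narm_max)
```

If two different strings match the same text, pmax would keep the one that sorts last alphabetically, which may or may not be what is wanted.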