
I have a set of Strings (each about 1-4 words long) stored in a Large Character Object (38506 elements in total) and a set of 10 texts (about 100 words each) stored in a chr-Object. Each text might contain one or more of the Strings from the Large Character Object.

Now I would like to extract possible matches from the text set for every String.

I already tried the following, with "a4" being the set of texts (chr-object) and "t" being the Large Character Object:

i = 1
while(i < 38506){
    matches <- str_extract(a4, t[i])
    i <- i + 1
}

However, after the operation, the object "matches" contains only 10 "NA"-elements, although there are definitely a few matching strings in some of the texts.

|| EDIT2:

Here is a reproducible example of what I am trying to do, with x representing the Large Character Object and z representing the set of texts.

Please note that the while-loop currently does not produce the outcome displayed below; the example illustrates what the result should look like.

The actual resulting object contains only 8 NA elements, so there must be some error in the loop, in str_extract, or in the pmax function:

> x
[1] "Hey-ho!"              "This is"              "Just some random"    
[4] "text"                 "I am trying to match" "please help"         
[7] "very nice"            "Thanks"

> z
[1] "My name is Thomas. This is my first project"
[2] "R is a cool tool"  
[3] "Hello, Hi There and Hey-ho!"
[4] "Can you please help me clean this mess?"    
[5] "All the best!" 
[6] "Is there a way to get to London by train?"


i <- 1
while(i < length(x)){
    extraction <- str_extract(z, x[i])
    resulting <- pmax(resulting, extraction)
    i <- i + 1
}

> resulting
[1] "This is"     NA            "Hey-ho!"     "please help" NA            NA

If someone wants to try exactly what I am doing, I have uploaded my actual data into a dropbox folder: https://www.dropbox.com/sh/2y7ogjxk1glddh1/AADrDveQguzChaaXXIeLfmIfa?dl=0

I read the files into R like this:

a4 <- readLines(file.path(".","a4.txt"))

t <- readLines(file.path(".","LargeCharacterObject.txt"))

Due to some formatting issues, the following replacements should be applied before trying to match the strings:

a4 <- gsub('Ãœ', 'Ü', a4)
a4 <- gsub('Ã„', 'Ä', a4)
a4 <- gsub('ÃŸ', 'ß', a4)
a4 <- gsub('Ã¤', 'ä', a4)
a4 <- gsub('Ã¼', 'ü', a4)
a4 <- gsub('Ã¶', 'ö', a4)
a4 <- gsub('Ã–', 'Ö', a4)

t <- gsub('Ãœ', 'Ü', t)
t <- gsub('Ã„', 'Ä', t)
t <- gsub('ÃŸ', 'ß', t)
t <- gsub('Ã¤', 'ä', t)
t <- gsub('Ã¼', 'ü', t)
t <- gsub('Ã¶', 'ö', t)
t <- gsub('Ã–', 'Ö', t)
t <- gsub('\\', '', t, fixed = TRUE)

EDIT2 END ||

Do I somehow need to wrap t[i] in a Regex-pattern? Is this even feasible? Or am I using the wrong type of objects / the wrong extraction method?

I am grateful for any hints or ideas.

Thanks

EDIT

I forgot to mention earlier that the elements of the array should stay in the same order and should also include the elements without matches, so the result should look something like:

[1] "NA" "NA" "a" "NA" "b" "NA"

I already tried this:

i = 1
while(i < 38506){
    matches <- str_extract(a4, t[i])
    result <- pmax(matches, result)
    i <- i + 1
}

But somehow "result" also contains only 10 "NA" elements after execution.

acylam
WebScraper

1 Answer

Putting aside other possible changes to your code, it is not doing what you expect because you are overwriting matches on each iteration rather than appending to it.

Thus, this will likely work for you.

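library(stringr)

matches <- character()   # must exist before the loop appends to it
i <- 1
while(i <= length(t)){   # <= so the last element of t is not skipped
    matches <- c(matches, str_extract(a4, t[i]))
    i <- i + 1
}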
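Since the question also asks for one result per text, in the original order and with NA where nothing matched, here is a minimal sketch of that variant on the small x/z example from the question. It is an assumption about what is wanted, not a tested solution for the full data: it uses base R's grepl with fixed = TRUE so that each String is treated as literal text rather than as a regular expression (the stringr counterpart would be str_extract(z, fixed(x[i]))), and it keeps the first String that matches each text.

```r
# Example data copied from the question.
x <- c("Hey-ho!", "This is", "Just some random", "text",
       "I am trying to match", "please help", "very nice", "Thanks")
z <- c("My name is Thomas. This is my first project",
       "R is a cool tool",
       "Hello, Hi There and Hey-ho!",
       "Can you please help me clean this mess?",
       "All the best!",
       "Is there a way to get to London by train?")

# One slot per text; a slot stays NA until some String matches that text.
resulting <- rep(NA_character_, length(z))

for (i in seq_along(x)) {
  # fixed = TRUE: x[i] is matched literally, not as a regular expression.
  hit <- grepl(x[i], z, fixed = TRUE)
  # Only fill slots that are still empty, so the first matching String wins.
  fill <- hit & is.na(resulting)
  resulting[fill] <- x[i]
}

resulting
```

On this example, resulting holds "This is", NA, "Hey-ho!", "please help", NA, NA, which is the outcome shown in the question. If a text can contain several Strings and all of them are needed, a list per text would be a better fit than a single character vector.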

To demonstrate with a reproducible example, here is an analogy to what you are currently doing.

matches <- character()
for(l in letters){
    matches <- l
}
matches
# [1] "z"

This is what you should be doing in this analogous example.

matches <- character()
for(l in letters){
    print(l)
    matches <- c(matches, l)
}
matches
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
# [23] "w" "x" "y" "z"
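Growing a vector with c() works, but the flat result no longer shows which String matched which text. A possible alternative, sketched here on a shortened version of the question's example, collects the hits per String in a named list. The names hits and found are illustrative only, and the matching is literal (fixed = TRUE), which is an assumption about the intended behaviour.

```r
# Shortened example data based on the question.
x <- c("Hey-ho!", "This is", "please help", "Thanks")
z <- c("My name is Thomas. This is my first project",
       "Hello, Hi There and Hey-ho!",
       "Can you please help me clean this mess?")

# For every String: the indices of the texts that contain it literally.
hits <- lapply(x, function(p) which(grepl(p, z, fixed = TRUE)))
names(hits) <- x

# Keep only the Strings that occur in at least one text.
found <- hits[lengths(hits) > 0]
found
```

Here found has one entry per matching String ("Hey-ho!", "This is", "please help"), each holding the positions of the texts in z that contain it, while "Thanks" is dropped because it occurs nowhere.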
jmuhlenkamp
  • Thank you for the detailed answer. I forgot to mention in my post that the elements of the array should stay in the same order and contain also the elements without matches, so the result should look something like: [1] "NA" "NA" "a" "NA" "b" "NA" I already tried this: `i = 1 while(i < 38506){ matches <- (str_extract(a4, t[i]) result <- pmax(matches, result) i <- i +1 }` But somehow "result" also contains only 10 "NA" elements after execution. – WebScraper Dec 19 '17 at 01:55
  • It will be difficult to help you further without a reproducible example. See here for more info: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example . Specifically, can you subset your data and post here? Otherwise you are not likely to get the true answer you're looking for. @WebScraper – jmuhlenkamp Dec 19 '17 at 02:06
  • Thanks for the feedback. I tried to add an example to my post, hope this helps. – WebScraper Dec 19 '17 at 13:00
  • Update: I also added a Link to a dropbox-folder containing the text-files with the original data sets I am trying to match. – WebScraper Dec 19 '17 at 13:59
  • Update2: I also tried the first piece of code that you provided: `i = 1 while(i < 38506){ matches <- c(matches, str_extract(a4, t[i])) i <- i +1 }` But it only returns "matches" as a LargeCharacterObject with 38506 elements, each of them containing 23 NA-elements. @jmuhlenkamp – WebScraper Dec 19 '17 at 14:24
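
One possible contributor to the all-NA result of the pmax variant mentioned in the comments: by default pmax returns NA wherever either argument is NA, so a single non-matching String is enough to wipe out an earlier match. The object passed as result also has to exist before the loop starts. The sketch below shows the difference on two toy vectors; default_merge, narm_merge and ifelse_merge are illustrative names, and whether this is the only problem with the real data is not confirmed here.

```r
# Toy stand-ins for one iteration's extraction and the running result.
extraction <- c("This is", NA, NA)
resulting  <- c(NA, NA, "Hey-ho!")

# Default behaviour: an NA on either side makes that position NA.
default_merge <- pmax(resulting, extraction)

# With na.rm = TRUE a position is NA only if both sides are NA.
narm_merge <- pmax(resulting, extraction, na.rm = TRUE)

# Alternative without pmax: keep the existing value, otherwise take the new one.
ifelse_merge <- ifelse(is.na(resulting), extraction, resulting)
```

With these inputs default_merge is NA in every position, while narm_merge and ifelse_merge both keep "This is" in the first slot and "Hey-ho!" in the third. Note that pmax on strings picks the alphabetically larger value when both sides are non-NA, whereas the ifelse version keeps whichever match was found first.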