
I am trying to use R to find which words in a list are Spanish. I have all the Spanish words in an Excel file (more than 80,000 words; I don't know how to attach it to the post), and I want to check whether given words appear in it.

For example:

words = c("Silla", "Sillas", "Perro", "asdfg")

I tried to use this solution:

grepl(paste(spanish_words, collapse = "|"), words) 

But there are too many Spanish words, and it gives me this error:

Error

So how can I do it? I also tried this:

toupper(words) %in% toupper(spanish_words)

Result

As you can see, this option only returns TRUE for exact matches, and I need "Sillas" to be TRUE as well (it is the plural of "silla"). That is why I tried grepl first: to catch plurals too.

Any idea?

zx8754
GonzaloReig
  • I don't think regex/grepl is the solution here, maybe use dedicated packages, `tm` package comes to mind, there might be more specialised dictionary style packages, too. – zx8754 Jul 11 '19 at 07:11
  • The error thrown by `grepl` is not because you have too many words in your pattern but because your regex is not valid. – Junitar Jul 11 '19 at 07:34
  • You need to put your regex in parentheses. You have `word|word2|word3...` but you should have `(word|word2|word3...)`. – January Jul 11 '19 at 07:43

1 Answer

As df:

library(dplyr)    # tibble(), mutate(), %>%
library(stringr)  # str_detect()

df <- tibble(text = c("some words", 
                      "more words", 
                      "Perro", 
                      "And asdfg", 
                      "Comb perro and asdfg"))

Vector of words:

words <- c("Silla", "Sillas", "Perro", "asdfg")
words <- tolower(paste(words, collapse = "|"))

Then use mutate and str_detect:

df %>% 
  mutate(
   text = tolower(text), 
   spanish_word = str_detect(text, words)
 )

Returns:

  text                 spanish_word
  <chr>                <lgl>       
1 some words           FALSE       
2 more words           FALSE       
3 perro                TRUE        
4 and asdfg            TRUE        
5 comb perro and asdfg TRUE    
tvdo
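
For the plural case raised in the question, one possible approach (a rough sketch, not the method used in the answer above) is to keep the exact `%in%` lookup and also test a few candidate singular forms of each word. The dictionary below is a tiny hypothetical stand-in for the real 80,000-word list, and `singular_candidates()` and `is_spanish()` are names made up for this illustration. The suffix rules are a naive assumption about Spanish plurals and will miss irregular forms.

```r
# Hypothetical stand-in for the real 80,000-word dictionary.
spanish_words <- c("silla", "perro", "luz", "flor")
words <- c("Silla", "Sillas", "Perro", "asdfg", "Flores", "Luces")

# Candidate singular forms for one lower-cased word.
# Naive rules (an assumption): the word itself, final "s" dropped,
# final "es" dropped, final "ces" turned into "z".
singular_candidates <- function(w) {
  unique(c(
    w,
    sub("s$", "", w),
    sub("es$", "", w),
    sub("ces$", "z", w)
  ))
}

# TRUE when the word, or any candidate singular form, is in the dictionary.
is_spanish <- function(words, dictionary) {
  dict <- tolower(dictionary)
  vapply(
    tolower(words),
    function(w) any(singular_candidates(w) %in% dict),
    logical(1),
    USE.NAMES = FALSE
  )
}

is_spanish(words, spanish_words)
# [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
```

Because this relies on `%in%` rather than one huge regular expression, it avoids building an 80,000-alternative pattern. The trade-off is that suffix stripping can give false positives (a short word may reduce to an unrelated dictionary entry), so a dedicated stemming or dictionary package may be a better fit for real data.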