Finding similar words (plurals and singulars) using grep in R

Question

I have a character variable with many words. For example...

    words   
1   funnel  
2   funnels
3   sprout
4   sprouts
5   sprouts.
6   chicken
7   chicken)
8   chicken(2)

Many of the words are the same, differing only by an s on the end or a stray symbol such as ) or . added as a typo.

I want to find words that are plurals/singulars of each other, so I can remove the s from the end and keep only the singular forms.

I also want to remove all the symbols from the end which are typos. For example, remove the ) from chicken) because the parenthesis is not balanced, but preserve chicken(2).

My current attempt has been:

# Find words that end in `s`
grep("s$", df$words, ignore.case = TRUE, value = T)
# Remove the `s` from the end of words
df$words <- gsub("s$", "", df$words, ignore.case = T)
# Remove any typos with symbols at the end of a word
gsub("[^A-z|0-9]|$", "", df$words)

My final code also includes words such as chicken(2), which I do not wish to edit.

This shows me many plural words (words that end in s); however, I have no idea whether a singular version (the same word without the s) also exists.
How can I find words that end in punctuation typos (e.g. (, ., !) and remove those symbols? That is, remove unbalanced parentheses as in chicken), but leave chicken(2) alone.

The desired output is:

    words   
1   funnel  
2   funnel
3   sprout
4   sprout
5   sprout
6   chicken
7   chicken
8   chicken(2)

Well, you really made a typo. `df$words`, not `df$wprds`. Look at [this demo](https://ideone.com/4cw8Gd). However, that really does not check if there are 2 words that only differ in final `s`. — Wiktor Stribiżew, Jan 14 '16 at 13:04
This is an example dataset. The issue was that gsub is case sensitive, but I've fixed that. I now need to remove the symbols on the end of the words — user3200293, Jan 14 '16 at 13:11
`gsub("[^A-z]$","",df$words)` should remove a single non letter from the end — R. Schifini, Jan 14 '16 at 13:15
@R.Schifini: [\[A-z\] and \[a-zA-Z\] difference](http://stackoverflow.com/questions/4923380/difference-between-regex-a-z-and-a-za-z). To remove all non-letters and non-digits, use `[^[:alnum:]]+$`. Try `gsub("[^[:alnum:]]+$", "", df$words)` — Wiktor Stribiżew, Jan 14 '16 at 13:17
I thought I had found a solution using `grep("[^A-z|0-9]$",tmp$EVTYPE, value = T)`, however it also matches words such as `cat(2)`, which are fine. I only want to remove typos, such as `cat)`. How can I preserve words which end in a pair of brackets with a digit inside? For example, remove the `)` from `cat)` but preserve `cat(2)` — user3200293, Jan 14 '16 at 13:22
Do you mean you only want to keep words that can have (not necessarily) balanced parentheses? And you actually wanted to use `[^A-Za-z0-9]$`. — Wiktor Stribiżew, Jan 14 '16 at 13:28
Yes, I want to find words that do not have balanced parentheses, as well as words that end in full stops or slashes. — user3200293, Jan 14 '16 at 13:32
:) That is a nightmare. See this demo: [`df$words <- gsub("(?s)^(?![^()]*+(\\((?>[^()]|(?1))*+\\)[^()]*+)++)(.*?)[^[:alnum:]]*$", "\\2", df$words, perl=T)`](https://ideone.com/XpZYq4) — Wiktor Stribiżew, Jan 14 '16 at 13:39
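
The comment thread narrows the symbol problem down to: strip trailing non-alphanumeric characters, unless the word ends in a balanced group such as `(2)`. Below is a minimal base R sketch of that rule. It assumes the only parenthesised suffix worth keeping is a trailing `(digits)` group, which is a much narrower test than general balanced-parenthesis matching. The helper name `strip_trailing_symbols` is hypothetical, and `[^[:alnum:]]` is used instead of `[^A-z]` because the range `A-z` also covers some punctuation characters.

```r
# Sample data from the question
words <- c("funnel", "funnels", "sprout", "sprouts", "sprouts.",
           "chicken", "chicken)", "chicken(2)")

# strip_trailing_symbols() is a hypothetical helper name.
# Assumption: a trailing "(digits)" group is intentional and must be kept.
strip_trailing_symbols <- function(x) {
  # TRUE for words that end in a "(2)"-style group
  keep <- grepl("\\([0-9]+\\)$", x)
  # For all other words, drop every trailing non-alphanumeric character
  x[!keep] <- sub("[^[:alnum:]]+$", "", x[!keep])
  x
}

cleaned <- strip_trailing_symbols(words)
print(cleaned)
```

On the sample data this leaves `chicken(2)` untouched and turns `sprouts.` and `chicken)` into `sprouts` and `chicken`. Words with nested or mid-word parentheses would need a stricter check than this sketch provides.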

score 2 · Answer 1 · answered Jan 14 '16 at 15:30

The str_replace_all function from the stringr package takes a named vector of patterns and replacements and applies them successively to a vector of strings. You might try:

library(stringr)
str_replace_all(words, c("[^[:alnum:]]$" = "",  "s$" = "", "(\\(\\d*)" = "\\1\\)" ))
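
Note that the `s$` pattern removes every final s, whether or not a singular form is present. If only plurals with a matching singular should be changed, as the question asks, one possible approach is to test each candidate singular against the vector itself. The following is a base R sketch under some assumptions: the words are already lower case and free of trailing symbols, and `grass` is an added illustrative entry that is not in the question's data. The variable names are hypothetical.

```r
# Sample data after trailing typo symbols have been removed;
# "grass" is an added illustrative entry, not from the question
words <- c("funnel", "funnels", "sprout", "sprouts", "sprouts",
           "chicken", "chicken", "chicken(2)", "grass")

# Candidate singular: the word with one final "s" removed
candidate <- sub("s$", "", words)

# A word counts as a plural only if it ends in "s" AND its candidate
# singular also occurs somewhere in the vector
has_singular <- grepl("s$", words) & candidate %in% words

# Plurals that have a matching singular
print(words[has_singular])

# Replace only those plurals; everything else is left as it was
singularised <- ifelse(has_singular, candidate, words)
print(singularised)
```

Here `funnels` and `sprouts` are reduced to `funnel` and `sprout`, while `grass` is kept because `gras` does not occur in the data. The comparison is case sensitive, so mixed-case data would need `tolower()` applied first.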

WaltS
