I have a character variable with many words. For example...

    words   
1   funnel  
2   funnels
3   sprout
4   sprouts
5   sprouts.
6   chicken
7   chicken)
8   chicken(2)

Many of the words are the same, just with an s on the end or a stray symbol such as ) or . added as a typo.

I want to find words that are plurals/singulars of each other, so I can remove the s from the end and keep only the singular forms.

I also want to remove all the symbols from the end which are typos. For example:
  • remove chicken) because its parenthesis is not balanced
  • but preserve chicken(2)

My current attempt is:

# Find words that end in `s`
grep("s$", df$words, ignore.case = TRUE, value = T)
# Remove the `s` from the end of words
df$words <- gsub("s$", "", df$words, ignore.case = T)
# Remove any typos with symbols at the end of a word
gsub("[^A-z|0-9]|$", "", df$words)

The last line also changes words such as chicken(2), which I do not want to edit.

  1. This shows me many plural words (words that end in s), but I have no idea whether a singular version (the same word without the s) is also present.

  2. How can I find words that end in stray punctuation typos (e.g. ), ., !) and remove them? That is, remove unbalanced parentheses such as chicken), but leave chicken(2) alone.

The desired output would be:

    words   
1   funnel  
2   funnel
3   sprout
4   sprout
5   sprout
6   chicken
7   chicken
8   chicken(2)
oguz ismail
user3200293
  • Well, you really made a typo. `df$words`, not `df$wprds`. Look at [this demo](https://ideone.com/4cw8Gd). However, that really does not check if there are 2 words that only differ in final `s`. – Wiktor Stribiżew Jan 14 '16 at 13:04
  • this is an example dataset, the issue was that gsub is case sensitive, but i've fixed that. I now need to remove the symbols on the end of the words – user3200293 Jan 14 '16 at 13:11
  • `gsub("[^A-z]$","",df$words)` should remove a single non letter from the end – R. Schifini Jan 14 '16 at 13:15
  • I want to keep digits (0-9), but not symbols. – user3200293 Jan 14 '16 at 13:16
  • @R.Schifini: [\[A-z\] and \[a-zA-Z\] difference](http://stackoverflow.com/questions/4923380/difference-between-regex-a-z-and-a-za-z). To remove all non-letters and non-digits, use `[^[:alnum:]]+$`. Try `gsub("[^[:alnum:]]+$", "", df$words)` – Wiktor Stribiżew Jan 14 '16 at 13:17
  • I thought I had found a solution using `grep("[^A-z|0-9]$",tmp$EVTYPE, value = T)`, however it includes words such as `cat(2)`, which are fine. I only want to remove typos, such as `cat)`. How can I preserve words which end in a pair of brackets with a digit inside? For example, remove `cat)` but preserve `cat(2)`. – user3200293 Jan 14 '16 at 13:22
  • Do you mean you only want to keep words that can have (not necessarily) balanced parentheses? And you actually wanted to use `[^A-Za-z0-9]$`. – Wiktor Stribiżew Jan 14 '16 at 13:28
  • yes, I want to find words that do not have balanced parentheses, as well as words that end in fullstops or slashes. – user3200293 Jan 14 '16 at 13:32
  • :) That is a nightmare. See this demo: [`df$words <- gsub("(?s)^(?![^()]*+(\\((?>[^()]|(?1))*+\\)[^()]*+)++)(.*?)[^[:alnum:]]*$", "\\2", df$words, perl=T)`](https://ideone.com/XpZYq4) – Wiktor Stribiżew Jan 14 '16 at 13:39
  • After `++` there must be a `$`. Updated the above demo. – Wiktor Stribiżew Jan 14 '16 at 13:49
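
The character-class points raised in the comments can be checked with a small sketch. The vector `x` and the result names below are made up for illustration, and `perl = TRUE` is used only because range interpretation in the default engine can depend on the locale. It suggests that `[A-z]` also matches symbols such as `_` and `^`, that `|` inside a class is a literal pipe, and that stripping every trailing non-alphanumeric character also breaks `chicken(2)`.

```r
# Illustrative single-character inputs (not from the original data)
x <- c("a", "Z", "_", "^", "[", "|", "9", ")")

# [A-z] spans the ASCII characters between "Z" and "a", so it matches some symbols
wide_range   <- grepl("^[A-z]$", x, perl = TRUE)
letters_only <- grepl("^[A-Za-z]$", x, perl = TRUE)

# Inside a character class, "|" is a literal pipe, not alternation
with_pipe <- grepl("^[A-z|0-9]$", x, perl = TRUE)

# Removing every trailing non-alphanumeric run also removes the ")" of "chicken(2)"
stripped <- gsub("[^[:alnum:]]+$", "", c("sprouts.", "chicken)", "chicken(2)"))
stripped
```

The last result shows why a plain "strip trailing symbols" pattern is not enough on its own: the balanced `(2)` suffix needs to be protected separately.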

1 Answer

The str_replace_all function from the stringr package applies each pattern and its replacement in turn to a vector of strings. You might try:

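library(stringr)
# `words` is assumed to be the character vector from the question, e.g. df$words
# Applied in order: drop one trailing symbol, drop a final "s", append ")" after each "(" plus digits
str_replace_all(words, c("[^[:alnum:]]$" = "",  "s$" = "", "(\\(\\d*)" = "\\1\\)" ))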
WaltS
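
As a separate illustration (not part of the answer above), here is a minimal base R sketch of the two steps the question asks for. The helper names `strip_trailing_symbols` and `singularise_if_seen` are hypothetical. The sketch assumes that a word ending in `(digits)` counts as balanced and should be left alone, and that a final `s` should only be removed when the shorter form also occurs in the same vector. Matching is case-sensitive here.

```r
words <- c("funnel", "funnels", "sprout", "sprouts", "sprouts.",
           "chicken", "chicken)", "chicken(2)")

# Hypothetical helper: strip trailing symbols, but keep a final "(digits)" group
strip_trailing_symbols <- function(x) {
  keep <- grepl("\\([0-9]+\\)$", x)  # TRUE where the word ends in "(digits)"
  x[!keep] <- gsub("[^[:alnum:]]+$", "", x[!keep])
  x
}

# Hypothetical helper: drop a final "s" only if the shorter form also occurs in x
singularise_if_seen <- function(x) {
  stem <- sub("s$", "", x)
  is_plural <- stem != x & stem %in% x
  x[is_plural] <- stem[is_plural]
  x
}

cleaned <- singularise_if_seen(strip_trailing_symbols(words))
cleaned
```

Checking `stem %in% x` addresses the first point in the question: a word such as `grass` would be left alone unless `gras` also appears. The parenthesis check only looks at the final group, so inputs like `chicken(2))` would need extra handling.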