Regex to get rid of the last term with conditions

Question

I would like to write a regex that removes the last character from a string if that character is an s.

However, I would like to keep the s if it is preceded by another s.

Example.

The output of Apples should be Apple.
The output of Process should be Process.

I need a regex that would capture the whole term if the expression is matched but would perform the replacement for a partial match.

So far I have used `s$` to remove the last character.

See [*Good way to add terms to python pattern singularize*](http://stackoverflow.com/questions/23586591/good-way-to-add-terms-to-python-pattern-singularize) — Wiktor Stribiżew, May 17 '16 at 10:10
Regex is not a good solution for this problem; your best bet for a Python-based solution is `nltk` – Javier Buzzi, May 17 '16 at 13:15

Javier Buzzi · Accepted Answer · 2016-05-17T13:14:02.240

4

This has been discussed many times, and the consensus is always the same: it is too complicated to be handled by a simple regex. The regex solutions proposed so far each fail on at least one of these examples:

apples
carrots
process
processes
tennis
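
As a quick check, the sketch below runs the lookbehind pattern `(?<!s)s$` (the one suggested in the other answers) over these words. The variable names are only illustrative.

```python
import re

# Lookbehind pattern suggested in the other answers:
# drop a final "s" unless the character before it is also "s".
pattern = re.compile(r"(?<!s)s$")

words = ["apples", "carrots", "process", "processes", "tennis"]
results = {word: pattern.sub("", word) for word in words}

for word, stripped in results.items():
    print(f"{word} -> {stripped}")
# apples -> apple
# carrots -> carrot
# process -> process
# processes -> processe
# tennis -> tenni
```

The first three come out right, but `processes` and `tennis` are mangled.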

A solution is to use morpha:

git clone https://github.com/knowitall/morpha
cd morpha/
flex -i -Cfea -8 -omorpha.yy.c morpha.lex
gcc -o morpha morpha.yy.c
curl -s https://raw.githubusercontent.com/jhlau/predom_sense/master/lemmatiser_tools/morpha/verbstem.list > verbstem.list

Now to test (assuming test.txt contains the five words above, one per line):

cat test.txt | ./morpha -c
apple
carrot
process
process
tennis

If you want a Python solution, I suggest you go with nltk.

virtualenv env-nltk
source env-nltk/bin/activate
pip install nltk
python -c "import nltk; nltk.download()" # <- just get the whole thing, click "all" and then "download" on the "collections" tab

Now that everything is downloaded, let's fire up Python and play with it.

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('apples')
u'apple'
>>> lmtzr.lemmatize('tennis')
'tennis'
>>> lmtzr.lemmatize('process')
'process'
>>> lmtzr.lemmatize('processes')
u'process'

edited May 17 '16 at 13:14

answered May 17 '16 at 12:18

Javier Buzzi



Hi Javier, thanks for the answer. I have been using the snowball stemmer, but it stems even the 'e' if it is preceded by an 's'. I didn't actually think of WordNetLemmatizer. Thanks for pointing it out – Sam May 18 '16 at 10:43
One other thing is that WordNetLemmatizer wouldn't take care of "apple's". Do you know of any other stemmer that could handle this scenario and the earlier ones (apples, process)? – Sam May 18 '16 at 10:50
@Sam maybe https://pypi.python.org/pypi/inflect? I don't know; my naive first attempt would be `re.sub(r"'(s?)($| )", r'\g<1>\g<2>', "apples' apple's")`. Run the text through that first, then `.lemmatize()` it. If that doesn't help, let me know and I can do a bit more research – Javier Buzzi May 18 '16 at 11:14

Thanks Javier. Anyway, I have preprocessed my sentences to remove the apostrophe ('), so "apple's" becomes "apples", and then I lemmatize. Let me check whether everything works. I will let you know if I run into any problems. Thanks a lot – Sam May 18 '16 at 12:29
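
The apostrophe preprocessing discussed in these comments could be sketched roughly as follows. This is a minimal illustration built on the regex from the comment above, not a complete possessive handler; the helper name `strip_possessive` is hypothetical, and the result would still need to be passed through `.lemmatize()` afterwards.

```python
import re

# Hypothetical helper: drop a possessive apostrophe ("apple's", "apples'")
# only when it is followed by a space or the end of the string.
def strip_possessive(text):
    return re.sub(r"'(s?)($| )", r"\g<1>\g<2>", text)

print(strip_possessive("apples' apple's"))  # apples apples
print(strip_possessive("don't stop"))       # don't stop
```

Because the pattern requires a space or the end of the string after the optional s, contractions such as "don't" are left alone.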

anubhava · Answer 2 · 2016-05-17T10:11:30.303

2

You can use this negative lookbehind assertion:

(?<!s)s$


Breakup:

(?<!s)  # assert previous position doesn't have 's'
s       # match 's'
$       # assert end of line
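
In Python, this breakdown can be written directly as a commented pattern using the `re.VERBOSE` flag. The sketch below is one way to do it; the names `FINAL_S` and `strip_final_s` are made up for illustration.

```python
import re

# Same pattern as above, spelled out with inline comments via re.VERBOSE.
FINAL_S = re.compile(
    r"""
    (?<!s)  # previous character is not 's'
    s       # match a literal 's'
    $       # end of the string
    """,
    re.VERBOSE,
)

def strip_final_s(word):
    """Remove a trailing 's' unless it is preceded by another 's'."""
    return FINAL_S.sub("", word)

print(strip_final_s("Apples"))   # Apple
print(strip_final_s("Process"))  # Process
```

Note that this only covers the two cases from the question; a word such as "Processes" would still come out as "Processe".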

edited May 17 '16 at 10:11

answered May 17 '16 at 10:10

anubhava



`Processes` will result in `Processe` – Wiktor Stribiżew May 17 '16 at 10:11
Right, there can be other words like that in English too. This follows the OP's requirement: *I would like to retain the (s) if it is preceded by another (s)* – anubhava May 17 '16 at 10:12

I would hardly call it a requirement; I would say it is more like a very superficial example. This "solution" only seems to work with those two examples, but fails with "tennis" and, as @WiktorStribiżew said, "Processes" – Javier Buzzi May 17 '16 at 13:16

score 0 · Answer 3 · answered May 17 '16 at 10:13


You could use a negative lookbehind assertion to ensure the substitution happens only if the s is not preceded by another s.

>>> import re
>>> re.sub(r'(?<!s)s$', '', 'Apples')
'Apple'
>>> re.sub(r'(?<!s)s$', '', 'Process')
'Process'

answered May 17 '16 at 10:13

riteshtch

