
I'm trying to remove punctuation while tokenizing a sentence in Python, but I have several "conditions" where I want the tokenizer to ignore punctuation. Some examples are when I see a URL, an email address, or certain symbols without spaces next to them. Example:

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")

tokenizer.tokenize("please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode")

Right now the output looks like

['please', 'help', 'me', 'ignore', 'punctuation', 'like', 'or', 'but', 'at', 'the', 'same', 'time', "don't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i', 'e', 'google', 'com', 'or', 'google', 'co', 'uk', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname', 'shecode']

But what I really want it to look like is

['please', 'help', 'me', 'ignore', 'punctuation', 'like', 'or', 'but', 'at', 'the', 'same', 'time', "don't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i', 'e', 'google.com', 'or', 'google.co.uk', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname=shecode']

shecode
  • Try using "from nltk.tokenize import word_tokenize". I am not sure if it will serve your purpose, but try it once. Thanks. – Gunjan Oct 16 '17 at 05:42
  • You should a) pre-tokenize the input on spaces; b) check each piece to decide if it is a url or not; and c) handle urls and non-url tokens differently. – alexis Oct 16 '17 at 17:28
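A minimal sketch of the approach alexis describes above, under the assumption that a rough regex is good enough to spot the special pieces; the SPECIAL pattern and the rstrip('.') cleanup here are illustrative guesses, not a robust URL/email test:

import re
from nltk.tokenize import word_tokenize

# Very rough test for URL-like, email-like, or key=value pieces.
# Caveat: abbreviations such as "i.e." also match this loose pattern.
SPECIAL = re.compile(r"^(\w+\.)+\w+\.?$|^\S+@\S+$|^\w+=\w+$")

def tokenize_keeping_urls(text):
    tokens = []
    for piece in text.split():                # a) pre-tokenize on spaces
        if SPECIAL.match(piece):              # b) URL, email, or key=value?
            tokens.append(piece.rstrip('.'))  #    keep it whole, drop a final period
        else:                                 # c) tokenize normally, drop pure punctuation
            tokens.extend(t for t in word_tokenize(piece)
                          if any(c.isalnum() for c in t))
    return tokens

print(tokenize_keeping_urls("it looks like a url i.e. google.co.uk. such as myname=shecode"))
# ['it', 'looks', 'like', 'a', 'url', 'i.e', 'google.co.uk', 'such', 'as', 'myname=shecode']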

3 Answers


You can use a more complex tokenizer, e.g. the TreebankWordTokenizer used by nltk.word_tokenize; see How do I tokenize a string sentence in NLTK?:

>>> from nltk import word_tokenize
>>> text = "please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode"
>>> word_tokenize(text)
['please', 'help', 'me', 'ignore', 'punctuation', 'like', '.', 'or', ',', 'but', 'at', 'the', 'same', 'time', 'do', "n't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i.e', '.', 'google.com', 'or', 'google.co.uk', '.', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname=shecode']
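If the goal is only to drop the standalone punctuation tokens (rather than all stopwords), one option is to keep just the tokens that contain at least one alphanumeric character; the isalnum test is one possible criterion, not the only one:

>>> [t for t in word_tokenize(text) if any(c.isalnum() for c in t)]
['please', 'help', 'me', 'ignore', 'punctuation', 'like', 'or', 'but', 'at', 'the', 'same', 'time', 'do', "n't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i.e', 'google.com', 'or', 'google.co.uk', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname=shecode']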

And if you would like to remove the stopwords too, see Stopword removal with NLTK:

>>> from string import punctuation
>>> from nltk.corpus import stopwords
>>> from nltk import word_tokenize

>>> stoplist = stopwords.words('english') + list(punctuation)

>>> text = "please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode"

>>> word_tokenize(text)
['please', 'help', 'me', 'ignore', 'punctuation', 'like', '.', 'or', ',', 'but', 'at', 'the', 'same', 'time', 'do', "n't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i.e', '.', 'google.com', 'or', 'google.co.uk', '.', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname=shecode']

>>> [token for token in word_tokenize(text) if token not in stoplist]
['please', 'help', 'ignore', 'punctuation', 'like', 'time', "n't", 'ignore', 'looks', 'like', 'url', 'i.e', 'google.com', 'google.co.uk', 'Sometimes', 'I', 'also', 'want', 'conditions', 'I', 'see', 'equals', 'sign', 'words', 'myname=shecode']
alvas

Change your regex to the following expression

tokenizer = RegexpTokenizer(r"[\w+.]+")

In a regex, `.` normally means any character, but inside a character class `[ ]` it matches a literal period.

Your original pattern `[\w']+` does not include `.` in the character class, so the tokenizer was splitting on it. Adding `.` to the class prevents the splitting on `.`.
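Note that this pattern drops the apostrophe the original pattern kept and still splits myname=shecode on =. A variant that covers both might look like the following sketch; the trade-off is that sentence-final periods stay attached to the last token:

from nltk.tokenize import RegexpTokenizer

# '.', '=' and the apostrophe are all literal inside the character class
tokenizer = RegexpTokenizer(r"[\w'.=]+")
print(tokenizer.tokenize("don't split google.co.uk. but keep myname=shecode"))
# ["don't", 'split', 'google.co.uk.', 'but', 'keep', 'myname=shecode']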

chaitan64arun
arjunsv3691
  • Hi, sometimes I want it to split on it though; it will be conditional. Perhaps if we see a ".com" or ".co" then we don't want it to be split, does that make sense? – shecode Oct 16 '17 at 04:50
  • in regex `.` means any character, except between bracket `[`and `]` – Indent Oct 16 '17 at 06:19

Try this code and see if it works for you.

from nltk.tokenize import word_tokenize
punct_list = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
s = "please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode"
print([i.strip("".join(punct_list)) for i in word_tokenize(s) if i not in punct_list])
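The hand-typed list is exactly the characters in string.punctuation, so it can also be built programmatically:

from string import punctuation

punct_list = list(punctuation)  # same 32 ASCII punctuation characters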

Check How to remove punctuation? as well.

Gunjan