Tokenize based on white space and trailing punctuation?

Question

I'm trying to come up with the regular expression to split a string up into a list based on white space or trailing punctuation.

e.g.

s = 'hel-lo  this has whi(.)te, space. very \n good'

What I want is

['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

s.split() gets me most of the way there, except it doesn't take care of the trailing punctuation.
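To make the gap concrete, here is a quick check of what plain `s.split()` returns on the example string. The variable name `tokens` is only for illustration.

```python
s = 'hel-lo  this has whi(.)te, space. very \n good'

# str.split() with no arguments splits on runs of any whitespace,
# so the double space and the newline are handled already.
tokens = s.split()
print(tokens)
# ['hel-lo', 'this', 'has', 'whi(.)te,', 'space.', 'very', 'good']
```

The comma and the full stop stay glued to `whi(.)te,` and `space.`, which is the part still left to solve.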

Are other libraries allowed too, or do you want to use only regular expressions? — titipata, Apr 27 '17 at 01:59

score 3 · Accepted Answer · answered Apr 27 '17 at 02:26


import re
s = 'hel-lo  this has whi(.)te, space. very \n good'
[x for x in re.split(r"([.,!?]+)?\s+", s) if x]
# => ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

You might need to tweak what "punctuation" is.
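One way to do that tweaking, as a rough sketch: build the character class from a configurable set of characters. The helper name `tokenize` and its `punct` parameter are hypothetical, and defaulting to `string.punctuation` is an assumption about what should count as punctuation, not something the answer specifies.

```python
import re
import string

def tokenize(text, punct=string.punctuation):
    """Split on whitespace, keeping a run of trailing punctuation as its own token."""
    # re.escape keeps characters such as ']' '\\' '^' and '-' literal inside the class.
    pattern = r"([" + re.escape(punct) + r"]+)?\s+"
    # re.split yields None when the optional group did not take part in a match,
    # so falsy items (None and '') are filtered out.
    return [tok for tok in re.split(pattern, text) if tok]

s = 'hel-lo  this has whi(.)te, space. very \n good'
print(tokenize(s))
# ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
print(tokenize('wait... what?! ok', punct='.?!'))
# ['wait', '...', 'what', '?!', 'ok']
```

Because the pattern requires whitespace after the punctuation, a mark at the very end of the string (with nothing after it) stays attached to its word.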

Amadan


score 0 · Answer 2 · edited May 23 '17 at 12:34

Rough solution using spacy. It already does a pretty good job of tokenizing words.

import spacy
s = 'hel-lo  this has whi(.)te, space. very \n good'
nlp = spacy.load('en')  # on spaCy 3+, load a model package by name instead, e.g. 'en_core_web_sm'
ls = [t.text for t in nlp(s) if t.text.strip()]

>> ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

However, it also splits words at each -, so I borrowed a solution from here to merge the pieces around each - back together.

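# (start, end) slice bounds around each standalone '-' token
merge = [(i-1, i+2) for i, tok in enumerate(ls) if i >= 1 and tok == '-']
# work from right to left so earlier indices stay valid as the list shrinks
for start, end in merge[::-1]:
    ls[start:end] = [''.join(ls[start:end])]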

>> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
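The merge step does not depend on spacy itself, so it can be tried on a plain list of strings. Below is a minimal single-pass sketch of the same idea. The function name `merge_hyphenated` is hypothetical, and it assumes the hyphen arrives as a standalone `'-'` token, as in the output above.

```python
def merge_hyphenated(tokens):
    """Rejoin 'left', '-', 'right' triples into a single 'left-right' token."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Only merge when the hyphen has a token on both sides.
        if tok == '-' and merged and i + 1 < len(tokens):
            merged[-1] = merged[-1] + '-' + tokens[i + 1]
            i += 2  # skip the hyphen and the token just absorbed
        else:
            merged.append(tok)
            i += 1
    return merged

print(merge_hyphenated(['hel', '-', 'lo', 'this', 'has']))
# ['hel-lo', 'this', 'has']
```

Chained hyphens such as `['a', '-', 'b', '-', 'c']` collapse to `['a-b-c']`, and a hyphen at the very start or end of the list is left alone.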

score 0 · Answer 3 · edited Apr 27 '17 at 04:10

I am using Python 3.6.1.

import re

s = 'hel-lo  this has whi(.)te, space. very \n good'
a = []  # collected tokens
for i in s.split():  # split on whitespace
    j = re.split(r'([,.])$', i)  # split off one trailing punctuation mark; adjust the class to your definition
    if len(j) > 1:
        a.extend(j[:-1])  # drop the empty string re.split leaves after the match
    else:
        a.append(i)
# a -> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

edited Apr 27 '17 at 04:10

answered Apr 27 '17 at 04:03

Ancora Imparo
