I'm trying to come up with the regular expression to split a string up into a list based on white space or trailing punctuation.

e.g.

s = 'hel-lo  this has whi(.)te, space. very \n good'

What I want is

['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

s.split() gets me most of the way there, except it doesn't split off the trailing punctuation.
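To make the gap concrete, here is a small sketch of what plain s.split() returns for the example string. The name `tokens` is only for illustration.

```python
s = 'hel-lo  this has whi(.)te, space. very \n good'

# str.split() with no arguments splits on any run of whitespace,
# so the comma and period stay glued to the preceding word.
tokens = s.split()
print(tokens)
# ['hel-lo', 'this', 'has', 'whi(.)te,', 'space.', 'very', 'good']
```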

martineau
Kewl

3 Answers

import re
s = 'hel-lo  this has whi(.)te, space. very \n good'
[x for x in re.split(r"([.,!?]+)?\s+", s) if x]
# => ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

You might need to tweak what "punctuation" is.
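One way to make the punctuation set tweakable is to build the character class from a parameter. This is only a sketch: the helper name `split_words` and its `punct` argument are made up for illustration, and the pattern also accepts end of string after the punctuation, which the one-liner above does not handle.

```python
import re

def split_words(text, punct=".,!?;:"):
    """Split on whitespace, emitting trailing punctuation as its own token.

    `punct` is a hypothetical parameter listing the characters that count
    as trailing punctuation.
    """
    char_class = "[" + re.escape(punct) + "]"
    # Optional captured run of punctuation, followed by whitespace or end of string.
    pattern = re.compile("(" + char_class + r"+)?(?:\s+|$)")
    # re.split yields None for the unmatched group and '' at the edges; drop both.
    return [tok for tok in pattern.split(text) if tok]

s = 'hel-lo  this has whi(.)te, space. very \n good'
print(split_words(s))
# ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
print(split_words('so good.'))
# ['so', 'good', '.']
```

Punctuation inside a word, such as the period in whi(.)te, is left alone because it is not followed by whitespace or the end of the string.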

Amadan

A rough solution using spaCy, which already does a good job of tokenizing words.

import spacy
s = 'hel-lo  this has whi(.)te, space. very \n good'
nlp = spacy.load('en_core_web_sm')  # spaCy 3 dropped the 'en' shortcut; the model must be installed
ls = [t.text for t in nlp(s) if t.text.strip()]

>> ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

However, it also splits words at the hyphen, so I borrowed a solution from elsewhere to merge the pieces around each - back together.

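# (start, end) slice bounds around every '-' token; tok avoids shadowing the string s
merge = [(i - 1, i + 2) for i, tok in enumerate(ls) if i >= 1 and tok == '-']
for start, end in merge[::-1]:  # go right to left so earlier indices stay valid
    ls[start:end] = [''.join(ls[start:end])]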

>> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
titipata

I am using Python 3.6.1.

import re

s = 'hel-lo  this has whi(.)te, space. very \n good'
a = []  # this list stores the items
for i in s.split():  # split on whitespace
    j = re.split(r'([,.])$', i)  # split off one trailing comma or period, keeping it
    if len(j) > 1:
        a.extend(j[:-1])  # drop the empty string left after the punctuation
    else:
        a.append(i)
# a -> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
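The same per-token idea can be sketched without a regular expression by using str.rstrip. The helper names `split_trailing` and `tokenize` are made up for illustration. Unlike the loop above, this version treats a run of trailing punctuation as a single token and never emits an empty string when a token consists only of punctuation.

```python
def split_trailing(token, punct=",."):
    """Return [word, trailing_punctuation], omitting whichever part is empty."""
    core = token.rstrip(punct)   # the token without its trailing punctuation
    tail = token[len(core):]     # whatever rstrip removed
    return [part for part in (core, tail) if part]

def tokenize(text, punct=",."):
    out = []
    for token in text.split():   # split on whitespace first
        out.extend(split_trailing(token, punct))
    return out

s = 'hel-lo  this has whi(.)te, space. very \n good'
print(tokenize(s))
# ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
print(tokenize('wait , what..'))
# ['wait', ',', 'what', '..']
```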
Ancora Imparo