
I am trying to split a string on multiple delimiters while keeping the delimiters as separate words. The delimiters are all punctuation marks and whitespace; the punctuation should be kept, the whitespace should not.

For example, the string:

Je suis, FOU et toi ?!

Should produce:

'Je'
'suis'
','
'FOU'
'et'
'toi'
'?'
'!'

I wrote:

import re

class Parser :
    def __init__(self) :
        """Empty constructor"""

    def read(self, file_name) :
        from string import punctuation
        with open(file_name, 'r') as file :
            for line in file :
                for word in line.split() :
                    r = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
                    print(r.split(word))

But the result I got is:

['Je']
['suis', '']
['FOU']
['et']
['toi']
['', '']

The split positions seem to be correct, but the resulting lists do not contain the delimiters.

hiveship

1 Answer


You need to put your expression into a capturing group for re.split() to preserve the delimiters. I wouldn't split on whitespace first; you can always remove whitespace-only strings later. If you want each punctuation character as a separate item, apply the + quantifier to the \s whitespace class only:

# do this just once, not in a loop
pattern = re.compile(r'(\s+|[{}])'.format(re.escape(punctuation)))

# for each line
parts = [part for part in pattern.split(line) if part.strip()]

The list comprehension removes empty strings and anything that consists only of whitespace:

>>> import re
>>> from string import punctuation
>>> line = 'Je suis, FOU et toi ?!'
>>> pattern = re.compile(r'(\s+|[{}])'.format(re.escape(punctuation)))
>>> pattern.split(line)
['Je', ' ', 'suis', ',', '', ' ', 'FOU', ' ', 'et', ' ', 'toi', ' ', '', '?', '', '!', '']
>>> [part for part in pattern.split(line) if part.strip()]
['Je', 'suis', ',', 'FOU', 'et', 'toi', '?', '!']
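To see why the capturing group matters, here is a minimal sketch on a shorter, made-up input (the string 'a, b' and the variable names are illustrative only). The same pattern is used twice, once without and once with parentheses:

```python
import re

line = 'a, b'

# Without a capturing group, re.split() discards whatever the pattern matched.
no_group = re.split(r',|\s+', line)

# With a capturing group, each matched delimiter is kept in the result list.
with_group = re.split(r'(,|\s+)', line)

print(no_group)    # ['a', '', 'b']
print(with_group)  # ['a', ',', '', ' ', 'b']
```

The empty string appears because the comma and the space are adjacent, so there is nothing between the two delimiters; this is why the filtering step is still needed.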

Rather than split, you can also use re.findall() to find all words and individual punctuation characters:

pattern = re.compile(r'\w+|[{}]'.format(re.escape(punctuation)))

parts = pattern.findall(line)

This has the advantage that you don't need to filter out whitespace:

>>> pattern = re.compile(r'\w+|[{}]'.format(re.escape(punctuation)))
>>> pattern.findall(line)
['Je', 'suis', ',', 'FOU', 'et', 'toi', '?', '!']
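Applied to the Parser class from the question, the findall() approach might look like the sketch below. The tokenize() helper and the TOKEN_PATTERN name are hypothetical, introduced only for illustration; the pattern is compiled once instead of once per word:

```python
import re
from string import punctuation

# Compiled once at import time rather than inside the loop.
# TOKEN_PATTERN and tokenize() are illustrative names, not part of any library.
TOKEN_PATTERN = re.compile(r'\w+|[{}]'.format(re.escape(punctuation)))


def tokenize(line):
    """Return the words and individual punctuation characters in line."""
    return TOKEN_PATTERN.findall(line)


class Parser:
    def read(self, file_name):
        """Print the list of tokens for each line of the file."""
        with open(file_name, 'r') as file:
            for line in file:
                print(tokenize(line))
```

Note that \w also matches digits and the underscore, so something like foo_bar stays a single token even though _ is in string.punctuation.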
Martijn Pieters