Getting rid of beginning and ending characters when using re.split()

Question

I am trying to understand re.split(). I want to ignore comma separators, periods, and dashes.

What I am not understanding is why I get an empty string at the end of my result.

I also cannot figure out how to ignore, say, a comma.

Here is my test code:

import re

sntc = 'this is a sentence total $5678 fees: expenses $123,345 why not -2345 hey.'

test = re.split(r'\D*', sntc)
print(test)

I get the following output:

['', '5678', '123', '345', '2345', '']

Obviously, split picks up too much. I can deal with that by using a different Regex approach, but what I can’t figure out is why '' is on either end of the result.

score 0 · Answer 1 · answered Jan 17 '19 at 04:13

Because split treats every match of the regex as a separator. ' hey.' matches the regex, so it separates 2345 from the end of the string.

So what you're getting is '2345 hey.' being split into '2345' and '', with ' hey.' in between them.

Similarly, if your separator were 'a' and you had the string 'aba', you'd get the result ['', 'b', ''], because each 'a' separates the beginning or end of the string from the 'b' in the middle.
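
To make this concrete, here is a minimal sketch that reproduces the 'aba' case and then drops the empty edge fields from the question's sentence with a list comprehension. The name parts is only illustrative, and the pattern \D+ is assumed as the separator.

```python
import re

# A separator at either edge of the string leaves an empty field on that side.
edges = re.split('a', 'aba')
print(edges)  # ['', 'b', '']

sntc = 'this is a sentence total $5678 fees: expenses $123,345 why not -2345 hey.'

# Split on runs of non-digits, then keep only the non-empty pieces.
parts = [p for p in re.split(r'\D+', sntc) if p]
print(parts)  # ['5678', '123', '345', '2345']
```

Filtering after the split keeps re.split() in place; whether that reads better than re.findall() is a matter of taste.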

score 0 · Answer 2 · answered Jan 17 '19 at 04:38

The documentation for re.split() is explicit about this:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

I think you'd be better off using re.findall(r'\d+', sntc) here.

Til · Accepted Answer · 2019-01-17T04:51:10.927

I think you really want this:

>>> re.findall(r'\d+', sntc)
['5678', '123', '345', '2345']

Your regex has a small problem: \D* can also match the empty string, so on Python 3.7 and later it ends up like this:

>>> re.split(r'\D*', sntc)
['', '', '5', '6', '7', '8', '', '1', '2', '3', '', '3', '4', '5', '', '2', '3', '4', '5', '', '']

I think what you intended to do is:

>>> re.split(r'\D+', sntc)
['', '5678', '123', '345', '2345', '']

However, this is what split is about: it splits things, even if that leaves nothing on one side.
Consider CSVs, or tab-separated xls files.
They are designed like that: even if there is nothing between the commas or tabs, those columns still exist as blank columns.

The \D+ here works like the comma or tab. It acts as a column delimiter: whether or not anything comes before it, it denotes that a new column starts after it.
The same goes for the text matched by the last \D+: whether or not anything follows it, it still denotes a new column after it.
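
The column analogy can be sketched with a plain str.split() next to re.split(). The sample strings and the names row, columns and fields are made up for illustration.

```python
import re

# A CSV-like row with blank first, third and last columns.
row = ',b,,d,'
columns = row.split(',')
print(columns)  # ['', 'b', '', 'd', '']

# Runs of non-digits play the role of the comma here:
# the leading and trailing runs still open blank "columns".
fields = re.split(r'\D+', 'abc12def345ghi')
print(fields)  # ['', '12', '345', '']
```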

score 0 · Answer 4 · answered Jan 17 '19 at 04:50

Thank you Amber and Arount.

Here’s how I’ve implemented it:

whatup = sntc.replace(',', '')
# gets rid of thousands separators
testing = re.findall(r'[0-9,.-]+', whatup)
# gets rid of everything but the pos and neg numbers

And I guess I don’t need the comma in the re. I then cast the strings to numbers and off I go.
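
One thing to watch when casting: a character class of digits, commas, periods and hyphens also matches the lone period that ends the sentence, and int() would fail on it. The sketch below is one possible tightening, assuming the amounts are whole numbers and that a hyphen directly before digits is a minus sign. The names no_commas, loose and numbers are illustrative.

```python
import re

sntc = 'this is a sentence total $5678 fees: expenses $123,345 why not -2345 hey.'

# Remove thousands separators first.
no_commas = sntc.replace(',', '')

# The broad character class also picks up the sentence-ending period.
loose = re.findall(r'[0-9,.-]+', no_commas)
print(loose)  # ['5678', '123345', '-2345', '.']

# An optional minus sign followed by digits only, then cast to int.
numbers = [int(s) for s in re.findall(r'-?\d+', no_commas)]
print(numbers)  # [5678, 123345, -2345]
```

If decimal amounts can appear, the pattern would need to change, for example to allow an optional fractional part.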
