Python split with multiple delimiters not working

Question

I have a string:

feature.append(freq_and_feature(text, freq))

I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.

These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, such as the accepted answer to this one (Python: Split string with multiple delimiters):

import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print(re.split('\' .,()_', row))

However, I get the following, which is not what I want.

['    feature.append(freq_and_feature(text, freq))\n']

Probably because the string described by the pattern doesn't exist. read a tutorial before, ask after. — Casimir et Hippolyte, Apr 12 '16 at 17:05
The first argument to `re.split()` must be a correct [regular expression](https://docs.python.org/3/library/re.html#regular-expression-syntax) that matches the delimiter. It is not a string where any character is a single-character delimiter. — dsh, Apr 12 '16 at 17:06
Usually I just split by space characters so I'm not quite sure how to do this (split by multiple delimiters). — Jobs, Apr 12 '16 at 17:08
Are you just going to split with non-word and non-underscore characters? Try [this solution](http://ideone.com/WpyPvU) — Wiktor Stribiżew, Apr 12 '16 at 17:28

score 4 · Accepted Answer · answered Apr 12 '16 at 17:04


re.split('\' .,()_', row)

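This looks for the whole sequence ' .,()_ (an apostrophe, a space, any one character, a comma, an empty group, and an underscore, in that order) to split on. You probably meant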

re.split('[\' .,()_]', row)

re.split takes a regular expression as its first argument. To say "this OR that" in a regular expression, you can write a|b, which matches either a or b. If you wrote ab, it would only match a followed by b. Rather than writing out a long alternation such as '| |\.|,|\(|... (where . and ( would also need escaping), you can put the characters inside [] to say "match any one of these".
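A minimal sketch of the difference, using the sample line from the question without its leading whitespace (the variable names are illustrative and not from the original post):

```python
import re

row = "feature.append(freq_and_feature(text, freq))\n"

# Without brackets the pattern is one sequence that never occurs in row,
# so re.split returns the whole line as a single item.
unsplit = re.split('\' .,()_', row)

# With brackets, any single listed character acts as a delimiter.
# Adjacent delimiters such as ", " or "))" leave empty strings behind.
parts = re.split('[\' .,()_]', row)

print(unsplit)
print(parts)
```

The empty strings in the second result come from consecutive delimiters, which is the issue raised in the comments on this answer.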

answered Apr 12 '16 at 17:04

Justin


Okay, when I do what you suggested, I get the following: ['', '', '', '', 'feature', 'append', 'freq', 'and', 'feature', 'text', '', 'freq', '', '\n'] – Jobs Apr 12 '16 at 17:11
This works, but needs a + added to prevent the empty strings between adjacent delimiters in your results: [\' .,()_]+ – Scott Weaver Apr 12 '16 at 17:55
Oh I see. This indeed works and is what I ended up using. Thanks. – Jobs Apr 13 '16 at 05:43
@Jobs If this is what you used, perhaps you should accept it. But if another answer helped you more, please accept that. – Justin Apr 13 '16 at 15:48
Accepting this one because this was the first answer and so what I ended up using. All of the answers are very informative - thanks. – Jobs Apr 13 '16 at 16:09
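A sketch of the loop from the question with the + quantifier applied. It assumes each line may carry leading whitespace and a trailing newline, and it uses an in-memory io.StringIO as a stand-in for helper.txt; the names DELIMITERS and split_words are hypothetical.

```python
import io
import re

# One or more of the delimiter characters in a row count as a single split point.
DELIMITERS = re.compile('[\' .,()_]+')

def split_words(row):
    # strip() removes surrounding whitespace and the newline; the filter
    # drops the empty string left when a line starts or ends with a delimiter.
    return [w for w in DELIMITERS.split(row.strip()) if w]

# Stand-in for open("helper.txt"), so the example runs without a file.
helper = io.StringIO("    feature.append(freq_and_feature(text, freq))\n")
for row in helper:
    print(split_words(row))
```

The + alone only merges adjacent delimiters. A line that ends in "))" still yields one trailing empty string, which is why the filter is kept.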

Wiktor Stribiżew · Answer 2 · 2016-04-12T17:40:07.783

4

It seems you want to split a string on non-word or underscore characters. Use

import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']

See the IDEONE demo

The [\W_]+ regex matches one or more characters that are either non-word characters (\W, equivalent to [^a-zA-Z0-9_] for ASCII text) or underscores.

You can drop the if x filter if you first remove leading and trailing non-word characters from the input string, e.g. with re.sub(r'^[\W_]+|[\W_]+$', '', s).
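The trimming step can be sketched as follows. The trailing newline on s is an assumption added here to mimic a line read from a file, and the names trimmed and words are illustrative.

```python
import re

s = 'feature.append(freq_and_feature(text, freq))\n'

# Remove runs of non-word characters or underscores at either end first...
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)

# ...so the split cannot produce empty strings at the start or end.
words = re.split(r'[\W_]+', trimmed)
print(words)
```

One caveat: if the input contains no word characters at all, trimmed is an empty string and re.split returns [''], so the filter may still be the safer choice for arbitrary input.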

edited Apr 12 '16 at 17:40

answered Apr 12 '16 at 17:32

Wiktor Stribiżew


I see. Thank you so much! – Jobs Apr 13 '16 at 05:47

rock321987 · Answer 3 · 2016-04-13T05:52:55.717

1

I think you are trying to split on non-alphanumeric characters. It should be

re.split(r'[^A-Za-z0-9]+', s)

[^A-Za-z0-9] is equivalent to [\W_] for ASCII input.

Python code:

import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])

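Alternatively, you can match the words directly instead of splitting on the delimiters:

import re
p = re.compile(r'[^\W_]+')  # one or more word characters, excluding underscore
test_str = "feature.append(freq_and_feature(text, freq))"
print(p.findall(test_str))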

Ideone Demo
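As a quick check, the two approaches in this answer can be compared side by side. This sketch only claims agreement for ASCII input: in Python 3, [^\W_] also matches non-ASCII letters and digits, which [^A-Za-z0-9] treats as delimiters. The names by_split and by_findall are illustrative.

```python
import re

s = 'feature.append(freq_and_feature(text, freq))'

# Approach 1: split on runs of delimiters, then drop empty strings.
by_split = [x for x in re.split(r'[^A-Za-z0-9]+', s) if x]

# Approach 2: match runs of word characters (excluding underscore) directly.
by_findall = re.findall(r'[^\W_]+', s)

print(by_split == by_findall)
```

With findall there is nothing to filter, because only non-empty runs of matching characters are returned.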

edited Apr 13 '16 at 05:52

answered Apr 12 '16 at 17:40

rock321987


@Jobs You can use `re.findall()` too for this – rock321987 Apr 13 '16 at 05:48

cjahangir · Answer 4 · 2016-04-12T18:11:04.173

1

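You can try this:

import re
row = 'feature.append(freq_and_feature(text, freq))\n'  # sample line, assumed to have no leading whitespace
parts = re.split('[.(_,)]+', row)  # named parts to avoid shadowing the built-in str
parts.pop()  # drop the trailing '\n' element
print(parts)

This results in:

['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']

Note that ' freq' keeps its leading space because the space character is not in the character class.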

edited Apr 12 '16 at 18:11

answered Apr 12 '16 at 18:00

cjahangir

