
I have a string:

feature.append(freq_and_feature(text, freq))

I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.

These strings are stored in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, such as the accepted answer to this one (Python: Split string with multiple delimiters):

import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print(re.split('\' .,()_', row))

However, I get the following, which is not what I want.

['    feature.append(freq_and_feature(text, freq))\n']
Jobs
  • Probably because the string described by the pattern doesn't exist. Read a tutorial first, ask after. – Casimir et Hippolyte Apr 12 '16 at 17:05
  • The first argument to `re.split()` must be a correct [regular expression](https://docs.python.org/3/library/re.html#regular-expression-syntax) that matches the delimiter. It is not a string where any character is a single-character delimiter. – dsh Apr 12 '16 at 17:06
  • Can I do this instead with the regular str.split(), then? – Jobs Apr 12 '16 at 17:07
  • Usually I just split by space characters so I'm not quite sure how to do this (split by multiple delimiters). – Jobs Apr 12 '16 at 17:08
  • Are you just going to split with non-word and non-underscore characters? Try [this solution](http://ideone.com/WpyPvU) – Wiktor Stribiżew Apr 12 '16 at 17:28
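
On the str.split() question raised in the comments: str.split() only accepts a single separator, but one possible workaround is to map every delimiter to a space first and then split on whitespace. The following is a minimal sketch under that assumption; the names row, delimiters, and table are illustrative and not from the original post, and it relies on Python 3's str.maketrans.

```python
# Sample row as it would be read from helper.txt (assumed, based on the question).
row = "    feature.append(freq_and_feature(text, freq))\n"

# Every character listed here is treated as a single-character delimiter.
delimiters = "'.,()_"

# Build a translation table that turns each delimiter into a space.
table = str.maketrans(delimiters, " " * len(delimiters))

# split() with no argument splits on runs of whitespace and drops empty pieces.
words = row.translate(table).split()
print(words)
```

This avoids regular expressions entirely, at the cost of only supporting single-character delimiters.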

4 Answers

re.split('\' .,()_', row)

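This treats the whole pattern as one sequence to split on: a quote, a space, any character (the unescaped .), a comma, an empty group (), and an underscore. That sequence never occurs in the row, so nothing is split. You probably meant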

re.split('[\' .,()_]', row)

re.split takes a regular expression as its first argument. To say "this OR that" in a regular expression, you write a|b, which matches either a or b; writing ab matches only a followed by b. Rather than spelling out every alternative ('| |\.|,|\(|..., with each metacharacter escaped), you can put the characters inside [] to say "match any one of these".
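
To make the difference concrete, here is a small sketch comparing the three patterns. The sample row is assumed from the question, and the variable names are only illustrative.

```python
import re

# Sample row as it would be read from helper.txt (assumed, based on the question).
row = "    feature.append(freq_and_feature(text, freq))\n"

# Original pattern: read as one sequence, so it never matches and nothing is split.
unsplit = re.split('\' .,()_', row)

# Alternation: works, but ., ( and ) must each be escaped by hand.
by_alternation = re.split(r"'| |\.|,|\(|\)|_", row)

# Character class: same delimiters, no escaping needed inside [].
by_class = re.split(r"[' .,()_]", row)

print(unsplit)
print(by_alternation == by_class)
print(by_class)
```

Both working forms split at every single delimiter character, which is why adjacent delimiters still produce empty strings in the result.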

Justin
  • Okay, when I do what you suggested, I get the following: ['', '', '', '', 'feature', 'append', 'freq', 'and', 'feature', 'text', '', 'freq', '', '\n'] – Jobs Apr 12 '16 at 17:11
  • This works, but needs a + added to prevent the empty strings in your results: [\' .,()_]+ – Scott Weaver Apr 12 '16 at 17:55
  • Oh I see. This indeed works and is what I ended up using. Thanks. – Jobs Apr 13 '16 at 05:43
  • @Jobs If this is what you used, perhaps you should accept it. But if another answer helped you more, please accept that. – Justin Apr 13 '16 at 15:48
  • Accepting this one because this was the first answer and so what I ended up using. All of the answers are very informative - thanks. – Jobs Apr 13 '16 at 16:09
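
For reference, the + fix from the comments could look like the sketch below. The sample row is assumed from the question; the final filter is my own addition to drop the pieces left at the edges (the empty string before the leading spaces and the trailing newline), not something the comments spell out.

```python
import re

# Sample row as it would be read from helper.txt (assumed, based on the question).
row = "    feature.append(freq_and_feature(text, freq))\n"

# '+' makes a run of adjacent delimiters count as a single split point.
parts = re.split(r"[' .,()_]+", row)

# The edges still leave '' (before the leading spaces) and '\n' (after '))').
# Keep only the pieces that contain something other than whitespace.
words = [p for p in parts if p.strip()]
print(words)
```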

It seems you want to split a string on non-word or underscore characters. Use

import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']

See the IDEONE demo

The [\W_]+ regex matches one or more characters that are either non-word characters (\W, which is [^a-zA-Z0-9_] for ASCII text) or underscores.

You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
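
A minimal sketch of that trim-then-split variant, assuming an input with leading spaces and a trailing newline like the rows in the question (the names s, trimmed, and words are illustrative):

```python
import re

# Input with delimiters at both ends, as a row read from a file might have.
s = "    feature.append(freq_and_feature(text, freq))\n"

# Strip leading and trailing runs of non-word characters and underscores.
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)

# With clean edges, the split no longer produces empty strings to filter out.
words = re.split(r'[\W_]+', trimmed)
print(words)
```

One caveat: if the input consists only of delimiters, trimmed is empty and re.split returns [''], so the if x filter is still the safer choice for arbitrary input.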

Wiktor Stribiżew

I think you are trying to split on non-alphanumeric characters. The call should be

re.split(r'[^A-Za-z0-9]+', s)

For ASCII text, [^A-Za-z0-9] is equivalent to [\W_].

Python Code

import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])

Matching the words directly with re.findall also works:

p = re.compile(r'[^\W_]+')  # one or more word characters other than underscore
test_str = "feature.append(freq_and_feature(text, freq))"
print(p.findall(test_str))

Ideone Demo

rock321987

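You can try this

import re
row = 'feature.append(freq_and_feature(text, freq))\n'
words = re.split('[.(_,)]+', row)
words.pop()  # drop the trailing '\n' left after the final '))'
print(words)

This results in the following (note that ' freq' keeps its leading space, because the space character is not in the character class):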

['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
cjahangir