I'm trying to split a string using a regular expression in Python and get all the matched tokens.

RE: \w+(\.?\w+)*

This needs to capture only `[a-zA-Z0-9_]`-like characters.

Here is an example.

But when I try to match and get all the contents from the string, it doesn't return the proper results.

Code snippet:

>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !@#$%^&*()-+=[]{}.,;:'"`| \(`.`)/
... 
... I guess that's it."""
>>> pprint(re.findall(r"\w+(.?\w+)*", string))
[' etc', ' well', ' same', ' wait', ' like', ' it']

It only returns some of the words, but it should return all the words, numbers and underscore(s) [as in the linked example].

Python version: Python 3.6.2 (default, Jul 17 2017, 16:44:45)

Thanks.

Mubin
  • Use `re.findall(r"\w+(?:\.?\w+)*", string)`. If you only need ASCII, pass the `re.A` flag so that `\w` only matches ASCII letters, digits and `_`. See [demo](https://ideone.com/2sLrjV). If you need to match letters only, replace `\w` with `[^\W\d_]`. Note that what you wrote at the beginning is different from what you used in the code. – Wiktor Stribiżew Sep 02 '17 at 18:03
  • Great, thanks. I've used the same regex (`\w+(.?\w+)*`) in Java and it works fine. Can you please point out the difference as well? That would be great. – Mubin Sep 02 '17 at 18:08
  • Well, you must escape the dot and use a non-capturing group. You do not need the outer capturing parentheses. – Wiktor Stribiżew Sep 02 '17 at 18:09
  • `re.findall('\w+', string)` works as expected, for me. – Zach Gates Sep 02 '17 at 18:11
  • thanks a million @WiktorStribiżew, you're awesome. – Mubin Sep 02 '17 at 18:11
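
On the Java question raised in the comments: Java's `Matcher.group()` returns the whole match, while Python's `re.findall` returns the group's contents when the pattern contains exactly one capturing group. How the Java code in question actually read its matches is not shown, so the following is only a sketch of the Python side. The sample `text` and variable names are made up for illustration.

```python
import re

# Hypothetical sample input, not taken from the question.
text = "foo.bar baz_1 etc."

# The pattern as written at the top of the question: dot escaped, group capturing.
capturing = r"\w+(\.?\w+)*"

# With exactly one capturing group, findall returns that group's last capture
# (an empty string when the group did not take part in the match).
group_only = re.findall(capturing, text)

# finditer yields match objects; group(0) is always the whole match,
# which mirrors what Java's Matcher.group() gives you.
whole_matches = [m.group(0) for m in re.finditer(capturing, text)]

print(group_only)     # ['.bar', '', '']
print(whole_matches)  # ['foo.bar', 'baz_1', 'etc']
```

Switching the group to `(?:...)` makes `re.findall` return the whole matches directly, so `re.finditer` is only needed if you want to keep the capturing group.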

1 Answer

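You need to use a non-capturing group (when a pattern contains a capturing group, `re.findall` returns the group's contents instead of the whole match) and escape the dot (an unescaped `.` matches any character except a newline, including spaces):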

>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(?:\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !@#$%^&*()-+=[]{}.,;:'"`| \(`.`)/
... 
... I guess that's it."""
>>> pprint(re.findall(pattern, string, re.A))
['this', 'is', 'some', 'test', 'string', 'and', 'there', 'are', 'some', 'digits', 'as', 'well', 'that', 'need', 'to', 'be', 'captured', 'as', 'well', 'like', '1234567890', 'and', '321', 'etc', 'But', 'it', 'should', 'also', 'select', '_', 'as', 'well', 'I', 'm', 'pretty', 'sure', 'that', 'that', 'RE', 'does', 'exactly', 'the', 'same', 'Oh', 'wait', 'it', 'also', 'need', 'to', 'filter', 'out', 'the', 'symbols', 'like', 'I', 'guess', 'that', 's', 'it']
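
To see what escaping the dot changes on its own, here is a small sketch on a made-up string (the `text` value and variable names are illustrative, not from the question). Both patterns use a non-capturing group, so the only difference is `.?` versus `\.?`.

```python
import re

# Hypothetical sample input.
text = "ab cd, ef v1.2"

# Unescaped dot: '.?' may consume a space, so one match can span several words.
unescaped = re.findall(r"\w+(?:.?\w+)*", text)

# Escaped dot: only a literal '.' may join two runs of word characters.
escaped = re.findall(r"\w+(?:\.?\w+)*", text)

print(unescaped)  # ['ab cd', 'ef v1.2']
print(escaped)    # ['ab', 'cd', 'ef', 'v1.2']
```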

Also, to match only ASCII letters, digits and `_`, you must pass the `re.A` flag, because in Python 3 `\w` matches Unicode word characters by default for `str` patterns.
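
A minimal sketch of the flag's effect, plus the letters-only class `[^\W\d_]` suggested in the comments. The sample strings and variable names are assumptions chosen for illustration.

```python
import re

# Hypothetical sample input containing non-ASCII letters ("naïve café 42").
text = "na\u00efve caf\u00e9 42"

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_words = re.findall(r"\w+", text)

# With re.A (re.ASCII), \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.A)

# Letters only: a word character that is neither a digit nor an underscore.
letters_only = re.findall(r"[^\W\d_]+", "abc_123 d\u00e9f")

print(unicode_words)  # ['naïve', 'café', '42']
print(ascii_words)    # ['na', 've', 'caf', '42']
print(letters_only)   # ['abc', 'déf']
```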

See the Python demo.

Zach Gates
Wiktor Stribiżew