I'm getting weird results when I pass re.DOTALL to re.finditer() in Python 3.6. I don't know whether this is the expected behavior, whether I'm missing something, or whether it's a bug.

CASE 1

I try this version of a string with an embedded newline.

I expect to get 2 matched values back: m1 = 'abc' and m2 = ' de'

import re
result = re.finditer('.*', 'abc\n de', flags=0)
m1 = next(result)
#    <_sre.SRE_Match object; span=(0, 3), match='abc'>
m2 = next(result)
#    <_sre.SRE_Match object; span=(3, 3), match=''>
m3 = next(result)
#    <_sre.SRE_Match object; span=(4, 7), match=' de'>
m4 = next(result)
#    <_sre.SRE_Match object; span=(7, 7), match=''>

What's with the match values m2 and m4?

CASE 2

I try this with re.DOTALL, and I expect to get back one match, m1 = 'abc\n de'

result = re.finditer('.*', 'abc\n de', flags=re.DOTALL)
m1 = next(result)
#     <_sre.SRE_Match object; span=(0, 7), match='abc\n de'>
m2 = next(result)
#     <_sre.SRE_Match object; span=(7, 7), match=''>

What's with the extra matches? How do I make the results work as expected?

I want the first case to return ...

m1 = 'abc'
m2 = ' de'

... and the second case to return

m1 = 'abc\n de'

and nothing else.

P Moran

1 Answer

Your pattern is

.*

This means "match zero or more characters"; zero-width matches are permitted.

In your first case, m2 and m4 exist because the pattern stops matching at the newline (without DOTALL, . does not match '\n'), and the engine then tries to find a new match starting at that position (index 3). No characters can be matched there, but .* still succeeds with an empty match, hence the first match has

span=(0, 3)

and the second match has

span=(3, 3)

The same thing is happening for span=(7, 7) in m4 and in your DOTALL code.
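
To see every zero-width match at a glance, here is a small sketch that lists the span and text of each match for both of your cases. The helper name list_spans is just something made up for illustration, not part of the re module. Any entry whose span has start == end is an empty match, and the spans should line up with the ones in your output.

```python
import re

def list_spans(pattern, text, flags=0):
    """Return (span, matched_text) for every match that re.finditer yields."""
    return [(m.span(), m.group()) for m in re.finditer(pattern, text, flags=flags)]

text = 'abc\n de'

# Case 1: '.' stops at the newline, so '.*' also matches empty at index 3
default_matches = list_spans('.*', text)

# Case 2: '.' matches the newline too, leaving only the empty match at the end
dotall_matches = list_spans('.*', text, flags=re.DOTALL)

for label, matches in (('default', default_matches), ('DOTALL', dotall_matches)):
    for span, matched in matches:
        kind = 'empty' if span[0] == span[1] else 'non-empty'
        print(label, span, repr(matched), kind)
```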

It sounds like you want a match only if there's at least one character - repeat with + rather than *:

re.finditer('.+', 'abc\n de')
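
As a quick check, here is a minimal sketch applying .+ to both of your cases. The variable names per_line and whole are only illustrative.

```python
import re

text = 'abc\n de'

# Case 1: without DOTALL, '.+' matches each run of non-newline characters
per_line = [m.group() for m in re.finditer('.+', text)]

# Case 2: with DOTALL, '.' also matches '\n', so one match covers the whole string
whole = [m.group() for m in re.finditer('.+', text, flags=re.DOTALL)]

print(per_line)
print(whole)
```

If you have to keep .* for some reason, another option would be to skip the empty matches yourself, for example with a comprehension that keeps only matches where m.group() is non-empty.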
CertainPerformance