I am using re.sub to remove certain parts of a text. There are supposed to be multiple matches, but the sub function only seems to replace one occurrence per execution. What is going on?

import re
import requests

r = requests.get('https://www.sec.gov/Archives/edgar/data/66740/000155837018000535/0001558370-18-000535.txt')
text = r.content.decode()
# Raw string so that \s and \S reach the regex engine without invalid-escape warnings
reg = re.compile(r'<DOCUMENT>\n<TYPE>(XML|GRAPHIC|ZIP|EXCEL|PDF)[\s\S]*?(</DOCUMENT>)')

re.findall(reg, text) 

``
output: [('GRAPHIC', '</DOCUMENT>'),
 ('GRAPHIC', '</DOCUMENT>'),
 ('XML', '</DOCUMENT>'),
 ('XML', '</DOCUMENT>'),...]
``

for i in range(10):
    text = re.sub(reg, '', text, re.MULTILINE)
    print(len(text))
``
output: 41875141
40950114
37558399
36097349
34776527
``

In the first code block, I download the txt file and run findall, which shows multiple occurrences in the file. But when I use re.sub, it only seems to replace one occurrence per call.

EDIT

It seems that adding the re.MULTILINE flag prevents the full replacement. Is there a way around this?

JOHN
  • @EvgenyPogrebnyak I don't think so. If you do a findall() on his example, you only get one match. But in this example I get multiple matches, and sub() is still not working properly. – JOHN May 29 '18 at 21:28
  • You basically set `count` to a non-zero value, which prevents sub() from replacing all occurrences, I think – Evgeny May 29 '18 at 21:43
  • As @EvgenyPogrebnyak implied, it should be `flags=re.MULTILINE` in the `re.compile`, not the `re.sub`. – cdarke May 29 '18 at 21:45
  • with @cdarke: `text = re.sub(reg, '', text, count=0, flags=re.MULTILINE)` – Evgeny May 29 '18 at 21:46
  • @EvgenyPogrebnyak: I edited my comment, you can't specify flags in `re.sub` with a compiled RE. – cdarke May 29 '18 at 21:48
  • @EvgenyPogrebnyak: your suggestion will give this error: cannot process flags argument with a compiled pattern – JOHN May 29 '18 at 21:49
  • The prize goes to @cdarke ;) – Evgeny May 29 '18 at 21:50
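
The error mentioned in the comments above can be reproduced with a small sketch. The sample string and the shortened pattern are made up for illustration and are not taken from the SEC filing.

```python
import re

# Hypothetical, shortened pattern and sample text, used only for illustration.
reg = re.compile(r'<TYPE>(XML|GRAPHIC)')
sample = "<TYPE>XML <TYPE>GRAPHIC"

try:
    # Passing flags together with an already compiled pattern is rejected.
    re.sub(reg, '', sample, count=0, flags=re.MULTILINE)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Calling `reg.sub('', sample)` directly avoids the problem, since the method on a compiled pattern has no flags parameter at all.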

1 Answer

re.MULTILINE should be passed with the flags keyword. The position you chose is the count parameter of re.sub, i.e. the maximum number of matches to replace. Since re.MULTILINE has the integer value 8, each call replaces at most 8 matches.
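
A minimal sketch of that effect, using a synthetic text of 20 short documents rather than the real filing (the sample data is an assumption made for illustration):

```python
import re

# Synthetic stand-in for the filing: 20 identical XML documents.
sample = "<DOCUMENT>\n<TYPE>XML\nbody\n</DOCUMENT>\n" * 20
reg = re.compile(r'<DOCUMENT>\n<TYPE>(XML|GRAPHIC|ZIP|EXCEL|PDF)[\s\S]*?(</DOCUMENT>)')

# The flag is an integer constant with the value 8.
flag_value = int(re.MULTILINE)

# The fourth positional argument of re.sub is count, so this call replaces
# at most 8 matches. Recent Python versions may also emit a
# DeprecationWarning for passing count positionally.
capped = re.sub(reg, '', sample, re.MULTILINE)
remaining_after_capped = len(reg.findall(capped))

# Without the extra argument, count defaults to 0 and every match is replaced.
full = re.sub(reg, '', sample)
remaining_after_full = len(reg.findall(full))

print(flag_value, remaining_after_capped, remaining_after_full)
```

This would also explain why the length in the question keeps shrinking on every loop iteration: each pass removes up to 8 more documents.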

However, with a compiled RE you cannot pass flags to re.sub at all. Specify flags=re.MULTILINE in re.compile instead:

reg = re.compile(r'<DOCUMENT>\n<TYPE>(XML|GRAPHIC|ZIP|EXCEL|PDF)[\s\S]*?(</DOCUMENT>)', flags=re.MULTILINE)
cdarke
  • Let alone the question being a dupe, why do you suggest using `re.MULTILINE` with the `'<DOCUMENT>\n<TYPE>(XML|GRAPHIC|ZIP|EXCEL|PDF)[\s\S]*?(</DOCUMENT>)'` pattern, which has neither `^` nor `$`? The real solution is to remove the argument altogether. – Wiktor Stribiżew May 29 '18 at 23:25
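
The point of the last comment can be checked with a short sketch. The sample text below is invented for illustration; the pattern is the one from the question. Because the pattern contains neither `^` nor `$`, compiling it with or without `re.MULTILINE` should give the same result, and simply dropping the extra argument removes every match in one call.

```python
import re

pattern = r'<DOCUMENT>\n<TYPE>(XML|GRAPHIC|ZIP|EXCEL|PDF)[\s\S]*?(</DOCUMENT>)'

# Invented sample: one document to keep, followed by 12 documents to strip.
sample = (
    "<DOCUMENT>\n<TYPE>10-K\nkeep me\n</DOCUMENT>\n"
    + "<DOCUMENT>\n<TYPE>GRAPHIC\nbinary data\n</DOCUMENT>\n" * 12
)

plain = re.compile(pattern)
multiline = re.compile(pattern, flags=re.MULTILINE)

# No count and no flags argument: all matches are replaced in a single call.
cleaned_plain = plain.sub('', sample)
cleaned_multiline = multiline.sub('', sample)

same_result = cleaned_plain == cleaned_multiline
remaining = len(plain.findall(cleaned_plain))
kept_main_document = "keep me" in cleaned_plain

print(same_result, remaining, kept_main_document)
```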