Python regex sub does not replace all occurrences

Question

I am using re.sub to remove certain parts of a text. There are supposed to be multiple matches, but the sub function seems to replace only one occurrence per call. What is going on?

import re
import requests

r = requests.get('https://www.sec.gov/Archives/edgar/data/66740/000155837018000535/0001558370-18-000535.txt')
text = r.content.decode()
reg = re.compile('<DOCUMENT>\n<TYPE>(XML|GRAPHIC|ZIP|EXCEL|PDF)[\s\S]*?(</DOCUMENT>)')

re.findall(reg, text) 

``
output: [('GRAPHIC', '</DOCUMENT>'),
 ('GRAPHIC', '</DOCUMENT>'),
 ('XML', '</DOCUMENT>'),
 ('XML', '</DOCUMENT>'),...]
``

for i in range(10):
    text = re.sub(reg, '', text, re.MULTILINE)
    print(len(text))
``
output: 41875141
40950114
37558399
36097349
34776527
``

In the first code block, I download the text file and run findall, which shows that there are multiple occurrences in the file. But when I use re.sub, it seems to replace only one occurrence.

EDIT

It seems that adding the re.MULTILINE flag prevents the full replacement. Is there a way around this?

@EvgenyPogrebnyak I don't think so. If you run findall() on his example, you only get one match. In this example I get multiple matches, but sub() is not working properly. – JOHN, May 29 '18 at 21:28
You basically set `count` to a non-zero value, which prevents sub() from replacing all occurrences, I think. – Evgeny, May 29 '18 at 21:43
As @EvgenyPogrebnyak implied, it should be `flags=re.MULTILINE` in the `re.compile`, not the `re.sub`. – cdarke, May 29 '18 at 21:45
With @cdarke: `text = re.sub(reg, '', text, count=0, flags=re.MULTILINE)` – Evgeny, May 29 '18 at 21:46
@EvgenyPogrebnyak: I edited my comment; you can't specify flags in `re.sub` with a compiled RE. – cdarke, May 29 '18 at 21:48
@EvgenyPogrebnyak: your suggestion gives this error: cannot process flags argument with a compiled pattern – JOHN, May 29 '18 at 21:49

cdarke · Accepted Answer · 2018-05-29T21:57:51.723


re.MULTILINE should be passed with the flags keyword. The position you chose is actually the count parameter, the maximum number of matches to replace. Since re.MULTILINE has the integer value 8, each call replaces at most 8 matches.

However, with a compiled RE you cannot pass flags to re.sub at all. Specify flags=re.MULTILINE in re.compile instead:

reg = re.compile(r'<DOCUMENT>\n<TYPE>(XML|GRAPHIC|ZIP|EXCEL|PDF)[\s\S]*?(</DOCUMENT>)', flags=re.MULTILINE)
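
To make the count/flags mix-up concrete, here is a minimal sketch. The sample string and the `a\d+` pattern are made up for illustration and stand in for the filing text and the real pattern from the question. It shows that a positional `re.MULTILINE` caps the replacements at 8, and that passing `flags=` together with a compiled pattern raises the error quoted in the comments.

```python
import re

# Hypothetical sample text with 12 matches, standing in for the SEC filing.
sample = "a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12"
pattern = re.compile(r"a\d+")

# re.MULTILINE is an integer-valued flag.
assert int(re.MULTILINE) == 8

# The fourth positional argument of re.sub is `count`, so this replaces
# at most 8 matches per call instead of setting a flag.
limited = re.sub(pattern, "", sample, re.MULTILINE)
remaining_limited = pattern.findall(limited)

# Without the stray argument, every match is replaced in one call.
full = re.sub(pattern, "", sample)
remaining_full = pattern.findall(full)

# Passing flags alongside an already compiled pattern raises ValueError.
try:
    re.sub(pattern, "", sample, flags=re.MULTILINE)
    flags_error = None
except ValueError as exc:
    flags_error = str(exc)

print(remaining_limited)
print(remaining_full)
print(flags_error)
```

This also explains why the loop in the question kept shrinking the text on every pass: each call removed up to 8 more documents. As far as I know, recent Python versions emit a DeprecationWarning when count is passed positionally, which helps catch this kind of mistake.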

edited May 29 '18 at 21:57

answered May 29 '18 at 21:52


Let alone the question being a dupe, why do you suggest using `re.MULTILINE` with the `'<DOCUMENT>\n<TYPE>(XML|GRAPHIC|ZIP|EXCEL|PDF)[\s\S]*?(</DOCUMENT>)'` pattern, which has neither `^` nor `$`? The real solution is to remove the argument altogether. – Wiktor Stribiżew May 29 '18 at 23:25
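
A small sketch of that last point: `re.MULTILINE` only changes how `^` and `$` behave, so for this pattern it should make no difference whether the flag is set. The three-document sample below is invented for illustration and is not taken from the actual filing.

```python
import re

# Hypothetical miniature of the filing layout from the question.
sample = (
    "<DOCUMENT>\n<TYPE>GRAPHIC\nbinary junk\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>10-K\nkeep me\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>XML\nmore junk\n</DOCUMENT>\n"
)
body = r"<DOCUMENT>\n<TYPE>(XML|GRAPHIC|ZIP|EXCEL|PDF)[\s\S]*?(</DOCUMENT>)"

# Same pattern compiled with and without the flag.
plain = re.compile(body)
multiline = re.compile(body, flags=re.MULTILINE)

# Calling .sub on the compiled pattern with no extra arguments
# replaces every match in a single pass.
cleaned_plain = plain.sub("", sample)
cleaned_multiline = multiline.sub("", sample)

print(cleaned_plain == cleaned_multiline)
print(cleaned_plain)
```

Both versions strip the GRAPHIC and XML documents and keep the 10-K one, so dropping the extra argument from the original `re.sub` call is enough.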
