I am using `re.sub` to remove certain parts of a text. There are supposed to be multiple matches, but `re.sub` only seems to replace one occurrence per call. What is going on?
```
import re
import requests

r = requests.get('https://www.sec.gov/Archives/edgar/data/66740/000155837018000535/0001558370-18-000535.txt')
text = r.content.decode()
reg = re.compile('<DOCUMENT>\n<TYPE>(XML|GRAPHIC|ZIP|EXCEL|PDF)[\s\S]*?(</DOCUMENT>)')
re.findall(reg, text)
```

Output:

```
[('GRAPHIC', '</DOCUMENT>'),
 ('GRAPHIC', '</DOCUMENT>'),
 ('XML', '</DOCUMENT>'),
 ('XML', '</DOCUMENT>'),...]
```

```
for i in range(10):
    text = re.sub(reg, '', text, re.MULTILINE)
    print(len(text))
```

Output:

```
41875141
40950114
37558399
36097349
34776527
```
In the first code block, I download the text file and run `findall`; there are multiple occurrences in the file. But when I use `re.sub`, it only seems to replace one occurrence per call.
EDIT
It seems that adding the `re.MULTILINE` flag prevents the full replacement. Is there a way to get around this?
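A likely explanation, sketched offline: the signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a fourth positional argument is read as `count`, not `flags`. Since `re.MULTILINE` has the integer value 8, each call would then stop after at most 8 replacements. The `sample` string and the `strip_documents` helper below are hypothetical stand-ins for the downloaded filing, not part of the original code.

```python
import re

# Hypothetical stand-in for the filing: ten removable documents plus one to keep.
sample = "".join(
    f"<DOCUMENT>\n<TYPE>GRAPHIC\npayload {i}\n</DOCUMENT>\n" for i in range(10)
) + "<DOCUMENT>\n<TYPE>10-K\nkeep me\n</DOCUMENT>\n"

# Same pattern as in the question, written as a raw string.
reg = re.compile(r'<DOCUMENT>\n<TYPE>(XML|GRAPHIC|ZIP|EXCEL|PDF)[\s\S]*?(</DOCUMENT>)')

# re.sub(reg, '', text, re.MULTILINE) puts the flag into the `count` slot.
# Spelled with the keyword here to make that explicit: count == 8.
limited = re.sub(reg, '', sample, count=re.MULTILINE)
print(int(re.MULTILINE))               # 8
print(limited.count('<TYPE>GRAPHIC'))  # 2 of the 10 documents are left


def strip_documents(text):
    """Remove every matching document in one pass (count defaults to 0 = all)."""
    return reg.sub('', text)


cleaned = strip_documents(sample)
print(cleaned.count('<TYPE>GRAPHIC'))  # 0
```

This pattern has no `^` or `$` anchors, so `re.MULTILINE` would not change what it matches anyway. If flags are needed, they can be passed to `re.compile`, or by keyword (`flags=...`) when the pattern is given as a string.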