First of all: I know you shouldn't use regex to parse HTML, because there are plenty of good parsers out there, such as BS4 or lxml. However, since my HTML files are not well written, I cannot search for a tag or anything similar to extract the text I need.
My code partly works the way I want it to, but it seems to stop at a random point after it matches my regex.
import glob
import os
import re
import contextlib


@contextlib.contextmanager
def stdout2file(fname):
    # Temporarily redirect everything printed to stdout into the file fname.
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()


def extractor():
    os.chdir(r"F:\Test")  # the directory containing the HTML files
    with stdout2file("Test.txt"):
        for file in glob.iglob("*.html"):  # iterates over all files in the directory ending in .html
            with open(file, encoding="utf8") as f:
                contents = f.read()
                regex = re.compile(r'The interesting part begins.*?and for sure there is no other reason', re.S)
                extract = regex.search(contents)
                if re.search(regex, contents) is not None:
                    print(file.split(os.path.sep)[-1], end="| ")
                    print(extract)


extractor()
The output I get is:
Thisismyfirsthtmlfile.html| <_sre.SRE_Match object; span=(75794, 93039), match='The interesting part begins with the notes of Secul>
So it stops suddenly in the middle of a word, at Secul (the complete word would be Secular). After trying different solutions for a few hours, I still cannot figure out what I might have missed.
Any idea on that?
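
One possible explanation, offered as an assumption rather than a confirmed diagnosis: print(extract) prints the match object itself, and the printed representation of a match object abbreviates long matches. The minimal sketch below uses a made-up contents string (a hypothetical stand-in for one of the HTML files) to compare that representation with the full text returned by match.group().

```python
import re

# Hypothetical sample text standing in for the contents of one HTML file.
contents = (
    "header junk The interesting part begins with the notes of Secular "
    "writers, continues for a while, "
    "and for sure there is no other reason trailing junk"
)

regex = re.compile(
    r"The interesting part begins.*?and for sure there is no other reason",
    re.S,
)
match = regex.search(contents)

shown = repr(match)        # what print(match) displays: an abbreviated summary
full_text = match.group()  # the complete matched string

print(shown)
print(full_text)
```

If this is what is happening here, the regex did match the whole passage, which the span=(75794, 93039) in the output above also suggests, and printing extract.group() instead of extract would write the full text to Test.txt.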