First of all: I know you shouldn't use regex to parse HTML, because there are plenty of good parsers out there, such as BS4 or lxml. However, my HTML files are not well-formed, so I cannot search for a tag or a similar marker to extract the text I need.

My code partly works the way I want it to, but the extracted text seems to stop at a random point after my regex matches.

import glob
import os
import re
import contextlib


@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    try:
        yield
    finally:
        # restore stdout and close the file even if the body raises
        sys.stdout = sys.__stdout__
        f.close()


def extractor():
    os.chdir(r"F:\Test")  # the directory containing html
    with stdout2file("Test.txt"):
        for file in glob.iglob("*.html"):  # iterates over all files in the directory ending in .html
            with open(file, encoding="utf8") as f:
                contents = f.read()
                regex = re.compile(r'The interesting part begins.*?and for sure there is no other reason', re.S)
                extract = regex.search(contents)
                if extract is not None:  # reuse the match instead of searching twice
                    print(file.split(os.path.sep)[-1], end="| ")
                    print(extract)

extractor()

The output I get is:

Thisismyfirsthtmlfile.html| <_sre.SRE_Match object; span=(75794, 93039), match='The interesting part begins with the notes of Secul>

So it stops abruptly in the middle of the word Secul (the complete word would be Secular). After trying different solutions for a few hours, I still cannot figure out what I have missed.

Any idea on that?
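
One possible explanation, offered as an assumption rather than a confirmed diagnosis: `print(extract)` prints the match object itself, and the text form of a match object only shows a short preview of the matched string. The `span=(75794, 93039)` in the output above suggests the regex did match the whole region. A minimal sketch, using made-up sample text in place of a real HTML file, that compares the preview with `group(0)`:

```python
import re

# Hypothetical sample text standing in for the contents of one HTML file.
contents = (
    "<p>junk</p> The interesting part begins with the notes of Secular "
    "matters,\nspread over several lines, "
    "and for sure there is no other reason <p>more</p>"
)

regex = re.compile(
    r"The interesting part begins.*?and for sure there is no other reason",
    re.S,
)
match = regex.search(contents)

if match is not None:
    # repr() of a match object shows only a shortened preview of the match.
    preview = repr(match)
    # group(0) returns the complete matched string.
    full_text = match.group(0)
    print(preview)
    print(full_text)
```

If this is indeed the cause, printing `extract.group(0)` instead of `extract` in `extractor()` should write the full text to the file.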

Florian Schramm