Print RegEx match to file Python

Question

First of all: I know you shouldn't use RegEx to parse HTML, because there are plenty of good parsers out there, such as BS4 or lxml. However, since my HTML files are not well formed, I cannot search for a tag or any other marker to extract the text I need.

My code partly works the way I want, but it seems to stop at a random point after it matches my RegEx.

import glob
import os
import re
import contextlib


@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()


def extractor():
    os.chdir(r"F:\Test")  # the directory containing html
    with stdout2file("Test.txt"):
        for file in glob.iglob("*.html"):  # iterates over all files in the directory ending in .html
            with open(file, encoding="utf8") as f:
                contents = f.read()
                regex = re.compile(r'The interesting part begins.*?and for sure there is no other reason', re.S)
                extract = regex.search(contents)
                if re.search(regex, contents) is not None:
                    print(file.split(os.path.sep)[-1], end="| ")
                    print(extract)

extractor()

The output I get is:

Thisismyfirsthtmlfile.html| <_sre.SRE_Match object; span=(75794, 93039), match='The interesting part begins with the notes of Secul>

So the output stops abruptly in the middle of the word Secul (the complete word would be Secular). After trying different solutions for a few hours, I still cannot figure out what I have missed.

Any idea on that?

looks like the `repr(match)` is being cut off, probably it *has* matched everything you wanted - the giveaway is that it says `match='...` without a closing single quote before the `>`. — Keith Hall, Jul 28 '16 at 06:07
Replace `if re.search(regex, contents) is not None:` with `if extract:` and access the value you need via `extract.group()`. — Wiktor Stribiżew, Jul 28 '16 at 06:10
The result of a call to `regex.search` isn’t a string, but an `SRE_Match` object. You want `extract.group(0)`. — 2ps, Jul 28 '16 at 06:11
@2ps: Thanks for your solution, that has been the problem! Could you give me a short explanation how group(0) works? — Florian Schramm, Jul 28 '16 at 06:20
@FlorianSchramm: [it works alright](https://ideone.com/NFfW89). — Wiktor Stribiżew, Jul 28 '16 at 06:22
The `group()` accesses the *match value*. `group(0)` does the same. See [`re`](https://docs.python.org/2/library/re.html#re.MatchObject.group) reference: *Without arguments, `group1` defaults to zero (the whole match is returned)*. — Wiktor Stribiżew, Jul 28 '16 at 06:32
@WiktorStribiżew: Sorry, you're right. I forgot to append `.group()` to `extract`. So yes, your solution also works fine! Thanks for your short explanation of `group()`. — Florian Schramm, Jul 28 '16 at 07:20
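
Putting the comments above together, here is a minimal sketch of the fix, not a definitive rewrite of the question's code. It searches once, reuses the match object, and writes `extract.group(0)` instead of the match object itself, whose printed form is only an abbreviated debugging view. The helper name `format_match`, the `sample` text, and the `io.StringIO` object standing in for the output file are assumptions made for illustration; they do not appear in the original question.

```python
import io
import re

REGEX = re.compile(
    r"The interesting part begins.*?and for sure there is no other reason",
    re.S,
)


def format_match(name, contents):
    """Return 'name| matched text', or None if the pattern is not found."""
    extract = REGEX.search(contents)  # search once and reuse the result
    if extract is None:
        return None
    # group(0), or simply group(), is the whole matched string; printing the
    # match object itself only shows a shortened repr meant for debugging.
    return f"{name}| {extract.group(0)}"


# Hypothetical sample standing in for the contents of one HTML file.
sample = (
    "<p>intro</p>The interesting part begins with the notes of Secular "
    "writers, and for sure there is no other reason<p>outro</p>"
)

out = io.StringIO()  # stand-in for open("Test.txt", "w", encoding="utf8")
line = format_match("Thisismyfirsthtmlfile.html", sample)
if line is not None:
    print(line, file=out)  # writes to the file object; sys.stdout is untouched
result = out.getvalue()
```

Passing `file=` to `print` is one way to avoid swapping out `sys.stdout` as the question's `stdout2file` helper does; in a real run the `io.StringIO` object would be replaced by a file opened in a `with` block.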
