Script to find text strings between word A and word B. findstr, Regex, or alternatives

Question

I was recently given a large volume of email messages to analyze. Copies were converted into txt and html files and extracted into identical sub-directories. The data was then sorted by field codes and entered into spreadsheets using various cmd/batch scripts. Later it became necessary to identify the file names of every attachment, which created a problem.

findstr was able to successfully identify the path, email message, and file name of each attachment, which it saved to an output log, using the command below:

findstr /s Attachments: *.* >>Find_Attachments_Files2.txt

Unfortunately, findstr will only find the first file name after the word "Attachments:" and nothing more. I need to find, and log, the path, file, and every block of text between "Attachments:" and a 2nd marker, in this case, a series of dashes ("----"), and nothing beyond.

Text messages are similar to the format shown below and not limited to any fixed value/line #:

Attachments: Purely Practical.pdf  
Daily Revenue.xls  
Advertising_Ideas.doc

From: "Mouse, Mickey" Mickey.Mouse@mouseclick.com

The ability to capture blocks of text between marker1 and marker2 is enormously significant and a solution to this particular problem is a broader issue which should be framed accordingly. Although the search and replace function is of great value, the search and report function may be the greatest value of all.

What makes this so imperfect and difficult? Any suggestions or reliable solutions?

As a matter of fact, Python was the only way I was able to extract the original Outlook .msg messages to any other usable format. Bottom line: extremely impressed by the capabilities of Python but simply lack the experience. — UberGeek, Aug 22 '15 at 07:16
Excuse me. Have you _many files_ each one with _one_ section like the shown? Or have you _a single_ large file with several sections like the shown? Any other possibility? — Aacini, Aug 22 '15 at 18:15
@Aacini Sorry I didn't see this earlier, it's very important. Many files - Many subdirectories, none deeper than 3 levels - Path must precede each file name, followed by Attachment file name, preferably separated by a tab. Html files are formatted like the example above. Txt files are similar, except "Headers:" are replaced by "-----" (very long string of dashes). Output should look similar to the findstr command in my question as follows: R:\Emails\Inbox\000023Estimated revenue questions.txt Estimated Revenue_Mar_08.xls (29,232) — UberGeek, Aug 23 '15 at 15:50
You changed the question. Giving false details to get exact programming concepts and code should be made a hanging offense. — foxidrive, Aug 24 '15 at 08:30
Frustration, guilt, and now a hang'in! By adding clarity to my original question I never intended to be deceitful. I thought we were working together, on a solution. My apologies. — UberGeek, Aug 24 '15 at 08:57

score 1 · Answer 1 · answered Dec 27 '15 at 19:13

I am not sure if this will work and it is only a theory but I think that it is worth giving a try.

I think there is a possibility that the findstr command sets an errorlevel after it finishes executing: one errorlevel when it finds the string, and a different errorlevel when it doesn't.

If this does work then you can do something similar to a while loop e.g.

:A  
rem Put the full findstr command on the next line  
findstr  
rem findstr sets errorlevel 0 when the string has been found, 1 when it has not  
if %errorlevel% equ 0 goto A  
rem The rest of your code  
goto B

This is only a theory.

To save the output, you should be able to redirect it, something like this: `command >>log.txt`. This appends the output of the command to the text file called log.txt.

score 0 · Answer 2 · edited May 23 '17 at 12:15

My take on this, from high to low level...

Why is this imperfect and difficult? Because, despite the fact you were diligent enough to improve the question, it still leaves a lot undefined, yet it is already rather complicated. Luckily, others have explored text files likewise, and whole programming languages have been developed to deal with that. But even when you've learned some of them thoroughly, you still get bitten, because the computers acting on your specification are mind-blowingly stupid. Fast, but stupid.

Using something out of the box like findstr, egrep... to deal with this particular problem seems next to impossible to me. A programming language like Python is a much more viable and future-proof match.

So then the programming task has two parts:

Walking a directory tree to visit each file
Finding the list in each file's contents

As to the latter, regular expressions do look like a viable mechanism, but the first question is, can you afford them? Clearly we need multi-line processing, and whenever I've seen that done, it was on entire files at once. Can you afford to read an entire file into memory? Can you afford to read an entire file from disk at all - perhaps the headers are on top of the files and reading the entire body is wasteful? I'll assume there is no problem.
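If that assumption did not hold, a line-by-line scan is one alternative that never reads past the end marker. The following is only a minimal sketch under assumptions of its own: the function name `iter_attachment_names` is hypothetical, any line starting with three dashes is treated as the end marker, and blank lines inside the list are skipped. Unlike a regular expression that requires both markers, it also yields names when the dashes never appear.

```python
import io

def iter_attachment_names(lines):
    """Yield attachment names from an iterable of lines, stopping at the dash marker."""
    in_list = False
    for line in lines:
        line = line.rstrip("\r\n")
        if not in_list:
            if line.startswith("Attachments: "):
                in_list = True
                # The first name shares a line with the marker itself.
                yield line[len("Attachments: "):].strip()
        elif line.startswith("---"):
            return  # end marker reached; nothing after it is read
        elif line.strip():
            yield line.strip()

sample = io.StringIO(
    "Some: Stuff\n"
    "Attachments: Purely Practical.pdf  \n"
    "Daily Revenue.xls  \n"
    "Advertising_Ideas.doc\n"
    "-------------\n"
    'From: "Mouse, Mickey" Mickey.Mouse@mouseclick.com\n'
)
print(list(iter_attachment_names(sample)))
```

Because an open file object is itself an iterable of lines, it can be passed in directly, and iteration stops as soon as the generator returns.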

Using a single regular expression to extract individual attachment names directly from the files seems very complicated (even in a language that supports repeating captures). So I would let a regular expression find the list first, then split it up. Not even considering whatever it is that you meant with .txt files, and with too few test cases stashed in, that brings us to:

import os
import re
import sys

# Capture everything between a line starting with "Attachments: " and the
# next line made up only of dashes (three or more).
searcher = re.compile(r"^Attachments: (.+?)^---+$", flags=re.MULTILINE | re.DOTALL)

def visitFile(filepath, out):
    with open(filepath) as f:
        match = searcher.search(f.read())
        if match:
            # The captured block ends with a newline, so drop the empty last item.
            for name in match.group(1).split('\n')[:-1]:
                out.write("%s\t%s\n" % (filepath, name))

def visitFolder(topdirpath, out):
    for dirpath, subdirnames, filenames in os.walk(topdirpath):
        subdirnames.sort() # if needed
        filenames.sort() # if needed
        for filename in filenames:
            visitFile(os.path.join(dirpath, filename), out)

if __name__ == "__main__":
    visitFolder(sys.argv[1], sys.stdout)

import io
import tempfile
import unittest

class FolderBasedTestCase(unittest.TestCase):
    def setUp(self):
        self.tempdir = tempfile.TemporaryDirectory(prefix="test_dir_")
        self.out = io.StringIO()
    def tearDown(self):
        self.tempdir.cleanup()
        self.out.close()
    def walkthewalk(self):
        visitFolder(self.tempdir.name, self.out)

class EmptyFolderTestCase(FolderBasedTestCase):
    def runTest(self):
        self.walkthewalk()
        self.assertEqual(self.out.getvalue(), "")

class FriendTestCase(FolderBasedTestCase):
    def setUp(self):
        super().setUp()
        with open(os.path.join(self.tempdir.name, "friend"), "w") as f:
            f.write("Some: Stuff\n" +
                    "Attachments: Purely Practical.pdf\n" +
                    "Daily Revenue.xls\n" +  
                    "Advertising_Ideas.doc\n" +
                    "-------------\n" +
                    'From: "Mouse, Mickey" Mickey.Mouse@mouseclick.com\n')
    def runTest(self):
        self.walkthewalk()
        self.assertEqual(self.out.getvalue().replace(self.tempdir.name + os.sep, "{p}"), 
            "{p}friend\tPurely Practical.pdf\n" +
            "{p}friend\tDaily Revenue.xls\n" +
            "{p}friend\tAdvertising_Ideas.doc\n")

class FooTestCase(FolderBasedTestCase):
    def setUp(self):
        super().setUp()
        with open(os.path.join(self.tempdir.name, "foo"), "w") as f:
            f.write("From: your worst enemy\n" +
                    "\n" +
                    "Mail body here. This week's topics:\n" +
                    "Attachments: aren't they a pain?\n" +
                    "Pain: don't we get attached to it?\n" +
                    "\n")
    def runTest(self):
        self.walkthewalk()
        self.assertEqual(self.out.getvalue(),  "")

Beware that the regular expression is (hopefully) independent of the file's linefeed flavour, but, as it stands, the split() requires the right line separator.
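One way to take the line separator out of the equation is `str.splitlines()`, which splits on `\n`, `\r\n` and `\r` alike and leaves no empty trailing item. The sketch below is a variation, not the code above: the helper name `attachment_names` is hypothetical, the `\r?` in the pattern is an added assumption for text that still contains raw `\r\n` (for example when a file is opened with `newline=''`), and `strip()` is added to drop trailing spaces around each name.

```python
import re

# Same idea as the searcher above, with an optional "\r" before the end of
# the dash line (an assumption for text with unconverted "\r\n" endings).
searcher = re.compile(r"^Attachments: (.+?)^---+\r?$", flags=re.MULTILINE | re.DOTALL)

def attachment_names(text):
    """Return the attachment names found in text, or an empty list."""
    match = searcher.search(text)
    if not match:
        return []
    # splitlines() handles "\n", "\r\n" and "\r" without a trailing empty item.
    return [name.strip() for name in match.group(1).splitlines()]

print(attachment_names("Attachments: a.pdf\r\nb.xls\r\n-----\r\n"))
print(attachment_names("Attachments: a.pdf\nb.xls\n-----\n"))
```

Both calls should produce the same two names. When files are opened in Python's default text mode, line endings are already translated to `\n`, so this only matters if that translation is switched off.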

I doubt that compiling and storing the regular expression separately has any performance benefit that anyone would ever notice, but I think for this small amount of code, it actually makes things more readable.

To run the unit tests, in particular if you store both code and test cases in a single file scriptname.py, do `python -m unittest scriptname`.
