My take on this, from high to low level...
Why is this imperfect and difficult? Because, despite the fact you were diligent enough to improve the question, it still leaves a lot undefined, yet it is already rather complicated. Luckily, others have explored text files likewise, and whole programming languages have been developed to deal with that. But even when you've learned some of them thoroughly, you still get bitten, because the computers acting on your specification are mind-blowingly stupid. Fast, but stupid.
Using something out of the box like findstr, egrep... to deal with this particular problem seems next to impossible to me. A programming language like Python is a much more viable and future-proof match.
So then the programming task has two parts:
- Walking a directory tree to visit each file
- Finding the list in each file's contents
As to the latter, regular expressions do look like a viable mechanism, but the first question is, can you afford them? Clearly we need multi-line processing, and whenever I've seen that done, it was on entire files at once. Can you afford to read an entire file into memory? Can you afford to read an entire file from disk at all - perhaps the headers are on top of the files and reading the entire body is wasteful? I'll assume there is no problem.
Using a single regular expression to extract individual attachment names directly from the files seems very complicated (even in a language that supports repeating captures). So I would let a regular expression find the the list first, then split it up. Not even considering whatever it is what you meant with .txt files, and with too few test case stashed in, that brings us to:
import os
import re
searcher = re.compile(r"^Attachments: (.+?)^---+$", flags=re.MULTILINE+re.DOTALL)
def visitFile(filepath, out):
with open(filepath) as f:
match = searcher.search(f.read())
if match:
for name in match.group(1).split('\n')[:-1]:
out.write("%s\t%s\n" % (filepath, name))
def visitFolder(topdirpath, out):
for dirpath, subdirnames, filenames in os.walk(topdirpath):
subdirnames.sort() # if needed
filenames.sort() # if needed
for filename in filenames:
visitFile(os.path.join(dirpath, filename), out)
if __name__ == "main":
visitFolder(sys.argv[1], sys.out)
import io
import tempfile
import unittest
class FolderBasedTestCase(unittest.TestCase):
def setUp(self):
self.tempdir = tempfile.TemporaryDirectory(prefix="test_dir_")
self.out = io.StringIO()
def tearDown(self):
self.tempdir.cleanup()
self.out.close()
def walkthewalk(self):
visitFolder(self.tempdir.name, self.out)
class EmptyFolderTestCase(FolderBasedTestCase):
def runTest(self):
self.walkthewalk()
self.assertEqual(self.out.getvalue(), "")
class FriendTestCase(FolderBasedTestCase):
def setUp(self):
super().setUp()
with open(os.path.join(self.tempdir.name, "friend"), "w") as f:
f.write("Some: Stuff\n" +
"Attachments: Purely Practical.pdf\n" +
"Daily Revenue.xls\n" +
"Advertising_Ideas.doc\n" +
"-------------\n" +
'From: "Mouse, Mickey" Mickey.Mouse@mouseclick.com\n')
def runTest(self):
self.walkthewalk()
self.assertEqual(self.out.getvalue().replace(self.tempdir.name + os.sep, "{p}"),
"{p}friend\tPurely Practical.pdf\n" +
"{p}friend\tDaily Revenue.xls\n" +
"{p}friend\tAdvertising_Ideas.doc\n")
class FooTestCase(FolderBasedTestCase):
def setUp(self):
super().setUp()
with open(os.path.join(self.tempdir.name, "foo"), "w") as f:
f.write("From: your worst enemy\n" +
"\n" +
"Mail body here. This week's topics:\n" +
"Attachments: are't they a pain?\n" +
"Pain: don't we get attached to it?\n" +
"\n")
def runTest(self):
self.walkthewalk()
self.assertEqual(self.out.getvalue(), "")
Beware that the regular expression is (hopefully) independent of the file's linefeed flavour, but, as it stands, the split()
requires the right line separator.
I doubt that compiling and storing the regular expression separately has any performance benefit that anyone would ever notice, but I think for this small amount of code, it actually makes things more readable.
To run the unit tests, in particular if you store both code and test cases in a single file scriptname.py, do `python -m unittest scriptname'.