I have one project where I need to apply a dozen or so regexes to about 100 files using Python. After 4+ hours of searching the web for various combinations, including "(merge|concatenate|stack|join|compile) multiple regex in python", I haven't found any posts that address my need.

This is a mid-sized project for me. There are several smaller regex projects that need only 5-6 patterns applied to a dozen or so files. While these will be a great aid in my work, the granddaddy project is applying a file of 100+ search-and-replace strings to any new file I get. (Spelling conventions in certain languages are not standardized, and being able to process files quickly will increase productivity.)

Ideally, the regex strings need to be updatable by a non-programmer, but that may be outside the scope of this post.

Here is what I have so far:

import os, re, sys # Is "sys" necessary?

path = "/Users/mypath/testData"
myfiles = os.listdir(path)

for f in myfiles:

    # split the filename and file extension for use in renaming the output file
    file_name, file_extension = os.path.splitext(f)
    generated_output_file = file_name + "_regex" + file_extension

    # Only process certain types of files.
    if re.search("txt|doc|odt|htm|html", file_extension):

    # Declare input and output files, open them, and start working on each line.
        input_file = os.path.join(path, f)
        output_file = os.path.join(path, generated_output_file)

        with open(input_file, "r") as fi, open(output_file, "w") as fo:
            for line in fi:

    # I realize that the examples are not regex, but they are in my real data.
    # The important thing, is that each of these is a substitution.
                line = re.sub(r"dog","cat" , line)
                line = re.sub(r"123", "789" , line)
                # Etc.

    # Obviously this doesn't work, because it is only writing the last instance of line.
                fo.write(line)
                fo.close()
BrianWilson
  • What's your question? What problem are you having with your code? – interjay Sep 23 '12 at 10:05
  • You want to make a program that allows the user to use advanced search criteria without having to write the regex himself? – Gabber Sep 23 '12 at 10:12
  • You should check `if file_extension in [".txt", ".html", ...]` instead of doing re.search. But processing doc or odt files via regex will never work, as these are not plain text documents: old doc files are a proprietary binary format, while newer docx files and odt files are actually zip archives containing multiple xml files (the document itself, separate header / footer, style information etc.). Even if you were to unzip one and process the right file, you could have arbitrary formatting syntax in between and even inside of words. – l4mpi Sep 23 '12 at 10:12
  • @interjay 1. I don't know how to apply multiple substitution regexes to a file and write the results to a new file. 2. If the rest of my code is OK, then it will run the list of regexes on all the files of the chosen file types in my directory. – BrianWilson Sep 23 '12 at 11:00
  • @l4mpi Thank you for pointing that out. Actually, I am planning on only processing txt and html files. I added the file type selection to the code because earlier versions of my script were trying to run the regexes on .DS_Store. This made me think that I didn't want any stray .mp3 or .jpg files being processed. I'll remove .doc and .odt from the search. // Please explain why I should use `if file_extension in ...` instead of re.search. – BrianWilson Sep 23 '12 at 11:19
  • @Gabber (Sorry for the truncated post earlier. I tried answering on an Android tablet.) Ideally the regex list would reside in a separate file. But at this stage, I just need basic functionality: getting the script to actually process and write multiple files. – BrianWilson Sep 23 '12 at 11:47
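
The extension check discussed in the comments can be sketched as follows. `os.path.splitext` returns the extension with its leading dot (and an empty string for a name such as `.DS_Store`), so the whitelist has to contain dotted entries. `ALLOWED_EXTENSIONS` and `should_process` are hypothetical names chosen for illustration, and the case-insensitive comparison is an added assumption, not something from the thread.

```python
import os.path

# Hypothetical whitelist. Entries carry the leading dot because
# os.path.splitext returns extensions that way.
ALLOWED_EXTENSIONS = (".txt", ".htm", ".html")


def should_process(filename):
    """Return True if the file's extension is in the whitelist."""
    _, extension = os.path.splitext(filename)
    # Assumption: treat "PAGE.HTML" the same as "page.html".
    return extension.lower() in ALLOWED_EXTENSIONS


for name in ("notes.txt", "page.HTML", ".DS_Store", "song.mp3"):
    print(name, os.path.splitext(name), should_process(name))
```

A plain membership test like this also avoids the false positives a bare `re.search("txt|...")` would give on names that merely contain one of the strings.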

1 Answer

Is this what you're looking for?

Unfortunately you didn't specify how you know which regexes are supposed to be applied, so I put them into a list of tuples (first element is the regex, second is the replacement text).

import os, os.path, re

path = "/Users/mypath/testData"
myfiles = os.listdir(path)
# It's faster to compile your regexes once, before you
# actually use them in a loop
REGEXES = [(re.compile(r'dog'), 'cat'),
           (re.compile(r'123'), '789')]
for f in myfiles:
    # split the filename and file extension for use in
    # renaming the output file
    file_name, file_extension = os.path.splitext(f)
    generated_output_file = file_name + "_regex" + file_extension

    # As l4mpi said, .doc and .odt are not plain text; an .odt is zipped
    # and would need to be unzipped first.
    # A membership test is simpler and faster than re.search
    if file_extension in ('.txt', '.doc', '.odt', '.htm', '.html'):

        # Declare input and output files, open them,
        # and start working on each line.
        input_file = os.path.join(path, f)
        output_file = os.path.join(path, generated_output_file)

        with open(input_file, "r") as fi, open(output_file, "w") as fo:
            for line in fi:
                for search, replace in REGEXES:
                    line = search.sub(replace, line)
                fo.write(line)
        # Both the input and output files are closed automatically
        # when the with block exits
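
The question also mentions keeping the patterns in a file that a non-programmer can edit. Here is a minimal sketch of one way that might look, assuming a hypothetical plain-text rules format with one `pattern<TAB>replacement` pair per line and `#` for comment lines. The names `load_rules` and `apply_rules` and the file format itself are illustrative assumptions, not part of the code above.

```python
import io
import re


def load_rules(lines):
    """Parse 'pattern<TAB>replacement' lines into (compiled_regex, replacement) pairs."""
    rules = []
    for raw in lines:
        entry = raw.rstrip("\n")
        # Skip blank lines and comment lines.
        if not entry.strip() or entry.lstrip().startswith("#"):
            continue
        # A line without a tab yields an empty replacement,
        # i.e. every match of the pattern is deleted.
        pattern, _, replacement = entry.partition("\t")
        rules.append((re.compile(pattern), replacement))
    return rules


def apply_rules(rules, text):
    """Apply every rule in order; later rules see the output of earlier ones."""
    for regex, replacement in rules:
        text = regex.sub(replacement, text)
    return text


# Stand-in for a real rules file; with a file on disk this would be
# something like: with open("rules.txt", encoding="utf-8") as fh: ...
sample_rules = io.StringIO("# pattern<TAB>replacement\ndog\tcat\n123\t789\n")
rules = load_rules(sample_rules)
print(apply_rules(rules, "I like dogs. 0123456789"))
```

Because the replacement text is handed straight to `sub`, group references such as `\1` written in the rules file behave as they would in code. The order of the lines in the file matters, since each substitution runs on the result of the previous one.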
Igor Serko
  • `file_extension` starts with `.` unless empty – jfs Sep 23 '12 at 11:47
  • Thank you, I haven't used it in a while. Fixed it now. – Igor Serko Sep 23 '12 at 11:53
  • Thank you. This is very helpful. I want to apply all regexes in the list. I have basically three different projects: 1. Basic punctuation cleanup. 2. Spelling substitutions for standardization of Lao text. 3. Processing shortcuts for html tagging. When running your script, I do not get any errors, but nothing happens. No new *_regex.txt files are created. – BrianWilson Sep 23 '12 at 11:55
  • Could be because of the bug pointed out by J.F. Sebastian ... try copy/pasting it again – Igor Serko Sep 23 '12 at 11:57
  • Yes. Adding the . to the file_extension check fixed the problem. – BrianWilson Sep 23 '12 at 11:58
  • For testing purposes I made three files t.txt, t2.txt, and t3.txt each with identical text: The numbers in order are 0123456789 I like dogs. The first three letters of the alphabet are abc. I have had many dogs since I was a child. It is a "dog eat dog" world. The first three numbers are 123. abcdefg – BrianWilson Sep 23 '12 at 12:12
  • I applied three regexes as per Igor Serko's code above: [(re.compile(r'dog'), 'cat'), (re.compile(r'123'), '789'), (re.compile(r'abc'), 'xyz')] – BrianWilson Sep 23 '12 at 12:12
  • The results were saved to t_regex.txt, t2_regex.txt and t3_regex.txt, and the original files remained unchanged. Output: The numbers in order are 0789456789 I like cats. The first three letters of the alphabet are xyz. I have had many cats since I was a child. It is a "cat eat cat" world. The first three numbers are 789. xyzdefg So the code works as intended. Thank you. – BrianWilson Sep 23 '12 at 12:13
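
The manual test described in the last few comments can be automated along these lines. This is a sketch that reuses the loop from the answer inside a hypothetical `process_directory` function and runs it against a temporary directory. The function name, the `.txt`-only filter and the shortened sample text are assumptions made for the example.

```python
import os
import re
import tempfile

REGEXES = [(re.compile(r"dog"), "cat"),
           (re.compile(r"123"), "789"),
           (re.compile(r"abc"), "xyz")]


def process_directory(path):
    """Write a *_regex copy of every .txt file in path, using the answer's loop."""
    for name in os.listdir(path):
        base, extension = os.path.splitext(name)
        if extension != ".txt":
            continue
        source = os.path.join(path, name)
        target = os.path.join(path, base + "_regex" + extension)
        with open(source, "r") as fi, open(target, "w") as fo:
            for line in fi:
                for search, replace in REGEXES:
                    line = search.sub(replace, line)
                fo.write(line)


# Shortened version of the sample text from the comments above.
original = ('I like dogs.\n'
            'It is a "dog eat dog" world.\n'
            'The first three numbers are 123.\n'
            'abcdefg\n')

with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "t.txt"), "w") as handle:
        handle.write(original)
    process_directory(workdir)
    with open(os.path.join(workdir, "t_regex.txt")) as handle:
        converted = handle.read()
    with open(os.path.join(workdir, "t.txt")) as handle:
        untouched = handle.read()

print(converted)
```

One thing to watch for: running the function a second time on the same directory would also pick up `t_regex.txt` and produce `t_regex_regex.txt`, so writing the output to a separate folder may be worth considering.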