4

I am trying to run the script below on all *.txt files in the current directory. Currently it processes only test.txt and prints a block of text based on a regular expression. What would be the quickest way of scanning the current directory for *.txt files and running the script on each one found? Also, how could I include the lines containing 'word1' and 'word3'? Currently the script prints only the content between those two lines, and I would like to print the whole block.

#!/usr/bin/env python
import re

file = 'test.txt'
with open(file) as fp:
    for result in re.findall('word1(.*?)word3', fp.read(), re.S):
        print result

I would appreciate any advice or suggestions on how to improve the above code, e.g. its speed when run on a large set of text files. Thank you.

user3066287
  • Very closely related: [Find all files in directory with extension .txt with python](http://stackoverflow.com/q/3964681/710446) – apsillers Dec 04 '13 at 14:54
  • @apsillers thanks for your input, I saw this one however wasn't sure which solution is optimal...? – user3066287 Dec 04 '13 at 15:08

2 Answers

6

Use glob.glob:

import re
import glob

pattern = re.compile('word1(.*?)word3', flags=re.S)
for file in glob.glob('*.txt'):
    with open(file) as fp:
        for result in pattern.findall(fp.read()):
            print result
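
For the second part of the question (printing the lines containing 'word1' and 'word3' as well, not only the text between them), the pattern can be anchored on line boundaries. A minimal sketch, using invented sample text:

```python
import re

# Sketch: re.S lets .*? span newlines, and re.M lets ^ and $ match
# at line boundaries, so the matched block includes the full lines
# containing word1 and word3 rather than only the text between them.
block_pattern = re.compile(r'^[^\n]*word1.*?word3[^\n]*$', re.S | re.M)

# Invented sample text for illustration
sample = "before\nfoo word1 bar\nmiddle line\nbaz word3 qux\nafter"
blocks = block_pattern.findall(sample)
print(blocks[0])
```

Since the pattern has no capturing group, findall() returns the whole match instead of only a captured sub-part.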
falsetru
  • Is there any advantage in comparison to below example: – user3066287 Dec 04 '13 at 15:13
  • @user3066287, Both versions are almost the same. – falsetru Dec 04 '13 at 15:14
  • Is there any advantage in comparison to the example below: `import os for root, dirs, files in os.walk("/mydir"): for file in files: if file.endswith(".txt"): print os.path.join(root, file)` – user3066287 Dec 04 '13 at 15:21
  • @user3066287, `glob.glob('*.txt')` only finds `txt` files inside the current directory, while the `os.walk` version you commented also searches subdirectories recursively. – falsetru Dec 04 '13 at 15:22
  • thank you for your input, would you have any advice regarding the second part of my question please? – user3066287 Dec 04 '13 at 15:29
  • @user3066287, Compiling the regular expression with `re.compile` will speed things up slightly, but not by much. I updated the answer to use `re.compile`. – falsetru Dec 04 '13 at 15:32
  • thank you, with a large collection of txt files it would make a big difference, am I correct? – user3066287 Dec 04 '13 at 15:53
  • @user3066287, Compiled versions of the most recently used regular expressions are cached, so the difference will not be big. – falsetru Dec 04 '13 at 16:11
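
The `os.walk` alternative from the comments can be sketched like this; unlike `glob.glob('*.txt')`, it descends into subdirectories:

```python
import os

# Recursive variant discussed in the comments: os.walk visits every
# subdirectory, while glob.glob('*.txt') only scans one directory.
def find_txt_files(root):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.txt'):
                yield os.path.join(dirpath, name)
```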
0

Inspired by falsetru's answer, I rewrote my code to make it more generic.

Now the files to explore:

  • can be described either by a string passed as the second argument, which will be used by glob(),
    or by a function written specifically for this goal, in case the set of desired files can't be described with a glob-like pattern

  • and may be in the current directory if no third argument is passed,
    or in a specified directory if its path is passed as the third argument.

import re,glob
from itertools import ifilter
from os import getcwd,listdir,path
from inspect import isfunction

regx = re.compile('^[^\n]*word1.*?word3.*?$',re.S|re.M)

G = '\n\n'\
    'MWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMW\n'\
    'MWMWMW  %s\n'\
    'MWMWMW  %s\n'\
    '%s%s'

def search(REGX, how_to_find_files, dirpath='',
           G=G,sepm = '\n======================\n'):
    if dirpath=='':
        dirpath = getcwd()

    if isfunction(how_to_find_files):
        # listdir() yields bare names: join them with dirpath so that
        # path.isfile() and open() work outside the current directory
        gen = ifilter(how_to_find_files,
                      ifilter(path.isfile,
                              (path.join(dirpath, fn)
                               for fn in listdir(dirpath))))
    elif isinstance(how_to_find_files,str):
        gen = glob.glob(path.join(dirpath,
                                  how_to_find_files))

    for fn in gen:
        with open(fn) as fp:
            found = REGX.findall(fp.read())
            if found:
                yield G % (dirpath,path.basename(fn),
                           sepm,sepm.join(found))

# Example of searching in .txt files

#============ one use ===================
def select(fn):
    return fn[-4:]=='.txt'
print ''.join(search(regx, select))

#============= another use ==============
print ''.join(search(regx,'*.txt'))

The advantage of chaining the processing of several files through a succession of generators is that the final ''.join() builds one string that is written in a single operation, whereas printing several individual strings one after another takes longer because each print triggers a separate display update.

eyquem