I am looking for a way to recursively search a repository for all files that contain a multi-line string, and to return the names of those files. The string is a header paragraph of approximately 30 lines.

Below is the approach I am taking, but it is not working.

repo = os.getcwd()

header = """ /*
             /* .......paragraph
             /* ..............
             */
         """

for file in glob.glob(repo):
    with open(file) as f:
        contents = f.read()
    if header in contents:
        print file

I am getting this error:

IOError: [Errno 21] Is a directory: '/home/test/python/repos/projects/one'

Edit: new version, following @zondo's answer:

def findAllFiles(directory):
    gen = os.walk(directory)
    next(gen)
    return [os.path.join(path, f) for path, _, files in gen for f in files]

def main():
    print "Searching directory for copyright header"
    for file in findAllFiles(repo):
        with open(file) as f:
            contents = f.read()
    if header in contents:
        print file
AlG
homeGrown
  • You get the error because you are trying to open a directory. Check whether the path is a file first (with `os.path.isfile`). – Andriy Ivaneyko Feb 03 '16 at 12:47
  • @andriy-ivaneyko But I don't want to open every file individually; there are hundreds of files. It is a git repository, so there are multiple directories with many files. – homeGrown Feb 03 '16 at 12:49
  • Quite similar to your question: http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python – Andriy Ivaneyko Feb 03 '16 at 12:52
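
To illustrate the `os.path.isfile` comment: `glob.glob` called on a plain directory path with no wildcard returns just that path, so `open` is handed a directory, which is where `Errno 21` comes from. The following is only a minimal sketch, assuming Python 3; `files_directly_in` is a hypothetical helper name, and it lists a single level without recursing.

```python
import glob
import os
import tempfile


def files_directly_in(directory):
    """Return the regular files directly inside `directory` (no recursion)."""
    # The "*" wildcard is what makes glob list the directory's entries.
    # Note that "*" does not match names that start with a dot.
    entries = glob.glob(os.path.join(directory, "*"))
    # Entries can be sub-directories, so keep only regular files.
    return sorted(p for p in entries if os.path.isfile(p))


# Small self-contained demonstration in a throwaway directory.
demo = tempfile.mkdtemp()
os.mkdir(os.path.join(demo, "subdir"))
with open(os.path.join(demo, "a.c"), "w") as f:
    f.write("int main(void) { return 0; }\n")

no_wildcard = glob.glob(demo)          # the directory itself, nothing inside it
only_files = files_directly_in(demo)   # a.c, but not subdir
```

This only explains the error; searching a whole repository still needs a recursive walk such as `os.walk`.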

2 Answers

1

With the os module, you can do this:

import os

# Find not only the files in a folder, but also the files in all of its sub-directories
def find_all_files(folder):
    return [os.path.join(path, f) for path, _, files in os.walk(folder) for f in files]

# `repo` and `header` are the variables defined in the question
for file in find_all_files(repo):
    with open(file) as f:
        contents = f.read()
        if header in contents:
            print file
zondo
  • It doesn't return anything. Am I doing something wrong? I posted my implementation above. – homeGrown Feb 03 '16 at 14:19
  • @homeGrown I can't help you there. Are you sure that the header is contained in at least one file? – zondo Feb 03 '16 at 14:23
  • Yes, it is in numerous files. Is the implementation I posted in the question correct? – homeGrown Feb 03 '16 at 14:26
  • @homeGrown You need to indent the `if header in contents:` and the line below it so that they are within the `with` block. – zondo Feb 03 '16 at 14:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/102489/discussion-between-homegrown-and-zondo). – homeGrown Feb 03 '16 at 14:35
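
For reference, the indentation fix from the comments can be sketched as follows. This is an illustration under stated assumptions, not the asker's final code: it is written for Python 3, `find_files_containing` is a hypothetical name, and skipping the `.git` directory is an added assumption that seems reasonable for a git repository.

```python
import os
import tempfile


def find_files_containing(directory, text):
    """Return the paths under `directory` whose contents include `text`."""
    matches = []
    # os.walk yields the top directory first, so its files are searched too.
    for path, dirs, files in os.walk(directory):
        # Assumption: git's internal objects are not worth searching.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in files:
            full_path = os.path.join(path, name)
            try:
                # errors="ignore" lets binary files be read without raising.
                with open(full_path, errors="ignore") as f:
                    contents = f.read()
            except OSError:
                continue  # unreadable file: skip it
            # The check runs once per file, inside the loop.
            if text in contents:
                matches.append(full_path)
    return sorted(matches)


# Throwaway directory tree for a quick demonstration.
demo = tempfile.mkdtemp()
os.mkdir(os.path.join(demo, "sub"))
demo_header = "/*\n * Copyright\n */\n"
with open(os.path.join(demo, "top.c"), "w") as f:
    f.write(demo_header + "int x;\n")
with open(os.path.join(demo, "sub", "nested.h"), "w") as f:
    f.write(demo_header)
with open(os.path.join(demo, "sub", "other.c"), "w") as f:
    f.write("int y;\n")

found = find_files_containing(demo, demo_header)
```

Two other things in the edited question version may also hide matches. The `next(gen)` call discards the first tuple from `os.walk`, which holds the files directly in the top-level directory. And `header in contents` is an exact substring test, so the indentation inside the triple-quoted `header` has to match the files character for character.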
0

Try using subprocess and pcregrep for matching multiple lines in different directories.

from subprocess import call
call(["pcregrep", "-rM","<regular_exp>","<path to directory>"])

I have never tried this; it just came to mind.

sudheesh shetty
  • To search only .c and .h files I used `call(["pcregrep", "-rM", "%s" % header,"--include=*.{c,h}", "%s" % repo])`, but the output is: pcregrep: check the --buffer-size option – homeGrown Feb 03 '16 at 13:16
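
If the `--buffer-size` message gets in the way, one alternative is to do the multi-line match in Python itself. The sketch below is not a tested replacement for pcregrep. It assumes Python 3, and `header_pattern` and `find_c_files_with_header` are hypothetical names. The header is escaped so that characters such as `*` and `.` are matched literally instead of being read as regular-expression operators, and any run of whitespace in the header is allowed to match any run of whitespace in a file.

```python
import os
import re
import tempfile


def header_pattern(header):
    """Compile a regex that matches `header` literally, ignoring whitespace differences."""
    # Split on whitespace, escape each piece, and rejoin with \s+ so that
    # indentation and line-ending differences do not prevent a match.
    pieces = header.split()
    return re.compile(r"\s+".join(re.escape(piece) for piece in pieces))


def find_c_files_with_header(directory, header, extensions=(".c", ".h")):
    """Return .c/.h files under `directory` that contain `header`."""
    pattern = header_pattern(header)
    matches = []
    for path, _, files in os.walk(directory):
        for name in files:
            # str.endswith accepts a tuple of suffixes.
            if not name.endswith(extensions):
                continue
            full_path = os.path.join(path, name)
            with open(full_path, errors="ignore") as f:
                if pattern.search(f.read()):
                    matches.append(full_path)
    return sorted(matches)


# Demonstration: the header is indented differently from the files.
demo = tempfile.mkdtemp()
indented_header = """ /*
             * Copyright notice
             */
         """
with open(os.path.join(demo, "a.c"), "w") as f:
    f.write("/*\n * Copyright notice\n */\nint x;\n")
with open(os.path.join(demo, "b.h"), "w") as f:
    f.write("/*\n\t* Copyright notice\n\t*/\n")
with open(os.path.join(demo, "c.c"), "w") as f:
    f.write("int y;\n")
with open(os.path.join(demo, "notes.txt"), "w") as f:
    f.write("/*\n * Copyright notice\n */\n")

found = find_c_files_with_header(demo, indented_header)
```

Each file is read fully into memory, which is usually fine for source files but is a design choice worth revisiting for very large files.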