I am writing some scripts to process text files in Python. Locally the script reads from a single txt file, so I use

index_file = open('index.txt', 'r')
for line in index_file:
    ....

and loop through the file to find a matching string. On Amazon EMR, however, index.txt is split into multiple txt files in a single folder.

I would like to replicate that locally and search multiple txt files for a certain string, but I am struggling to find clean code to do that.

What is the best way to go about it while writing minimal code?

Petros Kyriakou
  • You can use os.walk to get all the files in the directory, loop through them, and then apply your matching string logic for each file. – pmaniyan May 11 '16 at 15:11
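The os.walk suggestion in the comment above could look roughly like the sketch below. The function name iter_lines_walk and the extension parameter are made up for illustration. Note that os.walk also descends into subdirectories, which may or may not be wanted for a flat EMR-style folder.

```python
import os


def iter_lines_walk(directory, extension='.txt'):
    """Yield every line from every matching file under directory.

    Hypothetical helper: the name and the extension parameter are
    illustrative, not taken from the question.
    """
    for root, dirs, files in os.walk(directory):
        dirs.sort()  # sorting in place makes os.walk visit subfolders in a stable order
        for name in sorted(files):
            if name.endswith(extension):
                with open(os.path.join(root, name), 'r') as handle:
                    for line in handle:
                        yield line
```

The matching logic from the single-file version can then run unchanged inside `for line in iter_lines_walk('directory'):`.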

1 Answer

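import os
from glob import glob

def readindex(path):
    # match every .txt file directly inside the given directory
    pattern = '*.txt'
    full_path = os.path.join(path, pattern)
    # glob returns names in arbitrary order, so sort for a stable sequence
    for fname in sorted(glob(full_path)):
        # the with block closes each file once its lines are consumed
        with open(fname, 'r') as f:
            for line in f:
                yield line

# read lines to memory list for using multiple times
linelist = list(readindex("directory"))
for line in linelist:
    print line,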

This script defines a generator (see this question for details about generators) that iterates, in sorted order, through all the files with the extension "txt" in the directory named "directory". To read from a different directory, pass its path to readindex. The generator yields all the lines as one stream, so after calling the function you can iterate through them as if they came from a single open file, which seems to be what the question author wanted. The comma at the end of print line, makes sure that the newline is not printed twice, although the question author would replace the body of the for loop anyway. In that case, line.rstrip() can be used to get rid of the newline.

The glob module finds all pathnames matching a specified pattern according to the rules used by the Unix shell. Because it returns results in arbitrary order, the script sorts them with sorted().
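Since the question is about finding a matching string, here is a minimal sketch of the same glob approach with the search folded in. The name find_matches and its parameters are hypothetical, and it assumes "matching" means the line contains the string as a substring.

```python
import os
from glob import glob


def find_matches(directory, needle, pattern='*.txt'):
    """Yield (filename, line_number, line) for each line containing needle.

    Hypothetical helper: assumes a plain substring test is what
    "matching string" means here.
    """
    for fname in sorted(glob(os.path.join(directory, pattern))):
        with open(fname, 'r') as handle:
            # enumerate from 1 so line numbers match what an editor shows
            for number, line in enumerate(handle, 1):
                if needle in line:
                    yield os.path.basename(fname), number, line.rstrip('\n')
```

Calling `list(find_matches('directory', 'some string'))` would collect every hit together with the file and line number it came from.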

emh
  • While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. – Uyghur Lives Matter May 11 '16 at 16:45
  • Hi emh, how do I specify the directory using the above code? I agree with cpburnz, please add more info on how it works, as I can't make much out of it. – Petros Kyriakou May 11 '16 at 16:57
  • @emh is there any way to make this more efficient? It's really slow. I do want one stream, as you said, but I need to loop through that stream for each line I get from `sys.stdin` (coming from another script), so it gets very slow. Is there a better/faster way to do it? – Petros Kyriakou May 11 '16 at 17:19
  • @PetrosKyriakou I edited the answer to read the lines to a list in memory. Does that make it fast enough? – emh May 11 '16 at 17:26
  • @zondo Yes for one iteration it makes it slower, but as he says he needs to iterate multiple times, for each line of stdin. – emh May 11 '16 at 17:37
  • @emh thanks, I tuned my code up a bit and it's fast enough to test locally without the full set of .txt files. EMR should be OK with analysing the whole set. – Petros Kyriakou May 11 '16 at 21:01
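For the slowdown discussed in these comments, one possible sketch is to read the index files once into a set and test each incoming line against it, instead of re-scanning the list for every line of input. This only works under a strong assumption that is not stated in the thread: that a match means the input line equals an index line exactly. The names load_index and matching_lines are hypothetical.

```python
import os
from glob import glob


def load_index(directory, pattern='*.txt'):
    """Read every line of every matching file into a set, once.

    Hypothetical helper: assumes exact, whole-line matches are wanted.
    """
    entries = set()
    for fname in glob(os.path.join(directory, pattern)):
        with open(fname, 'r') as handle:
            for line in handle:
                entries.add(line.rstrip('\n'))
    return entries


def matching_lines(stream, index):
    """Yield each line of stream (newline stripped) that appears in index."""
    for line in stream:
        key = line.rstrip('\n')
        if key in index:  # set membership avoids a scan of the whole index
            yield key
```

With `index = load_index('directory')`, the loop over input becomes `for hit in matching_lines(sys.stdin, index):` after an `import sys`. If the match is a substring test rather than equality, a set does not help and the in-memory list from the answer remains the simpler choice.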