I have over 200,000 .txt files containing data I need to extract, such as address, name, and amount paid. Given the size of the project and the complexity of the data, what is the best way to approach this?
I am currently trying to use the built-in re module to search each file one by one for the relevant info. This is what I have so far:
import re

BBL_raw = re.compile(r'''
    Borough,\s+[Bb]lock\s+&\s+[Ll]ot:\s+\w+\s+\((\d)\),\s+(\d{5}),\s+(\d{4})\s+
''', re.VERBOSE)

BBLs = []
for filename in filepaths:
    with open(filename, 'r') as readit:
        # Flatten the file into one long string before searching
        readfile = readit.read().replace('\n', '')
    bblsearch = BBL_raw.search(readfile)
    if bblsearch:  # search() returns None when nothing matches, so guard before .groups()
        BBLs.append('\\'.join(bblsearch.groups()))  # join the groups with a literal backslash
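For reference, the pattern above is meant to match header lines that look roughly like this (the borough name and numbers here are made up for illustration):

Borough, Block & Lot:    MANHATTAN (1), 00373, 0001

The three captured groups are the borough number, the five-digit block, and the four-digit lot.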
I can imagine this will be incredibly tedious and take a very long time to run across all 200,000 files, and I'm not sure it's even feasible this way. I also found a reference script (linked below), but being fairly new to Python, I'm having trouble understanding it and adapting it to my needs.
https://github.com/talos/nyc-stabilization-unit-counts/blob/master/parse.py
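In case it helps show what I'm aiming for, below is the kind of parallel version I was imagining to speed up the scan, using concurrent.futures from the standard library. The extract_bbl helper and the glob path are just my own guesses at how to structure it, and I haven't tested this at scale:

import glob
import re
from concurrent.futures import ProcessPoolExecutor

BBL_raw = re.compile(r'''
    Borough,\s+[Bb]lock\s+&\s+[Ll]ot:\s+\w+\s+\((\d)\),\s+(\d{5}),\s+(\d{4})\s+
''', re.VERBOSE)

def extract_bbl(filename):
    # Read one file and return the backslash-joined groups, or None if no match
    with open(filename, 'r') as f:
        text = f.read().replace('\n', '')
    match = BBL_raw.search(text)
    return '\\'.join(match.groups()) if match else None

if __name__ == '__main__':
    # Placeholder path; adjust to wherever the txt files actually live
    filepaths = glob.glob('txt_files/*.txt')
    # Spread the files across CPU cores; chunksize reduces per-task overhead
    with ProcessPoolExecutor() as executor:
        results = executor.map(extract_bbl, filepaths, chunksize=100)
    BBLs = [r for r in results if r is not None]

Would something like this be a reasonable direction, or is there a better approach for a job this size?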