
I have over 200,000 .txt files containing data I need to extract, such as addresses, names, and amounts paid. Given the size of the project and the complexity of the data, what is the best way to do this?

I am currently using the re module to search each file for the relevant info one by one. This is what I have:

import re

BBL_raw = re.compile(r'''
    Borough,\s+[Bb]lock\s+&\s+[Ll]ot\:\s+\w+\s+\((\d)\),\s+(\d{5}),\s+(\d{4})\s+
    ''', re.VERBOSE)

BBLs = []

for filename in filepaths:                 # filepaths is my list of the .txt file paths
    with open(filename, 'r') as readit:
        readfile = readit.read().replace('\n', '')
        bblsearch = BBL_raw.search(readfile)
        if bblsearch is None:              # skip files where the pattern is not found
            continue
        tup = bblsearch.groups()
        string = '\\'.join(tup)
        BBLs.append(string)

I can imagine that this will be incredibly tedious and take a very long time to run across all 200,000+ files, and I'm not even sure it is feasible. I also have a reference script below, but being fairly new to Python I am having trouble understanding it and adapting it to my needs.

https://github.com/talos/nyc-stabilization-unit-counts/blob/master/parse.py

yc3200

1 Answer


I would use pandas to manage the extracted data; you can check it out here:

https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
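
For example, once the tuples from the regex search are collected, they can be loaded into a DataFrame and saved. Here is a minimal sketch; the records list and the column names are just illustrative placeholders, not something from your data:

import pandas as pd

# records would be the list of (borough, block, lot) tuples gathered from the files;
# these two rows are made-up placeholders just to show the shape of the data.
records = [('1', '00123', '0045'), ('2', '00456', '0078')]

df = pd.DataFrame(records, columns=['borough', 'block', 'lot'])
df.to_csv('bbls.csv', index=False)   # write the extracted data out for later analysis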

As for extracting data from the files, you can run multiple threads to try to speed things up. But keep in mind that creating threads has its own overhead, and since the work is mostly disk I/O, adding threads can end up slowing the process instead.

You can read more about threading here: https://docs.python.org/3/library/threading.html
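
If you do want to try threads, concurrent.futures keeps the code short. Below is a minimal sketch assuming the same BBL_raw pattern from your question and a hypothetical data/ directory holding the .txt files; the worker count is only a starting guess:

import re
import glob
from concurrent.futures import ThreadPoolExecutor

BBL_raw = re.compile(r'Borough,\s+[Bb]lock\s+&\s+[Ll]ot\:\s+\w+\s+\((\d)\),\s+(\d{5}),\s+(\d{4})\s+')

filepaths = glob.glob('data/*.txt')      # hypothetical location of the 200,000+ files

def extract_bbl(filename):
    # Read one file and return the matched groups, or None if the pattern is absent.
    with open(filename, 'r') as f:
        text = f.read().replace('\n', '')
    match = BBL_raw.search(text)
    return match.groups() if match else None

with ThreadPoolExecutor(max_workers=8) as pool:   # tune max_workers for your machine
    results = list(pool.map(extract_bbl, filepaths))

BBLs = ['\\'.join(tup) for tup in results if tup is not None]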

Another issue with using threads in Python is the GIL (Global Interpreter Lock); here is a reference: https://docs.python.org/3/c-api/init.html#thread-state-and-the-global-interpreter-lock

Reading Mike McKerns' solution may also help you: https://stackoverflow.com/a/28613077/10473393
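
If the regex matching itself (CPU work) turns out to be the bottleneck rather than disk reads, separate processes sidestep the GIL. Here is a minimal sketch with multiprocessing.Pool, reusing the hypothetical extract_bbl function and filepaths list from the threading sketch above; note the function must be defined at module level so the pool can pickle it:

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(processes=4) as pool:      # roughly one process per CPU core is a common start
        results = pool.map(extract_bbl, filepaths)
    BBLs = ['\\'.join(tup) for tup in results if tup is not None]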

Alexander Santos