
I am working on a project where I have raw data that I need to extract from each txt file (around 300 000 of them) only once, and then move the file to another batch of 300 000 files. Therefore I need to open the txt files one after the other in the most efficient way possible to minimize the time this process takes. I'm using open(), but it can take up to 10-15 min for roughly 160 000 txt files of no more than 600 bytes each.

Thank you for your time :)

import os
import re

for filename in os.listdir("folder1"):

    with open(os.path.join("folder1", filename), 'r') as f:
        # Read the whole file and split it into word tokens.
        text = f.read()
        text = re.findall(r'\w+', text)
        index = 0

        # Walk the tokens and dispatch on the marker words.
        while index < len(text):
            if text[index] == "P1":
                function1(text)

            elif text[index] == "T1":
                function2(text)

            index += 1
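
For comparison, here is a minimal sketch of the same loop using os.scandir() (which avoids an extra stat call per file compared with os.listdir() plus os.path.join()) and shutil.move() to relocate each file once it has been read, since the question mentions moving the processed files to another batch. The destination folder name "folder2" and the function1/function2 callables are assumptions taken from the question, not tested code.

    import os
    import re
    import shutil

    SOURCE = "folder1"
    DEST = "folder2"   # assumed destination for the processed batch

    def process_all():
        os.makedirs(DEST, exist_ok=True)
        with os.scandir(SOURCE) as entries:
            for entry in entries:
                if not entry.is_file():
                    continue
                # Read the whole file in one call; the files are tiny (< 600 bytes).
                with open(entry.path, "r") as f:
                    words = re.findall(r"\w+", f.read())
                # Dispatch on the marker tokens found in the file.
                for word in words:
                    if word == "P1":
                        function1(words)
                    elif word == "T1":
                        function2(words)
                # Move the file so it is only ever processed once.
                shutil.move(entry.path, os.path.join(DEST, entry.name))
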
  • You can use the multiprocessing library in python to split the work up between CPU cores. [Multiprocessing Library](https://docs.python.org/3/library/multiprocessing.html). If you need to maintain concurrency, you can also explore [Threading](https://docs.python.org/3/library/threading.html) – Lateralus Oct 12 '22 at 12:22
  • @Lateralus: chances are that this will slow down the whole process due to simultaneous disk accesses. –  Oct 12 '22 at 12:23
  • @YvesDaoust Hi, thank you for your answer. I am using an SSD already and an Intel Core i5 – Youssef Sabaa Oct 12 '22 at 12:33
  • Here is a link to similar question resolved using multiprocessing: https://stackoverflow.com/a/36590187/10151980 – Lateralus Oct 12 '22 at 12:59
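
In the spirit of the multiprocessing suggestion in the comments above, a rough sketch of splitting the per-file work across a process pool might look like the following. Whether it actually helps depends on how much CPU time function1/function2 spend relative to disk I/O, as the second comment warns; the chunksize value is a guess, not a measured setting, and function1/function2 are assumed to be defined at module level so the worker processes can see them.

    import os
    import re
    from multiprocessing import Pool

    SOURCE = "folder1"

    def handle_file(path):
        # Each worker process reads and tokenises one file, then dispatches.
        with open(path, "r") as f:
            words = re.findall(r"\w+", f.read())
        for word in words:
            if word == "P1":
                function1(words)
            elif word == "T1":
                function2(words)

    if __name__ == "__main__":
        paths = [os.path.join(SOURCE, name) for name in os.listdir(SOURCE)]
        with Pool() as pool:   # defaults to os.cpu_count() worker processes
            # chunksize batches several paths per task to cut inter-process overhead.
            pool.map(handle_file, paths, chunksize=256)
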
