7

I am currently pulling .txt files from the paths in FileNameList, which is working. But my main problem is that it is too slow when there are too many files.

I am using this code to print the list of .txt files:

import os

# FileNameList is my list of folder paths
for filefolder in FileNameList:
    for file in os.listdir(filefolder):
        if file.endswith(".txt"):
            filename = os.path.join(filefolder, file)
            print(filename)

Any help or suggestion on using threads/multiprocessing to make the reading faster would be appreciated. Thanks in advance.

Syntax Rommel
  • Multithreading/multiprocessing is not going to speed this up; your bottleneck is the storage device. – Cyphase Aug 11 '15 at 06:28
  • Is it possible to use a filetype other than .txt, or to convert to another filetype in advance and then have the speedup when you need the files? Then a massive speedup would be possible. – HeinzKurt Aug 11 '15 at 07:40
  • What do you mean @HeinzKurt? – Cyphase Aug 11 '15 at 09:49
  • I want to know if it is possible to replace the .txt filetype by another filetype with better compression. – HeinzKurt Aug 11 '15 at 10:06
  • @HeinzKurt, interesting thought, but it would have to be a particular situation in which that would speed up the program. – Cyphase Aug 11 '15 at 10:37
  • I can think of two possible situations: 1. you create the .txt file yourself, and can thus write to another filetype. 2. You have time to convert the .txt file, because you receive it long before you actually need it. But that is something Bruno Rein has to tell. – HeinzKurt Aug 11 '15 at 11:23

4 Answers

5

"So you mean there is no way to speed this up? Because my scenario is to read a bunch of files, then read each of their lines and store them in the database."

The first rule of optimization is to ask yourself if you should bother. If your program is run only once or a couple of times, optimizing it is a waste of time.

The second rule is that before you do anything else, you measure where the problem lies:

Write a simple program that sequentially reads files, splits them into lines and stuffs those in a database. Run that program under a profiler to see where the program is spending most of its time.

Only then do you know which part of the program needs speeding up.
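
As a rough illustration of that measurement step (not part of the original answer), a minimal profiling run with the standard cProfile module could look like this; the read_and_store function and the FileNameList placeholder are assumptions standing in for your real code:

import cProfile
import os

def read_and_store(folders):
    # Sequential baseline: walk each folder and read every .txt file line by line.
    for folder in folders:
        for name in os.listdir(folder):
            if name.endswith(".txt"):
                with open(os.path.join(folder, name)) as f:
                    for line in f:
                        pass  # the database insert would go here

if __name__ == "__main__":
    FileNameList = ["."]  # placeholder: the directories from the question
    cProfile.run("read_and_store(FileNameList)", sort="cumulative")

The profiler output then tells you whether the time goes into listing directories, reading the files, or the (here omitted) database inserts.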


Here are some pointers nevertheless.

  • Speeding up the reading of files can be done using mmap.
  • You could use multiprocessing.Pool to spread out the reading of multiple files over different cores (a minimal sketch follows this list). But then the data from those files will end up in different processes and would have to be sent back to the parent process using IPC. This has significant overhead for large amounts of data.
  • In the CPython implementation of Python, only one thread at a time can be executing Python bytecode. While the actual reading from files isn't inhibited by that, processing the results is. So it is questionable whether threads would offer an improvement.
  • Stuffing the lines into a database will probably always be a major bottleneck, because that is where everything comes together. How much of a problem this is depends on the database. Is it in-memory or on disk, does it allow multiple programs to update it simultaneously, et cetera.
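
As a sketch of the multiprocessing.Pool idea from the list above (the read_lines helper and the placeholder folder list are assumptions, not code from the question), the parallel read could look roughly like this:

import os
from multiprocessing import Pool

def read_lines(path):
    # Worker: read one file and return its lines (these travel back to the parent via IPC).
    with open(path) as f:
        return f.readlines()

if __name__ == "__main__":
    FileNameList = ["."]  # placeholder: the folders from the question
    txt_files = [os.path.join(folder, name)
                 for folder in FileNameList
                 for name in os.listdir(folder)
                 if name.endswith(".txt")]
    with Pool() as pool:  # one worker process per CPU core by default
        for lines in pool.imap_unordered(read_lines, txt_files):
            pass  # store the lines in the database here; this is often the real bottleneck

Whether this wins anything depends on how much of the time is spent reading versus inserting, which is exactly what the profiling step above is meant to reveal.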
Roland Smith
4

Multithreading or multiprocessing is not going to speed this up; your bottleneck is the storage device.

Cyphase
  • So you mean there is no way to speed this up? Because my scenario is to read a bunch of files, then read each of their lines and store them in the database. – Syntax Rommel Aug 11 '15 at 06:31
  • That's not what your code in the question is doing. Even in that case, unless you're doing a lot of processing on that data, your bottleneck is going to be IO; reading from the storage device and writing to the database. – Cyphase Aug 11 '15 at 06:33
  • Imagine you have 10 people (threads) carrying 1 bucket of water (data) per minute from one place (the storage device) to another (the database); if there's only 1 bucket of water (data) per minute being created at the source (the storage device), then no matter how many people (threads) you have, you can only move 1 bucket of water (data) per minute; and in fact, having more people (threads) is going to slow you down because of all the overhead. – Cyphase Aug 11 '15 at 09:56
  • Where having more people (threads) _would_ help is if you have a lengthy water purification (processing) step between getting the water (data) from the source (storage device) and taking it to the destination (database). Then you could have 10 people (threads) purifying (processing) at the same time. Note that when I say threads, I mean threads or processes. In Python specifically, you need to use processes to speed up computation, because of the Global Interpreter Lock (GIL). – Cyphase Aug 11 '15 at 09:57
  • I appreciate your information, but I currently have a temporary solution using asynchronous reading; my problem is that I cannot tell when the reading is done. – Syntax Rommel Aug 11 '15 at 23:24
  • @BrunoRein, I'm not sure what you mean. If this is part of the same question, you should edit any clarifications into the question. If it's a different question, you should post a new question. – Cyphase Aug 11 '15 at 23:26
  • @BrunoRein, this question could do with more of your actual code :). – Cyphase Aug 11 '15 at 23:26
  • So that is my actual code for the first part, printing the list of files; I just need a thread to make it fast. – Syntax Rommel Aug 12 '15 at 00:57
  • @BrunoRein, did you make an edit? I don't see it yet. If not, then again, multithreading/multiprocessing is not going to speed up getting a list of files. The bottleneck is the storage device. – Cyphase Aug 12 '15 at 01:02
  • Not true. I had a 16x speedup reading 64 files when I used 64 threads (which is what my CPU can handle), as opposed to a single thread reading files one by one. – Edy Bourne Jul 17 '20 at 22:06
  • What language? Were you just reading, or doing some kind of processing? What sort of storage device (HDD, SSD, etc)? Were they all on the same storage device? How fast were the files being read? How did you measure the speedup? @EdyBourne – Cyphase Jul 20 '20 at 02:26
  • @Cyphase It's Python, reading the file with a bit of processing; I split the content from CSV into a NumPy array, which was likely causing me to be CPU-bound. All files were read from the same device, an NVMe array. I measured by having nothing else running other than the OS processes on Linux Mint, and a simple nanosecond-based timer around the file-reading function. My point is that there are times when multithreading does help quite a bit, namely when you are CPU-bound. I'm not saying you were implying otherwise, but it could come across that way, so I'm just adding to it. If you are IO-bound, then it won't help. – Edy Bourne Jul 20 '20 at 16:53
2

You can get some speed-up, depending on the number and size of your files. See this answer to a similar question: Efficient file reading in python with need to split on '\n'

Essentially, you can read multiple files in parallel with multithreading, multiprocessing, or otherwise (e.g. an iterator)… and you may get some speedup. The easiest thing to do is to use a library like pathos (yes, I'm the author), which provides multiprocessing, multithreading, and other options in a single common API -- basically, so you can code it once, and then switch between different backends until you have what works the fastest for your case.

There are a lot of options for different types of maps (on the pool object), as you can see here: Python multiprocessing - tracking the process of pool.map operation.

While the following isn't the most imaginative of examples, it shows a doubly-nested map (equivalent to a doubly-nested for loop), and how easy it is to change the backends and other options on it.

>>> import pathos
>>> p = pathos.pools.ProcessPool()
>>> t = pathos.pools.ThreadPool()
>>> s = pathos.pools.SerialPool()
>>> 
>>> f = lambda x,y: x+y
>>> # two blocking maps, threads and processes
>>> t.map(p.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>> # two blocking maps, threads and serial (i.e. python's map)
>>> t.map(s.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>> # an unordered iterative and a blocking map, threads and serial
>>> t.uimap(s.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
<multiprocess.pool.IMapUnorderedIterator object at 0x103dcaf50>
>>> list(_)
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>> 

I have found that, generally, unordered iterative maps (uimap) are the fastest, but then you have to not care which order things are processed in, as the results may come back out of order. As far as speed goes, surround the above with a call to time.time or similar.
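
For example, a minimal timing harness could look like the following; the read_file helper and the glob pattern are assumptions, not part of pathos itself:

import glob
import time
import pathos

def read_file(path):
    # Hypothetical worker: read one file and return its contents.
    with open(path) as f:
        return f.read()

paths = glob.glob("*.txt")  # placeholder: whichever files you want to time

pool = pathos.pools.ThreadPool()
start = time.time()
results = pool.map(read_file, paths)
print("read %d files in %.3f seconds" % (len(results), time.time() - start))

Swapping ThreadPool for ProcessPool (or SerialPool) and re-timing is then a one-line change.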

Get pathos here: https://github.com/uqfoundation

Mike McKerns
1

In this case you can try to use multithreading. But keep in mind that, due to the Python GIL (Global Interpreter Lock), only one thread at a time executes Python bytecode, so CPU-bound work will not actually run in parallel. If you are running on multiple machines, it is possible that you will be faster. You can use something like a producer/worker setup:

  • Producer (one thread) will hold the file list and a queue
  • Worker (more than one thread) will collect the file information from the queue and push the content to the database

Look at the queues and pipes in multiprocessing (real, separate subprocesses) to sidestep the GIL.

With these two communication objects you can build some nice blocking or non-blocking programs; a rough sketch of the producer/worker setup follows below.

Side note: keep in mind not every db connection is thread-safe.
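
A rough sketch of that producer/worker idea using multiprocessing (the glob pattern, the worker count, and the print standing in for the database insert are all assumptions):

import glob
from multiprocessing import Process, Queue

def worker(queue):
    # Worker: pull file paths from the queue until the None sentinel arrives.
    while True:
        path = queue.get()
        if path is None:  # sentinel: no more work
            break
        with open(path) as f:
            for line in f:
                print(line.rstrip())  # replace with your database insert

if __name__ == "__main__":
    file_list = glob.glob("*.txt")  # placeholder: the files to process
    queue = Queue()
    workers = [Process(target=worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    for path in file_list:  # producer: put the file paths on the queue
        queue.put(path)
    for _ in workers:  # one sentinel per worker so they all stop
        queue.put(None)
    for w in workers:
        w.join()

Because these are separate processes rather than threads, each worker would typically need to open its own database connection.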

BlackJack
Uwe