
I read many files from my system. I want to read them faster, maybe like this:

results = []
for line in open("filenames.txt").readlines():
    # strip the trailing newline, or open() will fail on "name\n"
    results.append(open(line.strip(), "r").read())

I don't want to use threading. Any advice is appreciated.

The reason I don't want to use threads is that they would make my code unreadable; I want to find some tricky way to make it faster, with less code, that is easier to understand.

Yesterday I tested another solution with multiprocessing, but it performed badly and I don't know why. Here is the code:

def xml2db(file):
    s=pq(open(file,"r").read())
    dict={}
    for field in g_fields:
        dict[field]=s("field[@name='%s']"%field).text()
    p=Product()
    for k,v in dict.iteritems():
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()
@cost_time
@statistics_db
def batch_xml2db():
    from multiprocessing import Pool
    p = Pool(5)
    files = glob.glob(g_filter)
    p.map(xml2db, files)
    p.close()  # the pool must be closed before join(), otherwise join() raises an error
    p.join()
mlzboy
  • So you want to read it quicker than normal without using threads? Any program can only do one thing per thread at a time. – dutt Oct 06 '10 at 03:43
    @dutt: Actually, it can be made faster by using mechanisms like overlapped I/O in Windows and AIO in Linux (poorly-supported, I think), to let the OS reorder reads to match the disk layout and reduce seeking. I seriously doubt he actually wants to do that, though; it's much more complex and won't always make a big difference. – Glenn Maynard Oct 06 '10 at 04:02
  • Overlapped I/O... neat. You learn something every day. – dutt Oct 06 '10 at 04:15
  • @dutt the reason why i don't want to use threads is because it will make my code unreadable,i want to find so tricky way to make speed faster and code lesser,unstander easier – mlzboy Oct 06 '10 at 04:19
  • do you mean "the reason why I don't want to use threads is because it will make my code unreadable, I want to find _some_ tricky way to make _the program_ faster, _fewer lines of code_, and _easier to understand_ "? Currently the sentence falls apart at the end. WRT threaded code being unreadable - that is debatable. There are high-level constructs that make threaded code very readable as well as simple organizational techniques for traditional thread APIs. – Thomas M. DuBuisson Oct 06 '10 at 06:00
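
Following up on the overlapped I/O / AIO point in the comments above: a minimal sketch of hinting the kernel to prefetch the files before reading them. This assumes Linux and Python 3.3+ (os.posix_fadvise is not available elsewhere); it only asks the OS to read ahead, and whether it helps depends entirely on the disk and the file layout.

import os

def prefetch(filenames):
    # Ask the kernel to start reading these files in the background
    # (POSIX_FADV_WILLNEED); it is free to order the reads to reduce seeking.
    for name in filenames:
        fd = os.open(name, os.O_RDONLY)
        try:
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)  # length 0 = whole file
        finally:
            os.close(fd)

names = [line.strip() for line in open("filenames.txt")]
prefetch(names)
results = [open(name).read() for name in names]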

4 Answers

results = [open(f.strip()).read() for f in open("filenames.txt").readlines()]

This may be insignificantly faster, but it's probably less readable (depending on the reader's familiarity with list comprehensions).

Your main problem here is that your bottleneck is disk IO - buying a faster disk will make much more of a difference than modifying your code.

Zooba
  • It reduced the time from 805s to 714s, thanks, but I still want to find an even faster way. I also want to know why this style is quicker than my code. – mlzboy Oct 06 '10 at 05:21
  • @mlzboy, you must have a whole lot of little files here if you had that big of a speed-up from this. Is this code inside of a function? That could speed things up a little bit too. – Justin Peel Oct 06 '10 at 07:05
  • This is quicker because the loop is handled as part of the list comprehension, which is quite likely in native code rather than interpreted Python. Without threading you are unlikely to get any faster than this (using Python). The disk IO is still your bottleneck. – Zooba Oct 06 '10 at 08:57

Well, if you want to improve performance then improve the algorithm, right? What are you doing with all this data? Do you really need it all in memory at the same time, potentially running out of memory if filenames.txt specifies too many files, or files that are too large?

If you're doing this with lots of files I suspect you are thrashing, hence your 700s+ (roughly 12 minutes) time. Even my poor little HD can sustain 42 MB/s writes (42 MB/s * 714s ≈ 30GB). Take that with a grain of salt knowing you must read and write, but I'm guessing you don't have over 8 GB of RAM available for this application. A related SO question/answer suggested you use mmap, and the answer above that suggested an iterative/lazy read like what you get in Haskell for free. These are probably worth considering if you really do have tens of gigabytes to munge.
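
If the iterative/lazy route sounds right, here is a rough sketch of what it could look like with the filenames.txt layout from the question; the mmap part is optional and mainly pays off for large files.

import mmap
import os

def iter_contents(list_path="filenames.txt"):
    # Yield one file's contents at a time so the whole set never sits in memory.
    with open(list_path) as names:
        for name in names:
            name = name.strip()
            if not name:
                continue
            if os.path.getsize(name) == 0:
                yield b""  # mmap cannot map an empty file
                continue
            with open(name, "rb") as f:
                m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
                try:
                    yield m[:]  # copies the data out; slice m instead to stay lazy per file
                finally:
                    m.close()

for content in iter_contents():
    pass  # process each file here instead of collecting everything into a list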

Thomas M. DuBuisson

Is this a one-off requirement or something that you need to do regularly? If it's something you're going to be doing often, consider using MySQL or another database instead of a file system.

Richard Careaga
  • I'm extracting data from HTML and persisting it to many XML files, then I put the XML into a MySQL database. – mlzboy Oct 07 '10 at 04:06
  • Ah, then I think you will get a greater benefit from going from HTML to XML and straight into the database, without the intermediate step of writing to the file system. That saves both a write and a read. Alternatively, put your XML files under your webserver and read them in through a socket, rather than calling to the OS filesystem. – Richard Careaga Oct 11 '10 at 05:04
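
A rough sketch of that direct route, reusing the pq / Product / session / g_fields names from the question; the field selector and the session.add() call are assumptions, since the real HTML structure and ORM setup aren't shown.

def html2db(html_path):
    # Parse the HTML once and store straight to the database, skipping the
    # intermediate XML file (and its extra write + read) entirely.
    s = pq(open(html_path).read())
    p = Product()
    for field in g_fields:
        v = s("field[@name='%s']" % field).text()  # placeholder selector
        if v and v.strip() and hasattr(p, field):
            setattr(p, field, v)
    session.add(p)  # assumes a SQLAlchemy-style session
    session.commit()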

Not sure if this is still the code you are using. A couple of adjustments I would consider making:

Original:

def xml2db(file):
    s=pq(open(file,"r").read())
    dict={}
    for field in g_fields:
        dict[field]=s("field[@name='%s']"%field).text()
    p=Product()
    for k,v in dict.iteritems():
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()

Updated: remove the use of the dict; it is extra object creation, iteration, and collection.

def xml2db(file):
    s=pq(open(file,"r").read())

    p=Product()
    for k in g_fields:
        v=s("field[@name='%s']"%field).text()
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()

You could profile the code using the Python profiler.
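
For example, a minimal sketch with the standard cProfile module (note that with multiprocessing only the parent process is profiled, so the Pool workers' time won't show up here):

import cProfile
import pstats

cProfile.run("batch_xml2db()", "xml2db.prof")
stats = pstats.Stats("xml2db.prof")
stats.sort_stats("cumulative").print_stats(20)  # top 20 calls by cumulative time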

This might tell you where the time is being spent. It may be in session.commit(); if so, the commit may need to be reduced to one every couple of files.
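
A rough sketch of that batching idea, assuming a SQLAlchemy-style session and a hypothetical xml2db_no_commit variant that does everything xml2db does except the commit:

COMMIT_EVERY = 50  # arbitrary batch size; tune it

def batch_xml2db():
    files = glob.glob(g_filter)
    for i, f in enumerate(files, 1):
        xml2db_no_commit(f)  # hypothetical: xml2db minus the per-file session.commit()
        if i % COMMIT_EVERY == 0:
            session.commit()
    session.commit()  # flush whatever is left over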

I have no idea what the code does overall, so this is really a stab in the dark; you might try running it without sending or writing any output.

If you can separate your code into reading, processing, and writing, you can test the cost of each stage individually:

A) Reading cost: see how long it takes just to read all the files.

B) Processing cost: load a single file into memory and process it enough times to represent the entire job, without any extra reading IO.

C) Output cost: save a whole bunch of sessions representative of your job size.

This should show you what is taking the most time and whether any improvement can be made in any area.
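
A minimal sketch of timing one stage in isolation (here the read stage, using the filenames.txt input from the question; the same time.time() bracketing works for the processing and commit stages):

import time

def timed(label, func):
    # Run one stage and report how long it took.
    start = time.time()
    result = func()
    print("%s took %.1fs" % (label, time.time() - start))
    return result

names = [line.strip() for line in open("filenames.txt")]
raw = timed("A) reading", lambda: [open(n, "rb").read() for n in names])
# B) processing: time the parsing step on data already in memory
# C) output: time session.commit() on a representative batch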

kevpie