40

I have multiple 3 GB tab-delimited files, with 20 million rows in each file. All the rows have to be processed independently; there is no relation between any two rows. My question is: what will be faster?

  1. Reading line-by-line?

    with open(file_path) as infile:
        for line in infile:
            # process each line here

  2. Reading the file into memory in chunks and processing it, say 250 MB at a time?

The processing is not very complicated; I am just grabbing the value in column1 into List1, column2 into List2, etc. I might also need to add some column values together.
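
Roughly, the per-line work is something like this (a minimal sketch; the file name, column positions, and the summing step here are only illustrative):

    list1, list2 = [], []
    with open("data.tsv") as infile:  # placeholder file name
        for line in infile:
            cols = line.rstrip("\n").split("\t")
            list1.append(cols[0])
            list2.append(cols[1])
            # sometimes something like: total = float(cols[2]) + float(cols[3])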

I am using Python 2.7 on a Linux box that has 30 GB of memory. The files are ASCII text.

Is there any way to speed things up in parallel? Right now I am using the first method and the process is very slow. Would using a CSV reader module help? I don't have to do it in Python; ideas for other languages or for using a database are welcome.

Reise45
  • multiprocessing; chunked iterative reading. At 3GB per file you **DO NOT** want to be reading this entirely into memory; you may blow your memory resources. – James Mills May 18 '15 at 02:12
  • It sounds like a database would help you out depending on the type of processing that you are doing. – squiguy May 18 '15 at 02:16
  • Not if this is a single-throw-away task; data-in; processing; data-out; delete source data. – James Mills May 18 '15 at 02:21
  • Is your code I/O-bound, or CPU-bound? In other words, does the processing take more time than the reading? If so, you can probably speed it up with multiprocessing; if not, your background processes are just going to spend all their time waiting on the next read and you'll get no benefit. – abarnert May 18 '15 at 02:21
  • Did you check whether your slowness is in processing or in reading? How fast is it if the only thing in your `for` loop is `pass`? Without checking, you might be trying to optimise the wrong thing. – Amadan May 18 '15 at 02:22
  • @abarnert Makes a very good and clear point here. Your solution is going to depend on whether your problem is I/O or CPU bound. Although at first glance it looks like it might be I/O bound :) – James Mills May 18 '15 at 02:23
  • Meanwhile, `for line in infile:` already does decent buffering inside the `io` module code (in Python 3.1+) or inside the C stdio underneath (in Python 2.x), so unless you're using Python 3.0, it should be fine. But if you want to force it to use larger buffers, you can always loop over, say, `infile.readlines(65536)` and then loop over the lines within each chunk. – abarnert May 18 '15 at 02:23
  • Also, it probably makes a big difference whether this is 2.x or 3.x, which 3.x version if 3.x, what platform you're on, and whether this is ASCII text or something that really needs to be decoded, so please add that information. – abarnert May 18 '15 at 02:24
  • Please also add some details about the kinds of "processing" that's performed on the datasets. – James Mills May 18 '15 at 02:25
  • @abarnert "decent" at best. if s/he had plenty of memory and didn't care about the 3GB hit, s/he could do `for line in infile.readlines():` which will be much faster to iterate over than the file object itself – the_constant May 18 '15 at 02:41
  • @Vincenzzzochi agreed; assuming one could cope with such a massive hit on memory! Bear in mind that this will not consume 3GB of memory but much, much more! – James Mills May 18 '15 at 02:43
  • @Vincenzzzochi: Using `readlines()` is almost always going to be slower than looping over `readlines(bufsize)` with a large but not 3GB buffer, because you can't read any faster than some maximum size at a time, so beyond that all you're doing is adding unnecessary memory allocation to the mix, plus VM page misses. – abarnert May 18 '15 at 02:45
  • @Reise45 Could you please show some of your code that's "doing the actual processing"? Being vague about what it's doing isn't that helpful. – James Mills May 18 '15 at 02:48
  • @JamesMills yes, forgot to mention it's more than a 3Gb hit, which is an important point.. thanks! either way, we're stuck between a bottleneck of CPU, or a bottleneck of memory, or a bottleneck of time. what kind of CPU's are we developing this for? if they're dual core, subprocesses take a fairly big hit. how big is the application/program this is in? is there a dire need for speed? to be brutally honest, if you want a language that is "fast", it's not going to be python... – the_constant May 18 '15 at 02:49
  • @Vincenzzzochi Actually I've personally had a lot of experience dealing with processing "Big Data" using Python and it fares quite well if you design your solutions correctly; again depending on the nature of your problem, CPU bound vs. I/O bound or a bit of both. Python **isn't** really that slow :) – James Mills May 18 '15 at 02:51
  • @Vincenzzzochi What other language would you suggest to try this work in? Thanks – Reise45 May 18 '15 at 02:54
  • @Reise45 The language choice isn't really your problem here; rather how you're managing I/O and how you're delegating the work (*CPU Bound parts*). – James Mills May 18 '15 at 02:57
  • As an aside; if Python were uselessly slow at Big Data analysis-type tasks, it would not be so widely used in the scientific community and various research projects :) – James Mills May 18 '15 at 02:58
  • @JamesMills: Of course in the scientific community, you're often processing things diagonally rather than sequentially, and it may be acceptable to say "we have 18GB of data? then let's get 32GB of RAM" because, unlike other big-data uses, you're not speccing out dozens of servers, just one workstation… But of course Python is also used in plenty of big-data server-type uses, too, so your point is definitely valid. – abarnert May 18 '15 at 03:02
  • @Reise45 what other languages do you know? are you okay with compiled languages (i.e. slower to initialize)? etc... in terms of fastest, there's C, which one implementation of Python uses as its underlying mechanism; C++ is faster, Java is likely to be faster, etc., but you're talking lower-level languages compared to Python – the_constant May 18 '15 at 03:04
  • @abarnert to your comment earlier: yes, `readlines()` is slower IF your computer can't allocate the necessary memory for it at once, but if we're talking about having like 64Gb of RAM, then that's a non-factor. also, if you're often processing things in parallel instead of sequential (i think you meant?), then once again, you probably don't want python... – the_constant May 18 '15 at 03:10
  • C# is quite similar in runtime/compile-time performance to that of Java. So probably. The trade-off here ofc is development time and compile time. – James Mills May 18 '15 at 03:10
  • You *could* always just use PyPy as your alternative Python implementation here; **BUT** your problem is I/O bound, not CPU bound, so PyPy is unlikely to speed things up that much; but it would certainly be faster than using CPython. – James Mills May 18 '15 at 03:11
  • @Reise45 piggy-backing off of JamesMills, yes, C# is comparable to Java. There will be a slower program/app initialization time compared to the python app, but you benefit a lot once it's running. Compiled versus Interpreted. It also is harder and takes longer to develop. However, it handles threading in like, a world better of performance than python as python's GIL is a deal breaker (keep in mind, I'm discussing **threading** and not **multiprocessing**) – the_constant May 18 '15 at 03:14
  • Off the back of @Vincenzzzochi's comment; The Python GIL is all but gone in the [PyPy-STM](http://pypy.readthedocs.org/en/latest/stm.html) implementation so in the future hopefully we can all *look forward* to a Python implementation with better multi-threading :) – James Mills May 18 '15 at 03:17
  • @Vincenzzzochi: Even if you have 64GB of RAM, `readlines()` is _still_ usually slower than iterating over `readlines(bufsize)`, because you still need to malloc 12GB instead of 1MB, and you may well have to page-fault millions of times instead of 0. Also, of course, you have to do all the I/O first, then all the processing, instead of being able to pipeline the two. And that's ignoring the fact that your 64GB may be NUMA, while your 1MB may fit in a local cache, etc. – abarnert May 18 '15 at 03:17
  • @Vincenzzzochi Is the benefit from C# that Parallel.ForEach can be used to complete the processing of chunks together and thus cut down on total run time? – Reise45 May 18 '15 at 03:28

2 Answers

41

It sounds like your code is I/O bound. This means that multiprocessing isn't going to help—if you spend 90% of your time reading from disk, having an extra 7 processes waiting on the next read isn't going to help anything.

And, while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.
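
(For reference, a minimal sketch of what that looks like with the stdlib csv module on Python 2; the path and list names are placeholders:)

import csv

col1, col2 = [], []
with open(path, "rb") as infile:  # binary mode for the csv module on Python 2
    reader = csv.reader(infile, delimiter="\t")
    for row in reader:
        col1.append(row[0])
        col2.append(row[1])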

Still, it's worth checking that you really are I/O bound, instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% of a core. Do what Amadan suggested in a comment, and run your program with just `pass` for the processing and see whether that cuts off 5% of the time or 70%. You may even want to try comparing with a loop over os.open and os.read(1024*1024) or something and see if that's any faster.
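
For example, a raw-read baseline might look something like this (a sketch; path is a placeholder, and the 1MB buffer is just a reasonable starting point):

import os
import time

t0 = time.time()
fd = os.open(path, os.O_RDONLY)
try:
    while True:
        buf = os.read(fd, 1024 * 1024)  # read 1MB at a time, no line splitting, no processing
        if not buf:
            break
finally:
    os.close(fd)
print("raw read time: %.1fs" % (time.time() - t0))

If that takes nearly as long as your full script, you're I/O bound; if it finishes much faster, the time is going into the line splitting and processing instead.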


Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) for some large bufsize. (You can try different numbers and measure them to see where the peak is. In my experience, usually anything from 64K-8MB is about the same, but depending on your system that may be different—especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency that swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)

So, for example:

bufsize = 65536
with open(path) as infile: 
    while True:
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)

Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be better, depending on your system. For example:

import mmap

with open(path, "rb") as infile:  # binary mode; mmap.mmap() takes a file descriptor, not the file object
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)

A Python mmap is sort of a weird object—it acts like a str and like a file at the same time, so you can, e.g., manually iterate scanning for newlines, or you can call readline on it as if it were a file. Both of those will take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python… although maybe you can get around that with re, or with a simple Cython extension?)… but the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
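
For example, a readline-style loop over the mapping might look like this (a sketch; path and process are the same placeholders as in the examples above):

import mmap

with open(path, "rb") as infile:  # binary mode; the mapping hands back raw bytes
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        line = m.readline()  # returns an empty string at EOF
        while line:
            process(line)
            line = m.readline()
    finally:
        m.close()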

Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of making the kernel guess, or forcing transparent huge pages)—but you can actually ctypes the function out of libc.
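
If you do want to poke at this from Python anyway, a simpler cousin of madvise is posix_fadvise, which takes a file descriptor rather than a mapping address, so it's much easier to reach through ctypes. This is only a sketch, assuming 64-bit Linux/glibc; the constant value and the off_t typing are assumptions to verify against your own headers, and note that it's a hint on the file descriptor rather than madvise on the mapping itself:

import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
# declare the prototype explicitly; off_t is a 64-bit long on 64-bit Linux (assumption)
libc.posix_fadvise.argtypes = [ctypes.c_int, ctypes.c_long, ctypes.c_long, ctypes.c_int]
libc.posix_fadvise.restype = ctypes.c_int

POSIX_FADV_SEQUENTIAL = 2  # value from the Linux headers (assumption; check <fcntl.h> on your box)

def advise_sequential(fileobj):
    # tell the kernel we plan to read this file sequentially, so it can read ahead aggressively
    err = libc.posix_fadvise(fileobj.fileno(), 0, 0, POSIX_FADV_SEQUENTIAL)
    if err != 0:
        raise OSError(err, os.strerror(err))

(For what it's worth, later Pythons expose these directly: os.posix_fadvise appeared in 3.3 and mmap.madvise in 3.8, but neither helps on 2.7.)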

abarnert
  • I have 30 GB of memory on the Linux box. Is there any problem in doing a readlines() to take the entire file into memory? – Reise45 May 18 '15 at 13:59
  • @Reise45: It depends on what you mean by "problem". It should _work_; `readlines` on a 3GB file should take under 4GB, and if you also pre-process all the lines into lists of values in memory, that shouldn't be more than maybe 12GB, so you're still within comfortable limits. But it means you have to do all the reading up front, so the OS can't help pipeline your I/O waiting and your CPU work; you waste time on malloc and cache faults; etc. If there were some benefit (e.g., it let you use NumPy to speed up a slow processing loop), that might be worth it, but if not, why do it? – abarnert May 18 '15 at 19:20
  • @Reise45: Meanwhile, if you have lots of these files, and each one takes, say, 25 minutes the way you're currently doing it, just try doing one of them the other way and see if it finishes in 15 minutes or you have to cancel it after an hour; that'll tell you a lot more than you can get by guessing. – abarnert May 18 '15 at 19:22
  • I am reading using a buffer size but the script still gets killed as mem% gets to 100%. How do I prevent that? Do I need to fix the data structure I'm reading the data into? – Reise45 May 19 '15 at 18:59
  • @Reise45 If you're progressively building up a data structure that's too big to fit into 30GB of RAM, then yes, that's your problem. Without knowing more about your code it's hard to say anything more specific. – abarnert May 19 '15 at 19:22
  • @abarnert Yes I am. I am trying to write out all the parts that I do not need for further calculations. That hopefully frees up enough memory. I noticed CPU% never goes over 15-20%. How do I utilize that better? I am assuming the memory use will stay the same and the runtime will come down if I can do that? – Reise45 May 21 '15 at 13:46
  • Let's say I have 16GB of RAM and the file is 18GB. I understand that I can create chunks and process them, but is there a better way to achieve the same thing? – Puneet Tripathi Sep 27 '17 at 09:14
8

I know this question is old, but I wanted to do a similar thing, so I created a simple framework which helps you read and process a large file in parallel. I'm leaving what I tried as an answer.

This is the code; I give an example at the end.

import gc
import multiprocessing as mp
import os
import time

import psutil


def chunkify_file(fname, size=1024*1024*1000, skiplines=-1):
    """
    function to divide a large text file into chunks, each of size ~= size, so that the chunks are line-aligned

    Params : 
        fname : path to the file to be chunked
        size : size of each chunk is ~> this
        skiplines : number of lines at the beginning to skip, -1 means don't skip any lines
    Returns : 
        list of (start position in bytes, chunk size in bytes, fname) tuples
    """
    chunks = []
    fileEnd = os.path.getsize(fname)
    with open(fname, "rb") as f:
        if(skiplines > 0):
            for i in range(skiplines):
                f.readline()

        chunkEnd = f.tell()
        count = 0
        while True:
            chunkStart = chunkEnd
            f.seek(f.tell() + size, os.SEEK_SET)
            f.readline()  # make this chunk line aligned
            chunkEnd = f.tell()
            chunks.append((chunkStart, chunkEnd - chunkStart, fname))
            count+=1

            if chunkEnd > fileEnd:
                break
    return chunks

def parallel_apply_line_by_line_chunk(chunk_data):
    """
    function to apply a function to each line in a chunk

    Params :
        chunk_data : the data for this chunk 
    Returns :
        list of the non-None results for this chunk
    """
    chunk_start, chunk_size, file_path, func_apply = chunk_data[:4]
    func_args = chunk_data[4:]

    t1 = time.time()
    chunk_res = []
    with open(file_path, "rb") as f:
        f.seek(chunk_start)
        cont = f.read(chunk_size).decode(encoding='utf-8')
        lines = cont.splitlines()

        for line in lines:
            ret = func_apply(line, *func_args)
            if ret is not None:
                chunk_res.append(ret)
    return chunk_res

def parallel_apply_line_by_line(input_file_path, chunk_size_factor, num_procs, skiplines, func_apply, func_args, fout=None):
    """
    function to apply a supplied function line by line in parallel

    Params :
        input_file_path : path to input file
        chunk_size_factor : size of 1 chunk in MB
        num_procs : number of parallel processes to spawn, max used is num of available cores - 1
        skiplines : number of top lines to skip while processing
        func_apply : a function which expects a line and outputs None for lines we don't want processed
        func_args : arguments to function func_apply
        fout : optional file object; if given, processed lines are printed to it instead of being collected
    Returns :
        list of the non-None results obtained by processing each line
    """
    num_parallel = min(num_procs, psutil.cpu_count()) - 1

    jobs = chunkify_file(input_file_path, 1024 * 1024 * chunk_size_factor, skiplines)

    jobs = [list(x) + [func_apply] + func_args for x in jobs]

    print("Starting the parallel pool for {} jobs ".format(len(jobs)))

    lines_counter = 0

    pool = mp.Pool(num_parallel, maxtasksperchild=1000)  # without maxtasksperchild, something weird happens and memory blows up as the processes keep lingering

    outputs = []
    for i in range(0, len(jobs), num_parallel):
        print("Chunk start = ", i)
        t1 = time.time()
        chunk_outputs = pool.map(parallel_apply_line_by_line_chunk, jobs[i : i + num_parallel])

        for i, subl in enumerate(chunk_outputs):
            for x in subl:
                if(fout != None):
                    print(x, file=fout)
                else:
                    outputs.append(x)
                lines_counter += 1
        del(chunk_outputs)
        gc.collect()
        print("All Done in time ", time.time() - t1)

    print("Total lines we have = {}".format(lines_counter))

    pool.close()
    pool.terminate()
    return outputs

Say, for example, I have a file in which I want to count the number of words in each line; then the processing of each line would look like:

def count_words_line(line):
    return len(line.strip().split())

and then call the function like:

parallel_apply_line_by_line(input_file_path, 100, 8, 0, count_words_line, [], fout=None)

Using this, I get a speed-up of ~8 times compared to vanilla line-by-line reading on a sample file of ~20GB, where I do some moderately complicated processing on each line.

Deepak Saini
  • doesn't this method leave you with a potential case where a line is broken at a 100-byte chunk boundary and the other part is counted as a different line? When you split files into byte chunks, you never know where the current line will be broken in order to meet that space requirement – edo101 Jun 03 '20 at 23:38
  • there is a ``readline()`` to seek the file pointer to the line end so that you get line-aligned chunks – Deepak Saini Jun 04 '20 at 04:23
  • does the chunk thing matter if you are reading the file as binary? If you do 'rb', doesn't that negate \n? And if that's the case, do you still need to worry about chunks of the file being cut off? – edo101 Jun 04 '20 at 21:43