
I am using RAKE (Rapid Automatic Keyword Extraction algorithm) to generate keywords. I have around 53 million records, roughly 4.6 GB of data, and I want to know the best possible way to process them.

I have RAKE nicely wrapped up in a class, and the input is a 4.5 GB file consisting of 53 million records. Below are some of the approaches I tried.

Approach #1:

with open("~inputfile.csv") as fd:
   for line in fd:
      keywords = rake.run(line)
      write(keywords)

This is the basic brute-force way. Assuming that writing to the file takes time, invoking it 53 million times would be costly, so I tried the approach below, writing 100K lines to the file in one go.

Approach #2

with open("~inputfile.csv") as fd:
temp_string = ''
counter = 0
   for line in fd:
      keywords = rake.run(line)
      string = string + keywords + '\n'
      counter += 1
      if counter == 100000:
           write(string)
           string = ''

To my surprise, Approach #2 took more time than Approach #1. I don't get it! How is that possible? Also, can you suggest a better approach?

Approach #3 (thanks to cefstat)

with open("~inputfile.csv") as fd:
  strings = []
  counter = 0
  for line in fd:
    strings.append(rake.run(line))
    counter += 1
    if counter == 100000:
      write("\n".join(strings))
      write("\n")
      strings = []

Runs faster than Approaches #1 & #2.

Thanks in advance!

Siddarth
  • Python buffers file operations automatically; buffering them by hand is likely to produce inferior performance since Python's buffers are written in C. – Kevin Mar 09 '15 at 20:32
  • Approach 2 is what we call premature optimization. Normally, file I/O implementations already have a built-in memory cache, so this is not necessary. – Jason Hu Mar 09 '15 at 20:32
  • Strings are immutable -- this `string = string + keywords + '\n'` will be *very* expensive. – jedwards Mar 09 '15 at 20:32
  • Thanks for clarifying my doubt, guys. Can you suggest a better approach? – Siddarth Mar 09 '15 at 20:43

2 Answers


As mentioned in the comments, Python already buffers writes to files, so implementing your own buffering in Python (as opposed to C, where it is already done) is only going to make it slower. You can adjust the buffer size with the `buffering` argument in your call to `open()`.
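
For example, here is a rough sketch of relying on a larger write buffer instead of batching writes by hand (the buffer size and output file name are placeholders, and it assumes `rake.run()` returns a string, as the question's snippets imply):

BUFFER_SIZE = 8 * 1024 * 1024  # 8 MB write buffer; tune as needed

# open the output file once with a larger buffer; Python's C-level
# buffering then takes care of batching the actual disk writes
with open("keywords.txt", "w", BUFFER_SIZE) as outfile, open("~inputfile.csv") as infile:
    for line in infile:
        outfile.write(rake.run(line) + "\n")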

A different approach would be to read the file in chunks instead. The basic algorithm would be this:

  1. Iterate over the file using file.seek(x) where x = current position + desired chunk size
  2. While iterating, record the start and end byte positions of each chunk
  3. In a worker process (using multiprocessing.Pool()), read in the chunk using the start and end byte positions
  4. Each process writes to its own keyword file

  5. Reconcile the separate files (a small sketch of this step follows the list). You have a few options for doing this:

    • Read the keyword files back into memory into a single list
    • If on *nix, combine keyword files using the "cat" command.
    • If you're on Windows, you could keep a list of the keyword files (instead of one file path) and iterate through those files as needed
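
Here is a rough sketch of the reconciliation step, assuming each worker wrote its keywords to a file named something like keywords_0.txt, keywords_1.txt, and so on (those file names are just an assumption for illustration):

import glob

# concatenate the per-process keyword files into one combined file;
# on *nix the same thing can be done with: cat keywords_*.txt > all_keywords.txt
with open("all_keywords.txt", "w") as outfile:
    for part in sorted(glob.glob("keywords_*.txt")):  # hypothetical worker output files
        with open(part) as infile:
            for line in infile:
                outfile.write(line)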

There are numerous blogs and recipes out there for reading large files in parallel:

https://stackoverflow.com/a/8717312/2615940
http://aamirhussain.com/2013/10/02/parsing-large-csv-files-in-python/
http://www.ngcrawford.com/2012/03/29/python-multiprocessing-large-files/
http://effbot.org/zone/wide-finder.htm

Side note: I once tried doing the same thing and had the same result. It also doesn't help to outsource the file writing to another thread (at least it didn't when I tried).

Here's a code snippet that demonstrates the algorithm:

import functools
import multiprocessing

BYTES_PER_MB = 1048576

# stand-in for whatever processing you need to do on each line
# for demonstration, we'll just grab the first character of every non-empty line
def line_processor(line):
    try:
        return line[0]
    except IndexError:
        return None

# here's your worker function that executes in a worker process
def parser(file_name, start, end):

    with open(file_name) as infile:

        # get to proper starting position
        infile.seek(start)

        # use read() to force exactly the number of bytes we want
        lines = infile.read(end - start).split("\n")

    return [line_processor(line) for line in lines]

# this function splits the file into chunks and returns the start and end byte
# positions of each chunk
def chunk_file(file_name):

    chunk_start = 0
    chunk_size = 512 * BYTES_PER_MB # 512 MB chunk size

    with open(file_name) as infile:

        # we can't use the 'for line in infile' construct because infile.tell()
        # is not accurate during that kind of iteration

        while True:
            # move chunk end to the end of this chunk
            chunk_end = chunk_start + chunk_size
            infile.seek(chunk_end)

            # reading a line will advance the FP to the end of the line so that
            # chunks don't break lines
            line = infile.readline()

            # check to see if we've read past the end of the file
            if line == '':
                yield (chunk_start, chunk_end)
                break

            # adjust chunk end to ensure it didn't break a line
            chunk_end = infile.tell()

            yield (chunk_start, chunk_end)

            # move starting point to the beginning of the new chunk
            chunk_start = chunk_end

    return

if __name__ == "__main__":

    pool = multiprocessing.Pool()

    keywords = []

    file_name = ""  # enter your file name here

    # bind the file name argument to the parsing function so we don't have to
    # explicitly pass it every time
    new_parser = functools.partial(parser, file_name)

    # chunk out the file and launch the subprocesses in one step
    for keyword_list in pool.starmap(new_parser, chunk_file(file_name)):

        # as each list is available, extend the keyword list with the new one
        # there are definitely faster ways to do this - have a look at 
        # itertools.chain() for other ways to iterate over or combine your
        # keyword lists
        keywords.extend(keyword_list) 

    # now do whatever you need to do with your list of keywords
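
As the comment in the loop above hints, `itertools.chain.from_iterable()` is one way to avoid the explicit `extend()` loop; here is a sketch that reuses `pool`, `new_parser`, `chunk_file`, and `file_name` from the snippet above:

import itertools

# alternative to the extend() loop: flatten the per-chunk keyword lists
# into a single iterator instead of building one big combined list
all_keywords = itertools.chain.from_iterable(
    pool.starmap(new_parser, chunk_file(file_name)))

for keyword in all_keywords:
    pass  # do whatever you need to do with each keyword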
skrrgwasme
  • Thanks @skrrgwasme. If I specify 1 GB at `open()`, does that bring the complete 1 GB into memory? Also, if I chunk the file by size I may not get a complete line at the chunk boundary; the last line may be split across two chunks. Is there a way to avoid that from happening? – Siddarth Mar 09 '15 at 20:56
  • First question: no, `open()` alone doesn't bring anything into memory. That happens when you start reading. Second question: Yup. I'm working on a simplified MCVE that I will post in a moment. – skrrgwasme Mar 09 '15 at 20:58
  • I believe you have described a lazy function (chunking the file). But isn't `for line in fd:` in itself lazy? I may be wrong, just wanted to know the difference. – Siddarth Mar 09 '15 at 20:58
  • No, this algorithm isn't an example of lazy evaluation. It is an algorithm for dividing a file up into segments, where each segment is parsed in parallel. By the way, I just fixed a large typo - I was missing an important seek. I wouldn't consider it lazy because it doesn't wait until you request the keywords to produce them. Everything is parsed up front. – skrrgwasme Mar 09 '15 at 21:30
  • I apologize for not testing my snippet better before posting, but it should be good now. I've fixed a few more typos and added the usage of `functools.partial()` to reduce extra argument passing. – skrrgwasme Mar 09 '15 at 21:42
  • Thank you @skrrgwasme, I will try and understand the code. – Siddarth Mar 09 '15 at 21:49
  • With regard to a line splitting between two chunks, it gets a little messy. In `parser` you could move your start backwards until you find an EOL, and then move your end backwards as well. But you need to make sure that your file ends with your delimiter (EOL) so the final line is not thrown away. If double-processing a partial line and partial words is acceptable, then don't worry about moving your end. – user922020 Mar 09 '15 at 22:29
  • @user922020 The usage of the `readline()` function before marking the end of a chunk ensures that lines won't be split between chunks. You are correct that this method does strip newlines, which could be a problem if the OP's `rake.run` method needs them on each line. But if that is the case, adding them back in is trivial and the time necessary to do that is still minor compared to the speedup gained from processing the file in parallel. Either way, having one at the end of the file shouldn't really matter. – skrrgwasme Mar 09 '15 at 22:39

Repeatedly adding strings is very slow in Python (as mentioned by jedwards). You can try the following standard alternative, which should almost certainly be faster than #2 and which, in my limited testing, appears to be about 30% faster than Approach #1 (although perhaps not fast enough for your needs):

with open("~inputfile.csv") as fd:
  strings = []
  counter = 0
  for line in fd:
    strings.append(rake.run(line))
    counter += 1
    if counter == 100000:
      write("\n".join(strings))
      write("\n")
      strings = []
cefstat