
I have a series of large, flat text files that I need to parse in order to insert them into a SQL database. Each record spans multiple lines and consists of about a hundred fixed-length fields. I am trying to figure out how to parse them efficiently without loading the entire file into memory.

Each record starts with a numeric "1" as the first character on a new line (though not every line that starts with "1" is a new record) and terminates many lines later with a series of 20 spaces. While each field is fixed-width, each record is variable-length because it may or may not contain several optional fields. So I've been using "...20 spaces...\n1" as a record delimiter.
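
To illustrate (the field contents here are invented; only the markers matter), two consecutive records might look roughly like this:

1AAAA BBBB CCCC ...        <- first line of a record (starts with "1")
2DDDD EEEE ...             <- continuation line
1FFFF GGGG ...             <- starts with "1" but is NOT a new record
HHHH II                    <- last line ends in a run of 20 spaces
1JJJJ KKKK ...             <- next record begins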

I've been trying to work with something like this to process the file 1 KB (1024 bytes) at a time:

def read_in_chunks(file_object, chunk_size):
    # Lazily yield chunk_size bytes at a time until EOF.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

file = open('test.txt')
for piece in read_in_chunks(file, chunk_size=1024):
    pass  # Do stuff

However, the problem I'm running into is handling a record that spans multiple chunks. Am I overlooking an obvious design pattern? This problem seems like it should be fairly common. Thanks!

jamieb
  • Okay, I re-read the question... why not parse the file once, just as a read-through, to get the end-of-record positions of all records... then come back through, feeding those values to your chunk size? – RobotHumans Nov 15 '11 at 03:54
  • @aking1012: Thanks. That's the approach I've been working with since I wrote this question. However, reading a byte at a time takes forever and reading more than one byte at a time still leaves me with the problem of data spanning multiple chunks. I'm sure there's an obvious solution right in front of me. – jamieb Nov 15 '11 at 03:57
  • Yeah, if you read two chunks at a time... then problem solved. In C it would be implemented as a circular buffer. I haven't seen a fast circular buffer implementation for Python other than reading two chunks and winging it. If you go with two chunks at a time, using rstrip/lstrip and getting the size difference might be effective... but I haven't load-tested it. – RobotHumans Nov 15 '11 at 04:24
  • Two chunks won't work either, if a record can be longer than a single chunk. A 1026-byte record that starts on the last byte of a chunk will entirely span the next chunk and also the first byte of a *third* chunk. But reading the file in 1024-byte chunks is silly anyway, unless your system has *very* little memory. – rob mayoff Nov 15 '11 at 04:31
  • @robmayoff If you're only looking for the termination points... the record size doesn't matter any more; the size of the terminator does. Then you tune your chunk size to memory size. Just my opinion. As for how large the file is: when I hear "large flat text file to migrate to SQL", I think of a financial institution with a many-GB file. – RobotHumans Nov 15 '11 at 04:58
  • I see... but if it's many GB in size, why would I want to read the entire file twice to load it? – rob mayoff Nov 15 '11 at 04:59
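
For what it's worth, a buffered-chunk version along the lines RobotHumans describes might look roughly like this. It is only a sketch: it assumes the "...20 spaces...\n1" delimiter from the question, keeps the terminating spaces and newline with the record that just ended, and re-attaches the leading "1" to the record that follows. Because the buffer simply grows until a delimiter turns up, it also handles records longer than a single chunk.

def recordsFromChunks(fileObject, chunkSize=1024 * 1024):
    # Accumulate raw chunks in a buffer and split on the record delimiter,
    # carrying any incomplete tail over into the next read.
    delimiter = ' ' * 20 + '\n1'
    buffer = ''
    while True:
        chunk = fileObject.read(chunkSize)
        if not chunk:
            break
        buffer += chunk
        while True:
            index = buffer.find(delimiter)
            if index == -1:
                break
            # Yield everything up to and including the 20 spaces and the
            # newline; the "1" that follows starts the next record.
            yield buffer[:index + 21]
            buffer = buffer[index + 21:]
    if buffer:
        yield buffer

In practice the line-by-line answer below is simpler and lets Python handle the read buffering; the chunked version only pays off if the input really has to be read in large raw blocks.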

1 Answer

def recordsFromFile(inputFile):
    record = ''
    terminator = ' ' * 20
    for line in inputFile:
        # A new record starts when a line begins with "1" *and* the record
        # accumulated so far already ends with the 20-space terminator
        # (ignore the trailing newline when checking).
        if line.startswith('1') and record.rstrip('\r\n').endswith(terminator):
            yield record
            record = ''
        record += line
    # Don't yield an empty record if the file was empty.
    if record:
        yield record

inputFile = open('test.txt')
for record in recordsFromFile(inputFile):
    pass  # Do stuff

BTW, file is a built-in function. It's bad style to change its value.
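
Since every field is fixed-width, each yielded record can then be sliced apart by offset and handed straight to the database driver. A rough sketch of that step, using sqlite3 with made-up field offsets and a made-up table (the real records have around a hundred fields):

import sqlite3

# Hypothetical (name, start, end) offsets into the flattened record;
# the real layout has ~100 fixed-width fields.
FIELDS = [('record_type', 0, 1), ('account_id', 1, 11), ('amount', 11, 23)]

def fieldsFromRecord(record):
    # Drop the line breaks, then slice each fixed-width field out by offset
    # (assumes the offsets are defined against the record with newlines removed).
    flat = record.replace('\r', '').replace('\n', '')
    return tuple(flat[start:end].strip() for name, start, end in FIELDS)

db = sqlite3.connect('records.db')
db.execute('CREATE TABLE IF NOT EXISTS records (record_type TEXT, account_id TEXT, amount TEXT)')
with open('test.txt') as inputFile:
    db.executemany('INSERT INTO records VALUES (?, ?, ?)',
                   (fieldsFromRecord(r) for r in recordsFromFile(inputFile)))
db.commit()
db.close()

Because recordsFromFile is a generator, only one record needs to be held in memory at a time.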

rob mayoff