
I have a large file (1.6 gigs) with millions of rows that has columns delimited with:

[||]

I have tried to use the csv module, but it says I can only use a single character as a delimiter. So here is what I have:

fileHandle = open('test.txt', 'r', encoding="UTF-16")

thelist = []

for line in fileHandle:
    fields = line.split('[||]')

    therow = {
        'dea_reg_nbr':fields[0],
        'bus_actvty_cd':fields[1],
        'drug_schd':fields[3],
        #50 more columns like this
    }
    thelist.append(therow)

fileHandle.close()

#now I have thelist which is what I want

And boom, now I have a list of dictionaries and it works. I want a list because I care about the order, and the dictionaries because that's what's expected downstream. It just feels like I should be taking advantage of something more efficient. I don't think this scales well with over a million rows and so much data. So, my question is as follows:

What would be the more efficient way of taking a multi-character delimited text file (UTF-16 encoded) and creating a list of dictionaries?

Any thoughts would be appreciated!

sniperd
  • Just so you know, dictionaries are ordered in Python 3.6 ;) – tupui Aug 01 '17 at 15:15
  • One obvious improvement is to use a `generator` and `yield` each row instead of building a `list`, but this isn't a viable improvement if you really need the output as a `list`. – Tom Wyllie Aug 01 '17 at 15:15
  • This could be a good day to start working with pandas. – Igor Pejic Aug 01 '17 at 15:17
  • When you don't think the data will fit in RAM, put it in a DB. And depending on how you want to process it, if you want to process it in parallel you can also use `pySpark`. – Vikash Singh Aug 01 '17 at 15:22
  • Speaking of memory usage, a `namedtuple` or a simple class with `__slots__` will save some bytes by avoiding the creation of one `dict` per row; see the sketch after this comment thread. See [this answer](https://stackoverflow.com/a/28059785/1202745) – robyschek Aug 01 '17 at 15:33
  • @TomWyllie: Often people create a list when what they actually need is an iterable... If OP later processes one dictionary at a time, the generator will be perfectly fine and highly scalable. – Serge Ballesta Aug 01 '17 at 15:42
  • There are some great ideas here. I'm digging it! I'm a little stuck with some pre-existing code so I can't move to pandas, but for my next project it will be built from the ground up, so starting there will be a real option. I like these pySpark, generator, etc. ideas. This is exactly what I'm looking for! – sniperd Aug 01 '17 at 17:07
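
As a rough illustration of robyschek's comment (not part of the original question or answers), here is what a `__slots__`-based row could look like. The class name Row is made up, only three of the roughly 53 columns are shown, and this only helps if the downstream code can be adapted to attribute access instead of dictionary keys:

class Row:
    # Declaring __slots__ gives instances a compact fixed layout instead of a
    # per-instance __dict__, which saves memory when there are millions of rows.
    __slots__ = ('dea_reg_nbr', 'bus_actvty_cd', 'drug_schd')  # plus the other columns

    def __init__(self, dea_reg_nbr, bus_actvty_cd, drug_schd):
        self.dea_reg_nbr = dea_reg_nbr
        self.bus_actvty_cd = bus_actvty_cd
        self.drug_schd = drug_schd

A collections.namedtuple with the same field names would give a similar saving with less code.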

1 Answer


One way to make it scale better is to use a generator instead of loading all million rows into memory at once. This may or may not be possible depending on your use case; it works best if you only need to make one pass over the full data set. Multiple passes will require you either to keep all the data in memory in some form or to read the file multiple times.

Anyway, here's an example of how you could use a generator for this problem:

def file_records():
    with open('test.txt', 'r', encoding='UTF-16') as fileHandle:
        for line in fileHandle:
            fields = line.split('[||]')
            therow = {
                'dea_reg_nbr':fields[0],
                'bus_actvty_cd':fields[1],
                'drug_schd':fields[3],
                #50 more columns like this
            }
            yield therow

for record in file_records():
    pass  # do work on one record

The function file_records is a generator function because of the yield keyword. When this function is called, it returns an iterator that you can iterate over exactly like a list. The records will be returned in order, and each one will be a dictionary.

If you're unfamiliar with generators, this is a good place to start reading about them.

The thing that makes this scale so well is that you will only ever have one therow dictionary in memory at a time. Essentially, at the beginning of every iteration of the loop, the file_records generator reads the next line of the file and yields the computed record. It waits until the next row is needed before doing the work, and the previous record won't linger in memory unless it's needed (such as if it's referenced in whatever data structure you build in # do work on one record).
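
For example (this snippet is not from the original answer), the standard library's itertools.islice lets you pull just the first few records without ever reading the rest of the 1.6 GB file, which would be impossible if the whole list were built up front:

import itertools

# Only the first five lines of the file are read and parsed;
# the generator is simply never resumed after that.
for record in itertools.islice(file_records(), 5):
    print(record['dea_reg_nbr'])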

Note also that I moved the open call to a with statement. This will ensure that the file gets closed and all related resources are freed once the iteration is done or an exception is raised. This is much simpler than trying to catch all those cases yourself and calling fileHandle.close().
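
For comparison (my paraphrase, not the original answer's code), the with statement is roughly shorthand for the manual version below:

fileHandle = open('test.txt', 'r', encoding='UTF-16')
try:
    for line in fileHandle:
        ...  # parse the line as in the generator above
finally:
    # Runs whether the loop finishes normally or an exception is raised.
    fileHandle.close()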

W. Marshall