
My first post:

Before beginning, I should note that I am relatively new to OOP, though I have done DB/stat work in SAS, R, etc., so my question may not be well posed; please let me know if I need to clarify anything.

My question:

I am attempting to import and parse large CSV files (~6MM rows, with larger likely to come). The two limitations I've run into repeatedly are runtime and memory (32-bit implementation of Python). Below is a simplified version of my neophyte (nth) attempt at importing and parsing in reasonable time. How can I speed up this process? Because of the memory limits, I split the file as I import it and perform interim summaries, using pandas for the summarization:

Parsing and Summarization:

import pandas as pd


def ParseInts(inString):
    # Return the integer value, or None if the field is blank or malformed.
    try:
        return int(inString)
    except ValueError:
        return None

def TextToYearMo(inString):
    # Convert a 'YYYY-MM-DD'-style string to an integer YYYYMM.
    try:
        return 100*int(inString[0:4]) + int(inString[5:7])
    except ValueError:
        # Fall back for single-digit months (e.g. 'YYYY-M-DD').
        return 100*int(inString[0:4]) + int(inString[5:6])

def ParseAllElements(elmValue, elmPos):
    # Columns 0, 2, and 5 stay as text; column 3 is a date; the rest are integers.
    if elmPos in [0, 2, 5]:
        return elmValue
    elif elmPos == 3:
        return TextToYearMo(elmValue)
    elif elmPos == 18:
        return ParseInts(elmValue.strip('\n'))
    else:
        return ParseInts(elmValue)

def MakeAndSumList(inList):
    # Build a DataFrame from the parsed rows and collapse it to interim sums
    # keyed on the first five columns.
    df = pd.DataFrame(inList, columns=['x1', 'x2', 'x3', 'x4', 'x5',
                                       'x6', 'x7', 'x8', 'x9', 'x10',
                                       'x11', 'x12', 'x13', 'x14'])
    return df.groupby(['x1', 'x2', 'x3', 'x4', 'x5']).sum().reset_index()

Function Calls:

import math
import pandas as pd

# 'pse' is the module holding the parsing/summarization functions above.

def ParsedSummary(longString, delimtr, rowNum):
    keepColumns = [0, 3, 2, 5, 10, 9, 11, 12, 13, 14, 15, 16, 17, 18]

    #Do some other stuff that takes very little time

    fields = longString.split(delimtr)   # split each row once, not once per kept column
    return [pse.ParseAllElements(fields[i], i) for i in keepColumns]

def CSVToList(fileName, delimtr=','):
    with open(fileName) as f:
        # Materialize every (row number, line) pair so each chunk can be
        # pulled out of it below.
        listEnumFile = set(enumerate(f))
        lineCount = len(listEnumFile) - 1        # index of the last row

        maxSplit = math.floor(lineCount / 10) + 1

        Summary = pd.DataFrame({}, columns=['x1', 'x2', 'x3', 'x4', 'x5',
                                            'x6', 'x7', 'x8', 'x9', 'x10',
                                            'x11', 'x12', 'x13', 'x14'])
        # Parse and summarize the file in ten chunks to stay within 32-bit memory.
        for counter in range(10):
            startRow     = int(counter * maxSplit)
            endRow       = int((counter + 1) * maxSplit)
            includedRows = set(range(startRow, endRow))

            listOfRows = [ParsedSummary(row, delimtr, rownum)
                          for rownum, row in listEnumFile if rownum in includedRows]
            Summary = pd.concat([Summary, pse.MakeAndSumList(listOfRows)])

            listOfRows = []   # drop the chunk before the next one is built
    return Summary

(Again, this is my first question - so I apologize if I simplified too much or, more likely, too little, but I am at a loss as to how to expedite this.)

For runtime comparison:

Using Access I can import, parse, summarize, and merge several files in this size range in under 5 minutes (though I am right at its 2GB limit). I'd hope I can get comparable results in Python - presently I'm estimating a ~30 minute run time for one file. Note: I threw something together in Access' miserable environment only because I didn't have admin rights readily available to install anything else.

Edit: Updated the parsing code. I was able to shave off five minutes (estimated runtime now ~25 minutes) by changing some conditional logic to try/except. Also - the runtime estimate doesn't include the pandas portion - I'd forgotten I'd commented that out while testing, but its impact seems negligible.

riemannh
  • This is not a Python option, but have you considered the data.table package in R? – jazzurro Dec 06 '14 at 03:17
  • I haven't. This is sort of a downtime project for me at work - I'd like to create an executable post-processing program that handles our simulation output. That is, something that won't require our other forecasters to install any other software on their computers. We have a very limited environment for security reasons. – riemannh Dec 06 '14 at 04:12
  • possible duplicate of [Reading a huge .csv in python](http://stackoverflow.com/questions/17444679/reading-a-huge-csv-in-python) – Jason S Dec 06 '14 at 18:16
  • or http://stackoverflow.com/questions/9087039/most-efficient-way-to-parse-a-large-csv-in-python – Jason S Dec 06 '14 at 18:17
  • Correct me if I'm wrong, but it appears the first covers only reading CSVs (I'll check out their method to see if it improves my read-time). The second recommends using the csv package for parsing, but I need to improve upon the parsing time that package seems to provide (unless I'm missing something...). Thanks for the pointers, though. – riemannh Dec 06 '14 at 19:25
  • I haven't implemented either of the two suggestions, but the first will result in by-row data that I will then need to join into a list or DF - I've had problems with this as list = list + newrow is very slow (the two patterns are sketched just after these comments), hence the list comprehension mess above. Perhaps there is a better way? The second appears to, perhaps, have the same issue with the use of the .append function. I will still try adapting my code to use these methods and verify I haven't missed something. Thanks again. – riemannh Dec 06 '14 at 19:32
  • If you need high performance, perhaps Python is not the correct choice? – Mark Reed Dec 07 '14 at 01:42
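To illustrate the slowdown mentioned in the comments above: rebinding with list = list + newrow copies the entire Python list on every row, while list.append grows the same list in place. The function names below are purely hypothetical; this is a minimal sketch of the two accumulation patterns, not code from the question.

def accumulate_with_append(parsed_rows):
    # O(1) amortized per row: the same list object grows in place.
    rows = []
    for row in parsed_rows:
        rows.append(row)
    return rows

def accumulate_with_concatenation(parsed_rows):
    # O(n) per row: 'rows + [row]' builds a brand-new list each time,
    # so the total work grows quadratically with the number of rows.
    rows = []
    for row in parsed_rows:
        rows = rows + [row]
    return rows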

1 Answer


If you want to optimize performance, don't roll your own CSV reader in Python. There is already a standard csv module. Perhaps pandas or numpy have faster csv readers; I'm not sure.
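For comparison, here is a rough, untested sketch of what the csv-module route could look like for this file, reusing ParseAllElements and MakeAndSumList from the question; the 600,000-row chunk size is an arbitrary assumption.

import csv
import pandas as pd

def CSVToSummary(fileName, delimtr=','):
    keepColumns = [0, 3, 2, 5, 10, 9, 11, 12, 13, 14, 15, 16, 17, 18]
    chunk, interim = [], []
    with open(fileName) as f:
        # csv.reader does the field splitting (including quoted fields) in C.
        for fields in csv.reader(f, delimiter=delimtr):
            chunk.append([ParseAllElements(fields[i], i) for i in keepColumns])
            if len(chunk) >= 600000:             # arbitrary chunk size
                interim.append(MakeAndSumList(chunk))
                chunk = []
    if chunk:
        interim.append(MakeAndSumList(chunk))
    # Summing the interim sums gives the same totals as summing the whole file at once.
    return pd.concat(interim).groupby(['x1', 'x2', 'x3', 'x4', 'x5']).sum().reset_index()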

From https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file:

In short, pandas.io.parsers.read_csv beats everybody else, NumPy's loadtxt is impressively slow and NumPy's from_file and load impressively fast.
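To show roughly what that could look like under the memory limits in the question, here is a minimal, untested sketch of a chunked pandas.read_csv pass; the file name, column positions, group keys, and 600,000-row chunksize are all assumptions, not anything taken from the linked post.

import pandas as pd

# Hypothetical layout: keep 14 of the file's columns and group on the first five.
keep_cols  = [0, 2, 3, 5, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
col_names  = ['x%d' % (i + 1) for i in range(len(keep_cols))]
group_cols = ['x1', 'x2', 'x3', 'x4', 'x5']

pieces = []
reader = pd.read_csv('simulation_output.csv',        # hypothetical file name
                     header=None, usecols=keep_cols,
                     chunksize=600000)                # arbitrary chunk size
for chunk in reader:
    chunk.columns = col_names
    pieces.append(chunk.groupby(group_cols).sum().reset_index())

# Summing the interim sums gives the same totals as one pass over the whole file.
summary = pd.concat(pieces).groupby(group_cols).sum().reset_index()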

Jason S
  • Having trouble logging in, but I do recall that I used parse_dates=3 when trying this - I'll retry without this and try the parsing using the function above (post import) instead. – riemannh Dec 06 '14 at 19:53
  • I tried this import method again - parse_dates seems to be the problem. Importing this way is very fast, and as long as I use the TextToYearMo function above (post-import) I can parse the dates as I need, in reasonable time, too. I'll have to dig into how pandas wrote their import function, and why it doesn't have the memory problems other methods seem to (if I can understand it), to see what I was doing wrong. Thank you for suggesting I revisit this method! – riemannh Dec 06 '14 at 20:20
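To make that last point concrete, a vectorized version of the post-import date conversion could look like the sketch below; it assumes the loaded DataFrame is called df, the date column is named 'x3', and the dates are zero-padded 'YYYY-MM-DD' strings (all assumptions).

# Hypothetical DataFrame 'df' whose date text sits in column 'x3'
# as zero-padded 'YYYY-MM-DD' strings.
df['x3'] = df['x3'].str[0:4].astype(int) * 100 + df['x3'].str[5:7].astype(int)

# Equivalent row-by-row route using the question's helper (slower, same result):
# df['x3'] = df['x3'].map(TextToYearMo)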