
I am currently working on a project where I have to load a large .csv file (half a million lines) and do some error handling on each row.

So far I'm loading my .csv file into a "dataFrame" variable:

import pandas as pd
import numpy as np

#Load the datafile into a DataFrame
dataFrame = pd.read_csv(filename, header=None,
    names=["year", "month", "day", "hour", "minute", "second", "zone1", "zone2", "zone3", "zone4"])

Then I'm running through each row in the dataFrame and doing my error handling such as:

#Check rows for corrupted measurements
for i in range(len(dataFrame)+1):

    #Define the row
    try:
        row = np.array(dataFrame.iloc[i,:], dtype=object)
    except IndexError:
        continue

    #If condition to check if there are corrupted measurements
    if -1 not in row:
        continue

    #Check fmode, ignore upper- or lowercase
    #forward fill
    if fmode.lower() in fmodeStr[0]:
        ...  # (error handling)

    elif fmode.lower() in fmodeStr[1]:
        ...  # (error handling)

    elif fmode.lower() in fmodeStr[2]:
        ...  # (error handling)

Where fmode is just a string specifying what kind of error handling the user wants to do.
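
For illustration, fmode and fmodeStr are along these lines (simplified here, and these particular strings are just placeholders; the real definitions are in the full code linked below):

#Hypothetical example values, only to make the loop above self-contained;
#the real definitions are in the linked dataLoad.py
fmodeStr = ["forward fill", "backward fill", "drop"]
fmode = "Forward"  #chosen by the user, matched case-insensitively against the entries of fmodeStr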

Right now the code works fine for a decent number of lines (1,000-5,000), but when the .csv file has half a million lines it takes a really long time to get through. That is not surprising, since I'm looping through every row of a half-million-row file.

I'm wondering what kind of solution would be most efficient for loading a .csv file of this size while at the same time doing some operations on the individual rows?

So far I've looked into:

  • Making a generator function that loads one row of the .csv file at a time, handles it, and saves it into a numpy matrix

  • Loading the .csv file with the chunksize option and concatenating at the end

  • Vector computation (however, the error handling includes replacing corrupted lines with valid lines before or after the corrupted line; a rough sketch of what I had in mind is shown after this list)
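
For the vector computation option, something like this is roughly what I had in mind (a sketch, untested; it assumes the corrupted measurements are marked with -1 in the zone columns, and the mapping of the three modes to forward fill / backward fill / drop is only a guess at what the fmodeStr entries stand for):

#Sketch of a vectorised version (untested), reusing dataFrame, fmode and fmodeStr from above
import numpy as np

zoneCols = ["zone1", "zone2", "zone3", "zone4"]

#Mark corrupted measurements as missing so pandas can fill them in one go
dataFrame[zoneCols] = dataFrame[zoneCols].replace(-1, np.nan)

if fmode.lower() in fmodeStr[0]:
    #Forward fill: replace corrupted values with the last valid row before them
    dataFrame[zoneCols] = dataFrame[zoneCols].ffill()
elif fmode.lower() in fmodeStr[1]:
    #Backward fill: replace corrupted values with the first valid row after them
    dataFrame[zoneCols] = dataFrame[zoneCols].bfill()
elif fmode.lower() in fmodeStr[2]:
    #Drop rows that still contain corrupted values
    dataFrame = dataFrame.dropna(subset=zoneCols)

I'm guessing this could also be combined with the chunksize option, although the filling would then have to be handled across chunk boundaries.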

Maybe you could do a combination of the above? Anyway, thank you for your time :)

For those who are interested / need more clarification, here is the full code: https://github.com/danmark2312/Project-Electricity/blob/test/functions/dataLoad.py

  • A generator would actually be *slower*. You should probably start by not looping over a range, and rather, using `df.iterrows` or the faster `df.itertuples`. Also, why are you putting things in `np.array`??? – juanpa.arrivillaga Oct 12 '17 at 20:48
  • Also, you say that you are working with *text*. Why are you using `numpy.ndarray` objects, then? – juanpa.arrivillaga Oct 12 '17 at 20:48
  • Your question seems like you want a solution that improves serial processing so I'll leave this as a comment. You should use the multiprocessing library and launch a process for each available thread on your computer. There are different ways to partition your data. You can load the dataframe into a shared queue or pass each Process its own partition of the data (I recommend the first so that you don't have any idle threads). If you want me to write this out as an answer let me know. – Matt S Oct 12 '17 at 20:56
  • Juanpa, I'm putting them in an np.array because I'm gonna use it for some matrix calculations later on in another function. Would it be quicker if I didn't make np.arrays to begin with, but only when I needed them? – Simon Moe Sørensen Oct 12 '17 at 22:14
  • The solution I was going to write out already exists: https://stackoverflow.com/questions/26784164/pandas-multiprocessing-apply. This will make it a few times faster depending on the type of processor you have. – Matt S Oct 13 '17 at 01:35
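
For reference, the df.itertuples suggestion from the comments would look roughly like this (a minimal sketch, untested, reusing the dataFrame, fmode and fmodeStr from the question):

#Iterate with itertuples instead of indexing with .iloc inside a range loop (untested sketch)
for row in dataFrame.itertuples(index=False):
    #row is a namedtuple; check the zone measurements for the corruption marker
    if -1 not in (row.zone1, row.zone2, row.zone3, row.zone4):
        continue

    if fmode.lower() in fmodeStr[0]:
        ...  # (error handling)
    elif fmode.lower() in fmodeStr[1]:
        ...  # (error handling)
    elif fmode.lower() in fmodeStr[2]:
        ...  # (error handling)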

0 Answers