I am currently working on a project where I have to load a large .csv file (about half a million lines) and do some error handling on each row.
So far I'm loading my .csv file into a "dataFrame" variable:
import numpy as np
import pandas as pd

# Load the datafile into a DataFrame
dataFrame = pd.read_csv(filename, header=None,
                        names=["year", "month", "day", "hour", "minute", "second",
                               "zone1", "zone2", "zone3", "zone4"])
Then I'm running through each row in the dataFrame and doing my error handling such as:
# Check rows for corrupted measurements
for i in range(len(dataFrame)):
    # Define the row
    row = np.array(dataFrame.iloc[i, :], dtype=object)

    # Skip rows without corrupted measurements (-1 marks a corrupted value)
    if -1 not in row:
        continue

    # Check fmode, ignoring upper- or lowercase
    # Forward fill
    if fmode.lower() in fmodeStr[0]:
        (Error handling)
    elif fmode.lower() in fmodeStr[1]:
        (Error handling)
    elif fmode.lower() in fmodeStr[2]:
        (Error handling)
Where fmode is just a string specifying what kind of error handling the user wants to do.
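(For clarity: a row counts as corrupted if any of its values is -1. I think the same check could also be written as a single boolean mask over the whole DataFrame instead of testing row by row; this is just a sketch of that check, using the same names as my code above:)

# A row counts as corrupted if any of its values is -1 (same test as "-1 in row" above)
corrupted = (dataFrame == -1).any(axis=1)
print(corrupted.sum(), "corrupted rows out of", len(dataFrame))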
As of right now, the code works fine for a moderate number of lines (1,000-5,000). But when the .csv file has half a million lines, it takes a really long time to get through. That is not surprising, since I'm looping over every row of a half-million-row file in Python.
What would be the most efficient way to load a .csv file of this size and, at the same time, do these operations on the individual rows?
So far I've looked into:

- Making a generator function that loads one row of the .csv file at a time, handles it, and saves it in a numpy matrix
- Loading the .csv file with the chunksize option and concatenating at the end
- Vector computation (however, the error handling includes replacing corrupted lines with valid lines before or after the corrupted line; there is a rough sketch of this further down)
Maybe you could do a combination of the above? Anyways, thank you for your time :)
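To make the vector computation idea concrete, this is roughly what I imagine a forward-fill mode could look like without a Python-level loop. It assumes -1 marks a corrupted measurement in any column and that "forward fill" means copying the previous valid row; filename and the column names are the same as in my code above, and I haven't verified that it is correct or faster:

import numpy as np
import pandas as pd

cols = ["year", "month", "day", "hour", "minute", "second",
        "zone1", "zone2", "zone3", "zone4"]
dataFrame = pd.read_csv(filename, header=None, names=cols)

# Mark whole rows that contain a corrupted (-1) measurement
corrupted = (dataFrame == -1).any(axis=1)

# Blank out the corrupted rows, then fill them from the previous valid row
# (bfill() instead of ffill() would take the next valid row instead)
cleaned = dataFrame.copy()
cleaned.loc[corrupted] = np.nan
cleaned = cleaned.ffill()

Reading in chunks (the chunksize option) could probably be combined with this, but then a forward fill would need the last valid row of the previous chunk carried over, in case a chunk starts with corrupted rows.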
For those who are interested / need more clarification, here is the full code: https://github.com/danmark2312/Project-Electricity/blob/test/functions/dataLoad.py