I am analyzing data from an industrial control system that records sensor data at 1-second intervals to a CSV file. We do the analysis weekly, so the CSV file is large (49 columns x ~600k rows). Sometimes (probably due to power outages in the field) the CSV file has corrupt rows: the number of columns is right, but the data types are wrong. Here is a dummy df that illustrates the issue at a smaller scale:
import pandas as pd

mydict = {'colFloat': [1.5, 2.3, 'TRUE', 'FALSE', 3.5, 7.8], 'colBool': ['FALSE', 'TRUE', 3.2, 1.1, 'TRUE', 'TRUE'], 'colInt': [2, 2, 3.7, 9.9, 1, 4]}
df = pd.DataFrame(data=mydict)
In this example, 'colFloat' should be of dtype float64, 'colBool' should be of dtype bool, and 'colInt' should be of dtype int64, but the two buggy rows in the middle break the dtypes: 'colFloat' and 'colBool' come out as object, and 'colInt' gets inferred as float64. Later, when I try to do groupbys and integrate sensor readings (for example from 'colFloat'), I get an error because the non-float values in the column cause problems.
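Printing the dtypes of the dummy df confirms this, and even a simple aggregation fails (the sum here is just a stand-in for the real groupby/integration work):

print(df.dtypes)
# colFloat     object
# colBool      object
# colInt      float64
# dtype: object

df['colFloat'].sum()
# raises a TypeError because the floats and the stray strings can't be added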
I'd like a method that I can run on the df that will remove the faulty rows and notify me that they exist. For my example above, the resulting df would look like dfclean:
mydictclean = {'colFloat': [1.5, 2.3, 3.5, 7.8], 'colBool': ['FALSE', 'TRUE', 'TRUE', 'TRUE'], 'colInt': [2, 2, 1, 4]}
dfclean = pd.DataFrame(data = mydictclean)
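To make the request concrete, here is a rough sketch of the kind of function I have in mind. The name drop_bad_rows and the per-type column lists are just placeholders, and I'm not sure that coercion via pd.to_numeric is the right approach, which is why I'm asking:

def drop_bad_rows(df, float_cols, bool_cols, int_cols):
    # Placeholder sketch: flag rows whose values can't be coerced to the
    # expected dtype, report them, drop them, and then fix the dtypes.
    bad = pd.Series(False, index=df.index)

    # Numeric columns: anything that won't parse as a number is suspect.
    # (Note: this would also flag genuinely missing values.)
    for col in float_cols:
        bad |= pd.to_numeric(df[col], errors='coerce').isna()
    for col in int_cols:
        coerced = pd.to_numeric(df[col], errors='coerce')
        bad |= coerced.isna() | (coerced % 1 != 0)

    # Bool columns: only the literal strings TRUE/FALSE are acceptable.
    for col in bool_cols:
        bad |= ~df[col].astype(str).str.upper().isin(['TRUE', 'FALSE'])

    # Notify about the faulty rows before dropping them.
    if bad.any():
        print(f"Dropping {int(bad.sum())} faulty rows at index {df.loc[bad].index.tolist()}")

    clean = df.loc[~bad].copy()
    for col in float_cols:
        clean[col] = pd.to_numeric(clean[col])
    for col in int_cols:
        clean[col] = pd.to_numeric(clean[col]).astype('int64')
    for col in bool_cols:
        clean[col] = clean[col].astype(str).str.upper().eq('TRUE')
    return clean.reset_index(drop=True)

dfclean = drop_bad_rows(df, ['colFloat'], ['colBool'], ['colInt'])

On the dummy df this drops the two buggy rows and restores float64/bool/int64 dtypes, but I don't know whether it scales sensibly to 49 columns x ~600k rows, or whether there is a more idiomatic way to do it.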
I tried specifying the dtypes using the dtype parameter in the read_csv call, but that just throws an error as soon as read_csv hits the bad rows.
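For completeness, this is roughly what that failed attempt looked like (the filename and the dtype mapping here are made up; the real file has 49 columns):

col_types = {'colFloat': 'float64', 'colBool': 'bool', 'colInt': 'int64'}
df = pd.read_csv('sensor_log.csv', dtype=col_types)
# fails with a ValueError (e.g. "could not convert string to float") on the corrupt rows

Thank you for your help!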