
I am analyzing data from an industrial control system that records sensor data at 1-second intervals to a CSV file. We do analysis weekly, so the CSV file is large (49 columns x ~600k rows). Sometimes (probably due to power outages in the field), the CSV file has corrupt rows. We get rows where the number of columns is right but the data type is wrong. Here is a dummy DF that illustrates the issue at a smaller scale:

import pandas as pd

mydict = {'colFloat': [1.5, 2.3, 'TRUE', 'FALSE', 3.5, 7.8], 'colBool': ['FALSE', 'TRUE', 3.2, 1.1, 'TRUE', 'TRUE'], 'colInt': [2, 2, 3.7, 9.9, 1, 4]}
df = pd.DataFrame(data=mydict)

In this example, 'colFloat' should be of dtype float64, 'colBool' should be of dtype bool, and 'colInt' should be of dtype int64, but of course the two buggy rows in the middle cause all three columns to be of dtype object. Later, when I try to do groupbys and integrate sensor readings (for example from 'colFloat'), I get an error because of the non-float data in the column.

I'd like a method that I can run on the df that will remove the faulty rows and notify me that they exist. The resulting df in my example above would look like this df:

mydictclean = {'colFloat': [1.5, 2.3, 3.5, 7.8], 'colBool': ['FALSE', 'TRUE', 'TRUE', 'TRUE'], 'colInt': [2, 2, 1, 4]}
dfclean = pd.DataFrame(data = mydictclean)

I tried specifying the dtypes using the dtype parameter in the read_csv call, but that just throws an error when the function encounters the bad rows.
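For reference, the failing attempt looked roughly like this (the file name and dtype map here are placeholders):

df = pd.read_csv('data.csv',
                 dtype={'colFloat': 'float64', 'colBool': 'bool', 'colInt': 'int64'})
# raises ValueError as soon as a corrupt row is parsed

Thank you for your help!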

Sebastien

4 Answers


I would try to catch the type errors and convert at the same time, in all columns.

See "How do I check if a string is a number (float)?" for the first function:

def is_number(n):
    is_number = True
    try:
        num = float(n)
        # check for "nan" floats: nan != nan, so this is False for nan
        is_number = num == num   # or use `math.isnan(num)`
    except ValueError:
        is_number = False
    return is_number
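A quick check of its behavior (illustrative):

>>> is_number('3.5')
True
>>> is_number('TRUE')
False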

Then your own:

import numpy as np

def check_and_convert_int(x):
    if isinstance(x, int):
        return x
    # whole floats (e.g. 2.0) count as ints; strings fall through to NaN
    elif isinstance(x, float) and x.is_integer():
        return int(x)
    else:
        return np.nan

def check_and_convert_bool(x):
    # note: `in` tests with ==, so 1/1.0 would match True and 0/0.0 False
    if x in [True, 'True', 'TRUE']:
        return True
    elif x in [False, 'False', 'FALSE']:
        return False
    else:
        return np.nan

def check_and_convert_float(x):
    if is_number(x):
        return float(x)
    else:
        return np.nan

Then I would do:

df['colInt_clean'] = df['colInt'].apply(check_and_convert_int)
df['colBool_clean'] = df['colBool'].apply(check_and_convert_bool)
df['colFloat_clean'] = df['colFloat'].apply(check_and_convert_float)

Inspect the rows that may be dropped using:

df[df.isna().any(axis=1)]

If everything is fine, drop them:

df = df.dropna(subset=['colInt_clean', 'colBool_clean', 'colFloat_clean'])
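Putting it together, a minimal sketch of the notify-then-drop flow (the loop and the printout are my own glue around the functions above, converting in place rather than into _clean columns):

converters = {'colInt': check_and_convert_int,
              'colBool': check_and_convert_bool,
              'colFloat': check_and_convert_float}

cleaned = df.copy()
for col, func in converters.items():
    cleaned[col] = cleaned[col].apply(func)

# notify: show the offending source rows before discarding them
bad = cleaned.isna().any(axis=1)
if bad.any():
    print(f'Dropping {bad.sum()} faulty row(s):\n{df[bad]}')

cleaned = cleaned.dropna()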
Vincent
  • Thank you @Vincent! I mostly used @mozway's answer to solve my issue, but adapted it to use the check_and_convert_bool(x) function you provided, as it avoids an error when the column is already all bool. Thank you again! – Sebastien Jul 18 '22 at 12:09

I had to change the boolean values because, as given in the question, they are strings:

mydict = {'colFloat': [1.5, 2.3, 'TRUE', 'FALSE', 3.5, 7.8], 'colBool': [False, True, 3.2, 1.1, True, True], 'colInt': [2, 2, 3.7, 9.9, 1, 4]}
df = pd.DataFrame(data = mydict)

Then you can filter based on the instance type:

df = df[df['colFloat'].apply(isinstance, args=[float])]  # drops the rows where colFloat is a string
df = df[df['colBool'].apply(isinstance, args=[bool])]
df['colInt'] = df['colInt'].astype(int)  # safe now that the corrupt rows are gone
df = df[df['colInt'].apply(isinstance, args=[int])]
df

Output:

    colFloat    colBool colInt
0   1.5         False   2
1   2.3         True    2
4   3.5         True    1
5   7.8         True    4
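If you also want the notification the question asks for, the same instance checks can be combined into one mask on the unfiltered df and printed before filtering (an illustrative variant):

# build the mask on the original df so the bad rows can be reported
keep = (df['colFloat'].apply(isinstance, args=[float])
        & df['colBool'].apply(isinstance, args=[bool]))
print(df[~keep])   # the corrupt rows, for the notification
df = df[keep]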
Lorenzo Bonetti

You can use a set of converters, then drop the rows with NaNs:

types = {'colFloat': 'numeric', 'colInt': 'numeric', 'colBool': 'boolean'}

# each converter turns invalid values into NaN so they can be dropped
converter = {'numeric': lambda s: pd.to_numeric(s, errors='coerce'),
             'boolean': lambda s: s.str.upper().map({'TRUE': True, 'FALSE': False})
            }

dfclean = pd.DataFrame(index=df.index)
for c, t in types.items():
    dfclean[c] = converter[t](df[c])

dfclean = dfclean.dropna()
print(dfclean)

Output:

   colFloat  colInt colBool
0       1.5     2.0   False
1       2.3     2.0    True
4       3.5     1.0    True
5       7.8     4.0    True

NB: regarding int vs. float, it is not always easy to maintain the distinction with pandas. You can use convert_dtypes to perform an automated (safe) float -> int conversion when none of the values has a decimal part.
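For example, applied to the cleaned frame above (a minimal sketch):

# colInt (all whole floats) becomes Int64, colFloat becomes Float64,
# and colBool becomes the nullable 'boolean' dtype
dfclean = dfclean.convert_dtypes()
print(dfclean.dtypes)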

mozway
    Thank you @mozway! I found that in my code, columns that were already all bool before running this solution would raise an error. But I was able to adapt your solution to use the check_and_convert_bool(x) function from @Vincent to get a full solution. – Sebastien Jul 18 '22 at 12:08

Try this:

df.loc[df.colFloat.apply(lambda x: type(x) == float)]

But note that your colBool is a string column, not bool.
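Following that note, a string-aware check for colBool could look like this (sketch):

# colBool holds the strings 'TRUE'/'FALSE', so test membership instead
df.loc[df.colBool.apply(lambda x: x in ('TRUE', 'FALSE'))]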

bpfrd