
I have a DataFrame that's read in from a CSV. The data has various problems. The one I'm concerned about for this post is that some data is not in the column it should be. For example, '900' is in the zipcode column, or 'RQ' is in the language column when it should be in the nationality column. In some cases, these "misinsertions" are just anomalies and can be converted to NaN. In other cases they indicate that the values have shifted one column to the right or the left, so that the whole row has misinserted data. I want to remove these shifted rows from the DataFrame and try to fix them separately. My proposed solution has been to keep track of the number of bad values in each row as I clean each column. Here is an example with the zipcode column:

import re
from numpy import nan

def is_zipcode(value: str, regx):
    # keep the value if it looks like a US zipcode, otherwise replace it with NaN
    if regx.match(value):
        return value
    else:
        return nan

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['ZIPCODE'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')

I'm doing something like this on every column in the DataFrame, depending on the data in that column; e.g., for the 'Nationality' column I'll look up the values in a JSON file of nationality codes.
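Roughly, for illustration (the file name nationality_codes.json, its layout, and the helper name clean_nationality are placeholders; my actual lookup differs):

import json
from numpy import nan

# hypothetical file: a JSON array of valid nationality codes, e.g. ["US", "FR", "RQ"]
with open('nationality_codes.json') as f:
    valid_nationalities = set(json.load(f))

def clean_nationality(value: str):
    # keep the value if it is a known nationality code, otherwise replace it with NaN
    return value if value in valid_nationalities else nan

df['Nationality'] = df['Nationality'].map(clean_nationality, na_action='ignore')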

What I haven't been able to achieve is to keep count of the bad values in a row. I tried something like this:

def is_zipcode(value: str, regx):
    # return 0 for a valid zipcode, 1 for a bad value
    if regx.match(value):
        return 0
    else:
        return 1

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['badValues'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')
# is_nationality() similarly returns 1 for a bad value and 0 otherwise
df['badValues'] = df['badValues'] + df['Nationality'].map(is_nationality, na_action='ignore')

And this can work to keep track of the bad values. What I'd like to do is somehow combine the process of cleaning the data and counting the bad values. I'd love to do something like this:

def is_zipcode(value: str, regx):
    if regx.match(value):
        return value
    else:
        # add 1 to the value of df['badValues'] at the corresponding index
        return nan

The problem is that I don't think it's possible to access the index of the value being passed to the map function. I looked at these two questions (one, two) but I didn't see a solution to my issue.
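For anyone with the same problem: the closest I've gotten is a mask-based version that does the count and the cleaning in two steps rather than the single combined pass I was hoping for. A minimal sketch, assuming df['badValues'] was initialised to 0 and the column is string-typed:

zip_pattern = r"^[0-9]{5}(?:-[0-9]{4})?$"

# True where the value looks like a valid zipcode; na=False keeps the mask boolean
mask = df['ZIPCODE'].str.match(zip_pattern, na=False)

# count: values that are present but don't match are bad
df['badValues'] = df['badValues'] + (~mask & df['ZIPCODE'].notna()).astype(int)

# clean: keep matching values, set everything else to NaN
df['ZIPCODE'] = df['ZIPCODE'].where(mask)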

1 Answer


I guess this would do what you want ...

is_zipcode_mask = df['ZIPCODE'].str.match(regex_for_zipcode)

print(len(df[is_zipcode_mask]))
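One caveat, in case the column contains missing values: str.match propagates NaN into the mask, and boolean indexing then fails. str.match accepts an na parameter that fills those entries:

# NaN entries would otherwise yield NaN in the mask; na=False marks them as non-matching
is_zipcode_mask = df['ZIPCODE'].str.match(regex_for_zipcode, na=False)

print(len(df[is_zipcode_mask]))   # rows that pass the check
print(len(df[~is_zipcode_mask]))  # rows that fail it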
Joran Beasley
  • Ooo, this is showing how much of a noob I am. I'll look into masking. Only issue is I'm now getting this `ValueError: Cannot mask with non-boolean array containing NA/NaN values` – студент001 Jul 21 '22 at 02:37
  • try and print is_zipcode_mask ... I was expecting a list of True False ... see https://pandas.pydata.org/docs/reference/api/pandas.Series.str.match.html – Joran Beasley Jul 21 '22 at 03:56
  • Can confirm that is_zipcode_mask is a list of True and False. I tried running `is_zipcode_mask = df['ZIPCODE'].str.match(regex_for_zipcode, na='')` and now I get a KeyError: None of [Index([True, True, True...)] are in the [columns]. – студент001 Jul 21 '22 at 12:21
  • Also, the goal is to simultaneously clean the data and keep track of the rows that were cleaned. – студент001 Jul 21 '22 at 13:20
  • to see the rows where the condition fails you can do `df[~is_zipcode_mask]` – Joran Beasley Jul 22 '22 at 20:28