0

Consider the following list, my_list which has 3 DataFrames stored inside:

                DF1
fur_color    frequency    column_name
Black          9843         fur_color
Brown          8733         fur_color
White          3419         fur_color
BLACK          1277         fur_color
Tan            988          fur_color

                DF2
size         frequency    column_name
Small          8391          size
Medium         4730          size
Mdm            1322          size
L              4531          size
Large          5286          size

                      DF3
adoption_status    frequency    column_name
    Yes             11239      adoption_status
    Y               1532       adoption_status
    No              3588       adoption_status
 Scheduled          4127       adoption_status
   Sched.           3774       adoption_status

I would like to create a single DataFrame consisting of all the erroneous values from each DataFrame in the list.

Values like Tan (should just be Brown) or BLACK (should be Black) in the fur_color column, Mdm (should be Medium) and L (should be Large) in the size column, and Y (should be Yes) and Sched. (should be Scheduled) in the adoption_status column are all erroneous values.

Ideally, I would like to create a DataFrame that looks like this:

                       erroneous_values
error_value    frequency    column_name   expected_value
BLACK          1277          fur_color        Black
Tan            988           fur_color        Brown
Mdm            1322            size           Medium
L              4531            size           Large
Y              1532        adoption_status    Yes
Sched.         3774        adoption_status    Scheduled

For programmatic sake, please assume that the values in the frequency columns will be updated daily and can change.

I have some pre-defined criteria to make finding errors easy.

  • fur_color values should be in the list ["Black", "Brown", "White"]
  • size values should be in the list ["Small", "Medium", "Large"]
  • adoption_status values should be in the list ["Yes", "No", Scheduled]

I have been experimenting by using in/not in on the columns to check if they have values not in the acceptable values lists, however, I cannot find a good way to implement this as a DataFrame (it usually returns lists). Is there a scalable (assuming more DataFrames are added) and elegent solution to this problem?

Tom Hood
  • 497
  • 7
  • 16

1 Answers1

1

You can use a filter like this

df1[~df1['fur_color'].isin(['Black','White','Brown'])

The ~ says that the result of the filter df1['fur_color'].isin(['Black','White','Brown']) should be negated and only select the 'other'rows

For instance

df1 = pd.read_clipboard()
df1
  fur_color  frequency column_name
0     Black       9843   fur_color
1     Brown       8733   fur_color
2     White       3419   fur_color
3     BLACK       1277   fur_color
4       Tan        988   fur_color

df1_errors = df1[~df1['fur_color'].isin(['Black','White','Brown'])
df1_errors

  fur_color  frequency column_name
3     BLACK       1277   fur_color
4       Tan        988   fur_color
MichaelD
  • 1,274
  • 1
  • 10
  • 16