Consider the following list, my_list
which has 3 DataFrames stored inside:
DF1
fur_color frequency column_name
Black 9843 fur_color
Brown 8733 fur_color
White 3419 fur_color
BLACK 1277 fur_color
Tan 988 fur_color
DF2
size frequency column_name
Small 8391 size
Medium 4730 size
Mdm 1322 size
L 4531 size
Large 5286 size
DF3
adoption_status frequency column_name
Yes 11239 adoption_status
Y 1532 adoption_status
No 3588 adoption_status
Scheduled 4127 adoption_status
Sched. 3774 adoption_status
I would like to create a single DataFrame consisting of all the erroneous values from each DataFrame in the list.
Values like Tan
(should just be Brown) or BLACK
(should be Black) in the fur_color column, Mdm
(should be Medium) and L
(should be Large) in the size column, and Y
(should be Yes) and Sched.
(should be Scheduled) in the adoption_status column are all erroneous values.
Ideally, I would like to create a DataFrame that looks like this:
erroneous_values
error_value frequency column_name expected_value
BLACK 1277 fur_color Black
Tan 988 fur_color Brown
Mdm 1322 size Medium
L 4531 size Large
Y 1532 adoption_status Yes
Sched. 3774 adoption_status Scheduled
For programmatic sake, please assume that the values in the frequency columns will be updated daily and can change.
I have some pre-defined criteria to make finding errors easy.
- fur_color values should be in the list ["Black", "Brown", "White"]
- size values should be in the list ["Small", "Medium", "Large"]
- adoption_status values should be in the list ["Yes", "No", Scheduled]
I have been experimenting by using in
/not in
on the columns to check if they have values not in the acceptable values lists, however, I cannot find a good way to implement this as a DataFrame (it usually returns lists). Is there a scalable (assuming more DataFrames are added) and elegent solution to this problem?