0

I have a dataframe that contains county and state information (ie: Bucks, PA). I am trying to validate that my dataframe is correct and if not (as it often isn't spelled correctly), fix the mistakes. I have a separate list that contains all county and state information for the USA to use for against my dataframe for the validation.

This is what I have accomplished so far:

df_CountyState_List= ['Floyd, IA', 'Franklin, IA', 'Fremont, IA', 'Greene, IA' ] 
df_results = ['Franklin, IA', 'Freemont, IA', 'Green, IA'] 
desired df_results=[ 'Franklin, IA', 'Fremont, IA', 'Greene, IA' ]

for updatelocation in df_CountyState_List['Location'].to_list():
  df_results.loc[ df_results['Location'].str.contains(updatelocation), 'Location' ] ==updatelocation
  • Do you want to check if the values in each cell are "exactly" the same as one from your list? Or whether the values in each cell contain some value form the list (but maybe something else)? For the former, you can use `pandas.DataFrame.isin()`. – Ignatius Reilly Jun 06 '22 at 19:27
  • df_CountyState_List= ['Floyd, IA', 'Franklin, IA', 'Fremont, IA', 'Greene, IA' ] df_results = ['Franklin, IA', 'Freemont, IA', 'Green, IA'] desired df_results=[ 'Franklin, IA', 'Fremont, IA', 'Greene, IA' ] – Olivia Steinke Jun 06 '22 at 19:39

1 Answers1

0

If you only want to validate, i.e. to check whether the rows are OK or not, you can use isin().

Let's say your column is called "County", then:

df_results.County.isin(df_CountyState_List[0])

0     True
1    False
2    False
Name: County, dtype: bool

If you want also to correct the mistakes, you can use difflib as suggested in another answer. difflib requires a list, not a dataframe, so if df_CountyState_List is a DataFrame (as the 'df' in the name suggest):

countyState_List = df_CountyState_List[0].tolist()
df_validated = df_results.County.apply(lambda row: difflib.get_close_matches(row, countyState_List)[0])
Ignatius Reilly
  • 1,594
  • 2
  • 6
  • 15