
I have a very long dataset stored as a DataFrame. The column I am looking at is called "Country", and it contains quite a few countries. I want to change various values to "USA". The values I am trying to change are "U.S", "United States", "United states", etc. There are too many variations and typos (more than 100) for me to go through by hand. Is there a simpler way to change these values? Since there are other countries in the dataset, I cannot just change all the values to "USA".

hed
  • Include a [reproducible pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – sushanth Jul 08 '20 at 18:54
  • Have you tried fuzzywuzzy for fuzzy matching? I would start by trying to match US, USA, and United States, and then map those values to USA, or whatever version you have. – divingTobi Jul 08 '20 at 19:05

1 Answer


One thing you can do is look at the first letter of each word. In all of these instances, the first word starts with U and the second word (after splitting the string) starts with S. Here I am using the regular expressions package (re), which is commonly used when working with text.

import re
# Split each value on runs of non-letter characters and drop empty pieces
Split_parts = [[p for p in re.split(r'[^A-Za-z]+', str(s)) if p] for s in df['country']]

The above line of code splits each string on any non-alphabetic character (e.g. period, comma, semicolon). After splitting, you can write a for loop that records True or False depending on whether the first two words start with U and S, respectively.
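To see what the split produces, here is a minimal sketch on a few made-up spellings (the `samples` list is hypothetical, not taken from the dataset). It assumes the pattern `[^A-Za-z]+`, which splits on runs of non-letters and can leave empty strings at the edges, so those are dropped:

```python
import re

# Hypothetical sample spellings, for illustration only
samples = ["U.S", "United States", "United states", "u. s.", "Canada"]

# Split on runs of non-letters and drop the empty pieces the split can leave
split_parts = [[p for p in re.split(r"[^A-Za-z]+", s) if p] for s in samples]

for s, parts in zip(samples, split_parts):
    print(repr(s), "->", parts)
```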

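value = []
for parts in Split_parts:
    # True only if there are at least two words and they start with U and S (any case)
    if len(parts) >= 2 and parts[0][:1].lower() == 'u' and parts[1][:1].lower() == 's':
        value.append(True)
    else:
        value.append(False)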

After that you can replace the string with what you need (i.e. USA):

# Assign through .loc with the boolean list to avoid chained indexing
df.loc[value, 'country'] = 'USA'
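The same first-letter rule can also be written without explicit loops. The following is a sketch rather than the only way to do it: the sample `df` is made up, and the regular expression is an assumption that encodes "a word starting with U, then at least one non-letter, then a word starting with S".

```python
import pandas as pd

# Made-up data for illustration; the real frame has many more rows
df = pd.DataFrame({"country": ["U.S", "United States", "United states",
                               "Canada", "u. s.", "Uruguay", None]})

# Optional leading non-letters, a word starting with U/u,
# at least one non-letter, then a word starting with S/s
pattern = r"[^A-Za-z]*[Uu][A-Za-z]*[^A-Za-z]+[Ss]"

# str.match anchors at the start of each string; na=False treats missing values as non-matches
mask = df["country"].str.match(pattern, na=False)

# Assign through .loc with the boolean mask to avoid chained indexing
df.loc[mask, "country"] = "USA"
print(df)
```

Building the mask first makes it easy to inspect which rows would change before overwriting anything.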

The United States is the only country whose name has words starting with U and S, in that order. This solution is specific to this problem and will not carry over to every case you may face; for each one, you have to look for the distinguishing differences.
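The first-letter rule only looks at the first two words, so one-word spellings such as "US" are left untouched. One possible complement, sketched here under the assumption that such variants can be listed by hand, is to strip everything except letters, lowercase the result, and look it up in a small set. The function name `normalize_us` and the contents of `US_KEYS` are hypothetical.

```python
import re

# Hypothetical set of normalized spellings to treat as the United States
US_KEYS = {"us", "usa", "unitedstates", "unitedstatesofamerica"}

def normalize_us(name):
    """Return 'USA' if the letters-only, lowercased name is in US_KEYS, else the input unchanged."""
    if not isinstance(name, str):
        return name  # leave missing values (e.g. NaN) as they are
    key = re.sub(r"[^A-Za-z]", "", name).lower()
    return "USA" if key in US_KEYS else name

for raw in ["US", "U.S.A.", "united  states", "Canada"]:
    print(raw, "->", normalize_us(raw))
```

With a DataFrame, this could be applied as `df['country'] = df['country'].map(normalize_us)`.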

Cicilio