
I have to get rid of rows in a dataset that do not contain states from the list:

state_names = ["Alaska", "Alabama", "Arkansas", "American Samoa", "Arizona", "California", "Colorado", "Connecticut", "District of Columbia", "Delaware", "Florida"......]

The column that would contain a state name is called "place"; however, it does not contain just the state name. For example, I want to keep the row whose "place" column contains "26km S of Redoubt Volcano, Alaska" but get rid of the row that contains "10km WNW of Progreso, Mexico". How would I go about doing this using dropna()?

Anna Botts

3 Answers


I don't think .dropna() will be enough to do the job.

import pandas as pd


df = pd.DataFrame({"place": ["26km S of Redoubt Volcano, Alaska", "10km WNW of Progreso, Mexico"]})

state_names = [
    "Alaska",
    "Alabama",
    "Arkansas",
    "American Samoa",
    "Arizona",
    "California",
    "Colorado",
    "Connecticut",
    "District of Columbia",
    "Delaware",
]

df["contain_state_names"] = df["place"].apply(
    lambda x: any(word in x for word in state_names)
)
print(df)

filtered = df.loc[df["contain_state_names"]]

print(filtered)

will print:

                               place  contain_state_names
0  26km S of Redoubt Volcano, Alaska                 True
1       10km WNW of Progreso, Mexico                False

                               place  contain_state_names
0  26km S of Redoubt Volcano, Alaska                 True

A quick explanation: `.apply()` runs the lambda over every value of the `place` column, and `any(...)` is `True` when at least one entry of `state_names` appears as a substring of that value. `df.loc[df["contain_state_names"]]` then keeps only the rows where that flag is `True`.
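If you'd rather not keep the helper column around, the same check can be done with an inline mask (a sketch reusing a shortened `state_names` list):

```python
import pandas as pd

df = pd.DataFrame({"place": ["26km S of Redoubt Volcano, Alaska",
                             "10km WNW of Progreso, Mexico"]})
state_names = ["Alaska", "Alabama", "Arkansas"]

# Build the boolean mask on the fly instead of storing it as a column.
mask = df["place"].apply(lambda x: any(word in x for word in state_names))
filtered = df.loc[mask]
print(filtered)
```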


Use Series.str.contains to match substrings in a string; this is fine if your data is small and performance is not a concern (otherwise, see the reference below).

Example:

import pandas as pd
df = pd.DataFrame({"place": ["26km S of Redoubt Volcano, Alaska", "10km WNW of Progreso, Mexico", "Arkansassawbad", "Hotel California Royale", "Something Wrong"]})

state_names = [
    "Alaska",
    "Alabama",
    "Arkansas",
    "American Samoa",
    "Arizona",
    "California",
    "Colorado",
    "Connecticut",
    "District of Columbia",
    "Delaware",
]
df = df.loc[df["place"].str.contains('|'.join(state_names))]

Output:

                               place
0  26km S of Redoubt Volcano, Alaska
2                     Arkansassawbad
3            Hotel California Royale

Explanation:

Series.str.contains does a regex match by default. Since we are looking for any of several words to be present in a column, we build the regex by joining the words with | (the regex OR operator).
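Note that a plain substring match also keeps rows like "Arkansassawbad" above. If whole-word matches are wanted instead, the pattern can be tightened with `re.escape` and word boundaries (a sketch, not part of the original answer):

```python
import re
import pandas as pd

df = pd.DataFrame({"place": ["26km S of Redoubt Volcano, Alaska",
                             "Arkansassawbad",
                             "Hotel California Royale"]})
state_names = ["Alaska", "Arkansas", "California"]

# Escape each name (harmless here, but safe if a name ever contains a
# regex metacharacter) and require word boundaries around the match.
pattern = r"\b(?:" + "|".join(map(re.escape, state_names)) + r")\b"
out = df.loc[df["place"].str.contains(pattern)]
print(out)
```

With word boundaries, "Arkansassawbad" no longer matches, while "Hotel California Royale" still does because "California" appears as a whole word.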

Reference:

Select by partial string from a pandas DataFrame

รยקคгรђשค

Without dropna()

Although this first solution does not use the dropna() method, it accomplishes the same goal cleanly.

An example dataframe based from your description:

import numpy as np
import pandas as pd

In  [1]: state_names = ["Alaska", "Alabama"] # substrings to find
        df = pd.DataFrame(data={'place': ["10km WNW of Progreso, Mexico","26km S of Redoubt Volcano, Alaska"]})
        df
Out [1]: 
    place
0   10km WNW of Progreso, Mexico
1   26km S of Redoubt Volcano, Alaska

The idea is to slice using the pandas string methods and their regex support: joining the list of states with '|' means, in regex terms, "Alaska" or "Alabama" or ... This lets us select only the rows that have a state as a substring in the place column:

In  [2]: df = df[df.place.str.contains('|'.join(state_names))]
         df
Out [2]: 
    place
1   26km S of Redoubt Volcano, Alaska

With dropna()

This is one way to use dropna() to accomplish the same thing, although I find the first method simpler, as this involves the extra (and unneeded) step of assigning null values to the cells that do not contain a state and then dropping rows based on the "place" column's null values:

In  [3]: df.loc[~df.place.str.contains('|'.join(state_names), na=False), 'place'] = np.nan
         df.dropna(subset=['place'])
Out [3]:
    place
1   26km S of Redoubt Volcano, Alaska
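The two `In`/`Out` steps above, put together as a plain runnable script (same two-row frame):

```python
import numpy as np
import pandas as pd

state_names = ["Alaska", "Alabama"]  # substrings to find
df = pd.DataFrame({"place": ["10km WNW of Progreso, Mexico",
                             "26km S of Redoubt Volcano, Alaska"]})

# Step 1: null out the places that contain no state name.
df.loc[~df["place"].str.contains("|".join(state_names), na=False), "place"] = np.nan
# Step 2: drop the rows whose "place" is now null.
df = df.dropna(subset=["place"])
print(df)
```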
John Volk