0

I know how to check whether a pandas column contains a specific string, as explained in the post Check if certain value is contained in a dataframe column in pandas. However, I want to count the number of rows containing a specific string while allowing for variation in case. For instance, I want to match not only rows containing Portugal, but also those containing PORTUGAL or portugal. Is there a way of doing this?

This is as far as I got (I tried to get not only the count but also the percentage):

df[df['column'].str.contains('Portugal')].shape[0]/df['column'].shape[0]
Dumb ML

4 Answers

3
  • It's easier to cast the entire column to a single case, lowercase for example, and search for one variant.
    • This is also beneficial for further types of NLP analysis.
    • Other case conversions include:
      1. .str.capitalize(): 'Portugal'
      2. .str.upper(): 'PORTUGAL'
  • The solution by YOBEN_S should be used for instances where it's not desirable to convert the entire column to one case.
import pandas as pd

# test data
data = {'Country': ['PORTUGAL', 'ENGLAND', 'FRANCE', 'GERMANY', 'Portugal', 'SPAIN', 'SPAIN', 'portugal', 'ITALY', 'NETHERLANDS', 'PORTUGAL', 'ITALY', 'RUSSIA']}

# setup dataframe
df = pd.DataFrame(data)

# cast Country to lowercase
df['Country'] = df['Country'].str.lower()

# search for desired string with contains
portugal = df[df['Country'].str.contains('portugal')]

# display(portugal)
     Country
0   portugal
4   portugal
7   portugal
10  portugal
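
If overwriting the column is not wanted, the same idea can be sketched by lowercasing into a separate Series and using it only as a mask. This is a minimal sketch on a shortened version of the test data; the names `country_lower` and `portugal_rows` are illustrative, not part of the original answer.

```python
import pandas as pd

# shortened test data, same style as above
df = pd.DataFrame({'Country': ['PORTUGAL', 'ENGLAND', 'Portugal', 'SPAIN', 'portugal']})

# lowercase copy used only for matching; df['Country'] keeps its original casing
country_lower = df['Country'].str.lower()

# select rows whose lowercased value contains 'portugal'
portugal_rows = df[country_lower.str.contains('portugal')]
print(portugal_rows)
```

The selected rows keep their original spelling, which helps if the casing is needed later.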
Trenton McKinney
3

You can pass case=False to str.contains:

sub = df[df['Country'].str.contains('portugal',case=False)]
sub
Out[48]: 
     Country
0   PORTUGAL
4   Portugal
7   portugal
10  PORTUGAL
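
Since the question also asks for the count and the percentage, the boolean mask can be summed and averaged directly. The sketch below reuses the test data from the first answer; `na=False` is an added assumption so that missing values, if any, count as non-matches.

```python
import pandas as pd

data = {'Country': ['PORTUGAL', 'ENGLAND', 'FRANCE', 'GERMANY', 'Portugal', 'SPAIN', 'SPAIN',
                    'portugal', 'ITALY', 'NETHERLANDS', 'PORTUGAL', 'ITALY', 'RUSSIA']}
df = pd.DataFrame(data)

# True for every row containing 'portugal' in any casing
mask = df['Country'].str.contains('portugal', case=False, na=False)

count = int(mask.sum())   # number of matching rows
share = mask.mean()       # fraction of matching rows (multiply by 100 for a percentage)
print(count, share)
```

Using `mask.mean()` avoids dividing two `shape[0]` values by hand.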
BENY
1

Both @Trenton McKinney's and @YOBEN_S's answers will work. Another option is to use the inline regex flags, written in the form (?aiLmsux). In this case, use the case-insensitive flag i. It then doesn't matter how Portugal is capitalized, provided the spelling is correct.

df[df.Country.str.contains('(?i:Portugal)')]



     Country
0   PORTUGAL
4   Portugal
7   portugal
10  PORTUGAL
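
As a sketch of an equivalent spelling, the flag can also be passed through the `flags` parameter of `str.contains` rather than written inside the pattern. The small DataFrame below is an assumption made for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({'Country': ['PORTUGAL', 'ENGLAND', 'Portugal', 'SPAIN', 'portugal']})

# inline flag inside the pattern
mask_inline = df['Country'].str.contains('(?i)portugal')

# same flag passed as an argument instead
mask_flags = df['Country'].str.contains('portugal', flags=re.IGNORECASE)

print(mask_inline.equals(mask_flags))
```

Both masks select the same rows, so the choice is mostly a matter of readability.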
wwnde
0

You can check each case variant explicitly.

By default, pandas string matching is case sensitive, so searching for 'PORTUGAL' will not match 'portugal'. To get the wanted behavior, search for each spelling you want to accept.
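
One way to sketch explicit case-sensitive checks without nested if statements is to join the accepted spellings into a single alternation pattern. The `variants` list and the test data below are assumptions for illustration. Spellings not in the list, such as 'PortuGAL', would not match.

```python
import re
import pandas as pd

data = {'Country': ['PORTUGAL', 'ENGLAND', 'FRANCE', 'GERMANY', 'Portugal', 'SPAIN', 'SPAIN',
                    'portugal', 'ITALY', 'NETHERLANDS', 'PORTUGAL', 'ITALY', 'RUSSIA']}
df = pd.DataFrame(data)

# spellings to accept; matching stays case sensitive
variants = ['Portugal', 'PORTUGAL', 'portugal']

# build 'Portugal|PORTUGAL|portugal', escaping any regex metacharacters
pattern = '|'.join(re.escape(v) for v in variants)

mask = df['Country'].str.contains(pattern)
count = int(mask.sum())
print(count)
```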

Ege