How to lowercase a pandas dataframe string column if it has missing values?

Question

The following code does not work.

import pandas as pd
import numpy as np
df=pd.DataFrame(['ONE','Two', np.nan],columns=['x']) 
xLower = df["x"].map(lambda x: x.lower())

How should I tweak it to get xLower = ['one','two',np.nan]? Efficiency is important since the real data frame is huge.

From v0.25 onwards, I recommend `str.casefold` for more aggressive case folding string comparisons. More information in [this answer](https://stackoverflow.com/a/56084280/4909087). — cs95, May 10 '19 at 20:18

behzad.nouri · Accepted Answer · 2014-12-01T23:56:21.020

289

Use pandas' vectorized string methods. As the documentation says:

"these methods exclude missing/NA values automatically"

`.str.lower()` is the very first example there:

>>> df['x'].str.lower()
0    one
1    two
2    NaN
Name: x, dtype: object
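
For completeness, a minimal self-contained sketch of the same call on the frame from the question, which also checks that the missing value is still missing afterwards. The variable name x_lower is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['ONE', 'Two', np.nan], columns=['x'])

# .str.lower() lowercases the strings and passes missing values through
x_lower = df['x'].str.lower()

print(x_lower.tolist())         # ['one', 'two', nan]
print(x_lower.isna().tolist())  # [False, False, True]
```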

edited Dec 01 '14 at 23:56

answered Mar 07 '14 at 10:30

interestingly this is slower than the map method in the other answer `10000 loops, best of 3: 96.4 µs per loop` versus `10000 loops, best of 3: 125 µs per loop` – EdChum Mar 07 '14 at 10:44

@EdChum that is not surprising with only 3 elements; but it wouldn't be the case with say just 100 elements; – behzad.nouri Mar 07 '14 at 10:57

@behzad.nouri I tried df1['comment'] = df1['comment'].str.lower() but got the error KeyError: 'comment' every time. I checked, and I have a column with exactly that name. What could cause the error? – Katya Jan 02 '20 at 16:02

Mike W · Answer 2 · 2019-05-17T06:40:33.170

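Another possible solution, in case the column holds numbers as well as strings, is to convert everything to strings first with astype(str).str.lower(). Otherwise .str.lower() returns NaN for any value that is not a string, such as a number. Two caveats: depending on the pandas version, astype(str) on its own may turn NaN into the literal string 'nan', and to_string(na_rep='') returns one formatted string for the whole Series rather than a Series. The example therefore fills the missing values with an empty string before converting:

import pandas as pd
import numpy as np
df = pd.DataFrame(['ONE', 'Two', np.nan, 2], columns=['x'])
# fill NaN with '' first, then convert the number to a string and lowercase
xSecureLower = df['x'].fillna('').astype(str).str.lower()
# plain .str.lower() yields NaN for the non-string value 2
xLower = df['x'].str.lower()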

then we have:

>>> xSecureLower
0    one
1    two
2   
3      2
Name: x, dtype: object

and not

>>> xLower
0    one
1    two
2    NaN
3    NaN
Name: x, dtype: object

edit:

If you don't want to lose the NaNs, then map is the better choice (following @wojciech-walczak's answer and @cs95's comment). It looks something like this:

xSecureLower = df['x'].map(lambda x: x.lower() if isinstance(x,str) else x)
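
To make the trade-off concrete, here is a small sketch on a mixed column comparing plain .str.lower(), the fill-then-convert approach, and the isinstance-guarded map. The variable names are illustrative and not from the answer.

```python
import numpy as np
import pandas as pd

s = pd.Series(['ONE', 'Two', np.nan, 2], name='x')

# plain .str.lower(): every non-string value (NaN and 2) comes back as NaN
plain = s.str.lower()

# fill NaN with '' first, then stringify and lowercase: the NaN is lost
filled = s.fillna('').astype(str).str.lower()

# guarded map: only strings are lowercased, NaN and 2 are left untouched
guarded = s.map(lambda v: v.lower() if isinstance(v, str) else v)

print(plain.isna().tolist())  # [False, False, True, True]
print(filled.tolist())        # ['one', 'two', '', '2']
print(guarded.tolist())       # ['one', 'two', nan, 2]
```

Which one is appropriate depends on whether the missing values and the original number types need to survive the operation.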

Thanks, man! I forgot about NaNs, I just corrected the answer — Mike W, May 17 '19 at 06:37

score 13 · Answer 3 · edited May 02 '23 at 07:26

Pandas >= 0.25: Remove Case Distinctions with `str.casefold`

Starting from v0.25, I recommend using the "vectorized" string method str.casefold if you're dealing with Unicode data (it works whether the values are plain strings or Unicode):

s = pd.Series(['lower', 'CAPITALS', np.nan, 'SwApCaSe'])
s.str.casefold()

0       lower
1    capitals
2         NaN
3    swapcase
dtype: object

Also see related GitHub issue GH25405.

casefold lends itself to more aggressive case-folding comparison. It also handles NaNs gracefully (just as str.lower does).

But why is this better?

The difference shows up with non-ASCII text. To quote the example from the Python str.casefold docs:

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter 'ß' is equivalent to "ss". Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to "ss".

Compare the output of lower:

s = pd.Series(["der Fluß"])
s.str.lower()

0    der fluß
dtype: object

Versus casefold,

s.str.casefold()

0    der fluss
dtype: object
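
As a hedged illustration of why this matters for matching, the sketch below compares two Series element by element after lower and after casefold. The sample words and variable names are made up for the example.

```python
import pandas as pd

left = pd.Series(['Straße', 'HELLO'])
right = pd.Series(['STRASSE', 'hello'])

# lower() keeps 'ß', so 'straße' != 'strasse'
same_lower = left.str.lower() == right.str.lower()

# casefold() maps 'ß' to 'ss', so both sides become 'strasse'
same_fold = left.str.casefold() == right.str.casefold()

print(same_lower.tolist())  # [False, True]
print(same_fold.tolist())   # [True, True]
```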

Also see lower() vs. casefold() in string matching and converting to lowercase.

score 12 · Answer 4 · answered Feb 07 '19 at 02:06

You can also try this, which lowercases every string cell in the whole DataFrame:

df= df.applymap(lambda s:s.lower() if type(s) == str else s)
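
A version-tolerant sketch of the same idea: pandas 2.1 renamed DataFrame.applymap to DataFrame.map and deprecated the old name, so the code below picks whichever exists. The helper name lower_if_str and the second column n are illustrative assumptions, not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['ONE', 'Two', np.nan], 'n': [1, 2, 3]})

def lower_if_str(value):
    """Lowercase strings; return anything else (NaN, numbers) unchanged."""
    return value.lower() if isinstance(value, str) else value

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap
elementwise = df.map if hasattr(pd.DataFrame, 'map') else df.applymap
lowered = elementwise(lower_if_str)

print(lowered)
```

Note that an element-wise call like this runs a Python function per cell, so for a single string column .str.lower() is usually the simpler choice.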

Farid

`type(s) == str` should instead be `isinstance(s, str)` – cs95 May 17 '19 at 05:24

aravinda_gn · Answer 5 · 2020-06-05T02:46:40.757

10

Apply a lambda function, guarding against non-string values such as NaN:

df['original_category'] = df['original_category'].apply(lambda x: x.lower() if isinstance(x, str) else x)

edited Jun 05 '20 at 02:46

answered Apr 13 '20 at 06:05

score 8 · Answer 6 · edited May 19 '19 at 06:11

A possible solution:

import pandas as pd
import numpy as np

df=pd.DataFrame(['ONE','Two', np.nan],columns=['x']) 
xLower = df["x"].map(lambda x: x if type(x)!=str else x.lower())
print (xLower)

And a result:

0    one
1    two
2    NaN
Name: x, dtype: object

Not sure about the efficiency though.

edited May 19 '19 at 06:11

cs95

answered Mar 07 '14 at 08:43

Wojciech Walczak

Same as the other answer, use `isinstance` when checking the type of an object. – cs95 May 17 '19 at 05:43

score 2 · Answer 7 · answered Apr 11 '19 at 07:25

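Maybe use a list comprehension:

import pandas as pd
import numpy as np
df = pd.DataFrame(['ONE', 'Two', np.nan], columns=['Name'])
# lowercase strings only; str(i).lower() would turn NaN into the string 'nan'
df['Name'] = [i.lower() if isinstance(i, str) else i for i in df['Name']]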

print(df)

Andre_k

score 1 · Answer 8 · edited Mar 12 '20 at 19:20

Select your DataFrame column and simply apply .str.lower():

df=data['x']
newdf=df.str.lower()

edited Mar 12 '20 at 19:20

sentence

answered Mar 29 '18 at 12:24

Ch HaXam

score 0 · Answer 9 · edited Mar 12 '20 at 19:20

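Use the apply function:

# lowercase strings only and leave NaN untouched; .head(10) previews the first 10 rows
Xlower = df['x'].apply(lambda x: x.lower() if isinstance(x, str) else x).head(10)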

edited Mar 12 '20 at 19:20

sentence

answered Oct 16 '19 at 10:10

Ashutosh Shankar

Since efficiency matters to the asker (`Efficiency is important since the real data frame is huge.`) and there are already several other answers, please explain what your answer adds. – David García Bodego Oct 16 '19 at 10:35

mhc · Answer 10 · 2022-07-29T00:43:52.793

Replace missing values and any other datatype with empty string, and lowercase all the strings:

df["x"] = df["x"].apply(lambda x: x.lower() if isinstance(x, str) else "")

Replace missing values and any other datatype other than string with nan, and lowercase all the strings:

df["x"] = df["x"].apply(lambda x: x.lower() if isinstance(x, str) else np.nan)

Keep nan and any other datatype other than string as they are, and lowercase all the strings:

df["x"] = df["x"].apply(lambda x: x.lower() if isinstance(x, str) else x)

Instead of apply you can also use map.

In terms of speed, they are almost the same as df["x"] = df["x"].str.lower(). But with apply/map you can handle the missing values however you want.

I tested the speed on one million strings, 10% of which are nan and the rest of which have length 50.

Data generation:

Speed Comparison:
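
The original data-generation code and timing results are not reproduced here. As a stand-in, the sketch below builds a comparable Series and times each approach with timeit. Everything in it is an assumption made for illustration: the helper make_series, random ASCII letters as content, the fixed seed, and a smaller default of 100,000 rows so that it runs quickly. No timings are quoted, since they depend on the machine and the pandas version.

```python
import random
import string
import timeit

import numpy as np
import pandas as pd

def make_series(n=100_000, nan_share=0.1, length=50, seed=0):
    """Hypothetical generator: about nan_share of rows are NaN, the rest random letters."""
    rng = random.Random(seed)
    values = [
        np.nan if rng.random() < nan_share
        else ''.join(rng.choices(string.ascii_letters, k=length))
        for _ in range(n)
    ]
    return pd.Series(values, dtype=object)

s = make_series()

candidates = {
    'str.lower': lambda: s.str.lower(),
    'map': lambda: s.map(lambda v: v.lower() if isinstance(v, str) else v),
    'apply': lambda: s.apply(lambda v: v.lower() if isinstance(v, str) else v),
}

# best of three single runs per approach
for name, fn in candidates.items():
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f'{name:>10}: {seconds:.4f} s')
```

Pass n=1_000_000 to make_series to match the size described above.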
