
I want to find all values in a Pandas dataframe that contain whitespace (any arbitrary amount) and replace those values with NaNs.

Any ideas how this can be improved?

Basically I want to turn this:

                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz     
2000-01-05 -0.222552         4
2000-01-06 -1.176781  qux     

Into this:

                   A     B     C
2000-01-01 -0.532681   foo     0
2000-01-02  1.490752   bar     1
2000-01-03 -1.387326   foo     2
2000-01-04  0.814772   baz   NaN
2000-01-05 -0.222552   NaN     4
2000-01-06 -1.176781   qux   NaN

I've managed to do it with the code below, but man is it ugly. It's not Pythonic and I'm sure it's not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.

for i in df.columns:
    df[i][df[i].apply(lambda i: True if re.search(r'^\s*$', str(i)) else False)] = None

It could be optimized a bit by only iterating through fields that could contain empty strings:

if df[i].dtype == np.dtype('object'):

But that's not much of an improvement.

And finally, this code sets the target strings to None, which works with Pandas' functions like fillna(), but it would be nice for completeness if I could actually insert a NaN directly instead of None.
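For reference, NaN can be assigned directly inside that same loop; a minimal sketch on a small made-up frame (using .loc rather than chained indexing):

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({'B': ['foo', '   ', 'baz'], 'C': [0, 4, ' ']})

for col in df.columns:
    # mark cells that are empty or all whitespace
    mask = df[col].apply(lambda v: bool(re.search(r'^\s*$', str(v))))
    df.loc[mask, col] = np.nan
```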

martineau
Chris Clark
    What you really want is to be able to use [`replace`](http://pandas.pydata.org/pandas-docs/dev/missing_data.html#missing-data-replace) with a regex... (perhaps this should be requested as a feature). – Andy Hayden Nov 18 '12 at 23:23
  • 3
    I made a github issue for this feature: https://github.com/pydata/pandas/issues/2285 . Would be grateful for PRs! :) – Chang She Nov 19 '12 at 00:00
  • For those that want to turn exactly a single blank character to missing, see [this simple solution below](https://stackoverflow.com/a/47105408/3707607) – Ted Petrou Nov 03 '17 at 22:50

13 Answers


I think df.replace() does the job, since pandas 0.13:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', '  '],
], columns='A B C'.split(), index=pd.date_range('2000-01-01', '2000-01-06'))

# replace any field that's entirely whitespace (or empty) with NaN
print(df.replace(r'^\s*$', np.nan, regex=True))

Produces:

                   A    B   C
2000-01-01 -0.532681  foo   0
2000-01-02  1.490752  bar   1
2000-01-03 -1.387326  foo   2
2000-01-04  0.814772  baz NaN
2000-01-05 -0.222552  NaN   4
2000-01-06 -1.176781  qux NaN

As Temak pointed out, use df.replace(r'^\s+$', np.nan, regex=True) if empty strings count as valid data and only values made entirely of whitespace should be replaced.
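The difference between the two patterns can be checked on a small throwaway Series (the data here is illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', '', '   '])

# r'^\s*$' also matches the empty string; r'^\s+$' needs at least one whitespace char
star = s.replace(r'^\s*$', np.nan, regex=True)
plus = s.replace(r'^\s+$', np.nan, regex=True)
```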

smci
patricksurry
  • cannot understand why `pd.Series(["1", "#", "9", " .", None]).replace(regex=r"( +\.)|#").values` gives `['1', '#', '9', ' .', ' .']` – Winand Nov 03 '15 at 12:13
  • 1
    regex is a boolean flag. Maybe you mean `pd.Series(["1", "#", "9", " .", None]).replace(r"( +\.)|#", "X", regex=True).values` which gives `['1', 'X', '9', 'X', None]` – patricksurry Nov 03 '15 at 16:27
  • 2
    2 years on, I've changed the accepted answer to this, now that pandas supports it. Thanks! – Chris Clark Nov 04 '15 at 19:50
  • @patricksurry regex may be string: `frame.replace(regex=r"( +\.)|#", value=pd.np.NaN, inplace=True)` – Winand Nov 05 '15 at 05:32
  • @winand wow, never spotted that. It's unclear from docs how `value=None` (or omitted) should behave but seems like `s.replace(regex=r"( +\.)|#").values` should be equal to `s.replace(r"( +\.)|#", regex=True)`. Maybe file a bug https://github.com/pydata/pandas/issues ? – patricksurry Nov 05 '15 at 12:09
  • 44
    **NOTE**: if you don't want an element containing space in the middle to be replaced with NaN use `df.replace(r'^\s+$', np.nan, regex=True)` – Temak Aug 01 '16 at 10:13
  • 8
    I tried to use this, but found out that r'^\s*$' should be the expression to use. without ^ and $ it will match any string with two consecutive blanks. Also changed + to * to include the empty string "" in the list of things to convert to NaN – Master Yogurt Nov 18 '16 at 17:36
  • 2
    I am trying your solution in my code, but it has no effect. I am trying "energy["Energy Supply"].replace(to_replace = "...", value = np.NaN)". Wanting to change the string "..." to NaN values, but it does nothing and returns the same dataframe. – CuriousLearner Jan 07 '18 at 10:41
  • @Archan Joshi, I realize this is a year ago, but you're not using inplace, which isn't reassigning it. – spen.smith May 12 '19 at 03:45
  • @ChrisClark: it's useful to note it was [pandas version 0.13 which allowed regex in `df.replace()`](https://pandas.pydata.org/pandas-docs/version/0.13/generated/pandas.DataFrame.replace.html?highlight=replace#pandas.DataFrame.replace). For all us looking at this years later, wondering "why are the old answers so weird?" – smci Jan 06 '20 at 15:57

If you want to replace both empty strings and records with only spaces, the correct answer is:

df = df.replace(r'^\s*$', np.nan, regex=True)

The accepted answer

df.replace(r'\s+', np.nan, regex=True)

does not replace an empty string. You can try it yourself with the given example, slightly updated:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'fo o', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', ''],
], columns='A B C'.split(), index=pd.date_range('2000-01-01', '2000-01-06'))

Note also that 'fo o' is not replaced with NaN, though it contains a space. Further note that a simple:

df.replace(r'', np.NaN)

does not work either - try it out.
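Both points can be verified on a small hypothetical Series: the anchors force the whole value to be whitespace before it matches, so interior spaces survive:

```python
import numpy as np
import pandas as pd

s = pd.Series(['fo o', ' ', ''])
out = s.replace(r'^\s*$', np.nan, regex=True)
# 'fo o' is kept; ' ' and '' both become NaN
```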

Philipp Schwarz

How about:

d = d.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

The applymap function applies a function to every cell of the dataframe.
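In Python 3, where basestring is gone, the same idea can be sketched with str (note that isspace() returns False for the empty string, so '' is left untouched):

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'B': ['foo', '   ', '']})

# apply the check to every cell; only all-whitespace strings become NaN
d = d.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)
```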

BrenBarn
  • What a nice improvement! I should have thought of this in retrospect, but got hung up on doing boolean replacements for some reason. One question - is there an advantage to doing the basestring check vs. just str(x).isspace()? – Chris Clark Nov 18 '12 at 23:50
  • 1
    @ChrisClark: Either one is fine, although I would guess that the `isinstance` will be a bit faster. – BrenBarn Nov 18 '12 at 23:55
  • 15
    The reference to "basestring" in the code above will not work in Python 3.... in that case, try using "str" instead. – Spike Williams Feb 09 '15 at 20:39
  • 5
    Note that this solution doesn't replace empty strings `''`. To consider also empty strings, use: `d = d.applymap(lambda x: np.nan if isinstance(x, basestring) and (not x or x.isspace()) else x)` – tuomastik May 31 '17 at 12:34
  • In Python 3.x, the correct equivalent for `isinstance(x, basestring)` is `isinstance(x, str)` – KiriSakow Sep 03 '21 at 12:33

I did this:

df = df.apply(lambda x: x.str.strip()).replace('', np.nan)

or, if some columns are not strings (in DataFrame.apply, x is a whole column, so check its dtype rather than isinstance(x, str), which is always False for a Series):

df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).replace('', np.nan)

You can strip all str columns, then replace the resulting empty strings with np.nan.
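Spelled out on a small made-up frame, with a dtype check so numeric columns pass through untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': ['  foo  ', '   ']})

# strip string columns, then turn the resulting empty strings into NaN
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).replace('', np.nan)
```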

petezurich
Xiaorong Liao
  • lambda x: x.str.strip() should be lambda x: x.strip()? Minor suggestion: add .astype(str) in front, this solves other data issues for me. This works for me: df['column'] = df['column'].astype(str).apply(lambda x: x.strip()).replace('', np.nan) – Wouter Sep 16 '17 at 21:09
  • 1
    The second line of code handles both int/float and string type columns. Nice. Tks! – Kate Stohr Mar 09 '19 at 17:34

If you are importing the data from a CSV file, it can be as simple as this:

df = pd.read_csv(file_csv, na_values=' ')

This will create the data frame and read blank values in as NaN.
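Note that na_values=' ' only catches cells that are exactly one space. A self-contained check, with io.StringIO standing in for a real file:

```python
import io

import pandas as pd

csv_text = "A,B\n1, \n2,x\n"
df = pd.read_csv(io.StringIO(csv_text), na_values=[' '])
# the single-space cell in column B is read as NaN
```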

ayhan
  • 7
    Another option..using `skipinitialspace=True` also removes any whitespace after the delimiter which would cause any length of white space, empty strings to be read as `nan`. However if you want to retain the initial spaces for what ever reason then this option is not a good choice. – Rajshekar Reddy Oct 31 '18 at 03:31

Simplest of all solutions:

df = df.replace(r'^\s+$', np.nan, regex=True)
Gil Baggio

For a very fast and simple solution where you check equality against a single value, you can use the mask method.

df.mask(df == ' ')
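Usage on a small Series; note this equality test only catches cells that are exactly one space, not '' or longer runs of whitespace:

```python
import pandas as pd

s = pd.Series(['foo', ' ', 'bar', '  '])
out = s.mask(s == ' ')
# only the single-space cell becomes NaN
```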
Ted Petrou
print(df.isnull().sum())  # count null values in each column

modifiedDf = df.fillna("NaN")  # replace null values with the string "NaN"

# modifiedDf = df.dropna()  # or drop rows containing null values

print(modifiedDf.isnull().sum())  # check the counts again
Jayantha

This worked for me. When I imported my CSV file I added na_values=' '. Spaces are not included in the default NaN values.

df= pd.read_csv(filepath,na_values = ' ')
Joe Ferndz
sambrowne

These are all close to the right answer, but I wouldn't say any solve the problem while remaining most readable to others reading your code. I'd say that answer is a combination of BrenBarn's answer and tuomastik's comment below it. BrenBarn's answer utilizes the isspace builtin, but does not support removing empty strings, as the OP requested, and I would tend to attribute that as the standard use case of replacing strings with null.

I rewrote it with .applymap so it works on a pd.DataFrame (DataFrame.apply passes whole columns to the function, not individual cells; on a pd.Series, use .apply instead).

Python 3:

To replace empty strings or strings of entirely spaces:

df = df.applymap(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

df = df.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

To use this in Python 2, you'll need to replace str with basestring.

Python 2:

To replace empty strings or strings of entirely spaces:

df = df.applymap(lambda x: np.nan if isinstance(x, basestring) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

df = df.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)
spen.smith

This is not an elegant solution, but what does seem to work is saving to XLSX and then importing it back. The other solutions on this page did not work for me, unsure why.

data.to_excel(filepath, index=False)
data = pd.read_excel(filepath)
David Kong

This should work, where Variable is the name of the column whose empty strings you want to overwrite:

df.loc[df.Variable == '', 'Variable'] = 'Value'

or, to set Variable2 wherever Variable1 is empty:

df.loc[df.Variable1 == '', 'Variable2'] = 'Value'
General Grievance
  • 1
    is Variable a column name? and also when quoted as a string? this seems succinct but the example is too terse to follow. – dcsan Oct 15 '21 at 22:23

You can also use a filter to do it:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', '  '],
], columns='A B C'.split())

# boolean filter marking cells that are empty after stripping whitespace
mask = df.applymap(lambda v: isinstance(v, str) and v.strip() == '')
df[mask] = np.nan
ERIC