
I have a data-frame:

, overall_score, industry_score
0, 15, -
1, 18, 12
2, - , 1
3, - , -
4, 12, 3

For some reason when I run:

print(df.isnull().sum())

It does not treat the '-' at index 0, 2 and 3 as a NaN value. How can I fix this? The '-' really does mean the data point is missing.

Result of df.to_dict():

{' overall_score': {0: ' 15', 1: ' 18', 2: ' - ', 3: ' - ', 4: ' 12'}, ' industry_score': {0: ' -', 1: ' 12', 2: ' 1', 3: ' -', 4: ' 3'}}
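For reference, rebuilding the frame from that dict shows the problem: every cell is a padded string, and strings always count as valid (non-null) values:

```python
import pandas as pd

# rebuild the frame from the to_dict() output above
df = pd.DataFrame({' overall_score': {0: ' 15', 1: ' 18', 2: ' - ', 3: ' - ', 4: ' 12'},
                   ' industry_score': {0: ' -', 1: ' 12', 2: ' 1', 3: ' -', 4: ' 3'}})

# no cell is None or NaN, so nothing registers as missing
print(df.isnull().sum().sum())  # 0
```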
Anton vBR
PEREZje

2 Answers


read_csv

Use the na_values argument to handle the problem while parsing the file.

pd.read_csv('test.csv', na_values=['-'], index_col=0, sep=r'\s*,\s*', engine='python')

   overall_score  industry_score
0           15.0             NaN
1           18.0            12.0
2            NaN             1.0
3            NaN             NaN
4           12.0             3.0
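Since the fields in the question are padded with spaces, the regex separator also strips that padding so na_values can match the bare '-' tokens. A self-contained sketch of the same call, reading from a string instead of a test.csv file:

```python
import io
import pandas as pd

csv = '''\
, overall_score, industry_score
0, 15, -
1, 18, 12
2, - , 1
3, - , -
4, 12, 3
'''

# the regex separator eats the whitespace around commas, and na_values
# then matches the bare '-' tokens left over
df = pd.read_csv(io.StringIO(csv), na_values=['-'], index_col=0,
                 sep=r'\s*,\s*', engine='python')
print(df.isnull().sum())
```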

mask

This can be useful if the columns are of object dtype and are intended to stay that way.

df.mask(lambda x: x == '-')

  overall_score industry_score
0            15            NaN
1            18             12
2           NaN              1
3           NaN            NaN
4            12              3
piRSquared
  • Yes this works. However it makes no sense to not convert columns to numeric as we are dealing with scores (numbers). – Anton vBR Jun 04 '18 at 15:18
  • The **`read_csv`** answer addresses how to solve the problem ahead of time. The **`mask`** answer addresses when columns are strings and `-` is intended to be a null value. `to_numeric` might be very useful but in this context is a club assuming everything is meant to be numeric. Even if OP intended it so, it doesn't mean that everyone searching for an answer will have the same situation. – piRSquared Jun 04 '18 at 15:21
  • Ok I buy that and I had the same intention in my comments. But it should be included in your answer as an explanatory text. Don't you agree? – Anton vBR Jun 04 '18 at 15:22
  • Sure, it would improve the answer. – piRSquared Jun 04 '18 at 15:23
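As the comments discuss, if the columns are ultimately meant to be numeric, a to_numeric pass can follow (or replace) the mask; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'overall_score': ['15', '18', '-', '-', '12'],
                   'industry_score': ['-', '12', '1', '-', '3']})

# mask turns '-' into NaN but leaves the columns as object dtype
masked = df.mask(df == '-')

# to_numeric coerces the remaining strings to floats (anything
# unparseable would also become NaN)
numeric = masked.apply(pd.to_numeric, errors='coerce')
print(numeric.dtypes)
```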

You say your data is scraped. But at some point it is read into a dataframe, and it would be more efficient to handle this in that reading step, e.g. by passing an appropriate dtype or na_values.

But let us assume you took over that dataframe as-is. In that case, use df.apply(pd.to_numeric, errors='coerce') to convert the values to numbers; in the process, invalid entries such as '-' are replaced with NaN.

Full example:

import pandas as pd
from io import StringIO  # pd.compat.StringIO was removed in pandas 1.0

data = '''\
overall_score,industry_score
15,-
18,12
-,1
-,-
12,3'''

df = pd.read_csv(StringIO(data), sep=',')
print(df.isnull().sum())

#overall_score     0
#industry_score    0
#dtype: int64

cols = ['overall_score', 'industry_score']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
print(df.isnull().sum())

#overall_score     2
#industry_score    2
#dtype: int64
Anton vBR