
I have a data-frame:

, overall_score, industry_score
0, 15, -
1, 18, 12
2, - , 1
3, - , -
4, 12, 3

For some reason when I run:

print(df.isnull().sum())

It does not treat the '-' at index 0, 2 and 3 as a NaN value. How can I fix this? The '-' really does mean the data point is missing.

Result of df.to_dict():

{' overall_score': {0: ' 15', 1: ' 18', 2: ' - ', 3: ' - ', 4: ' 12'}, ' industry_score': {0: ' -', 1: ' 12', 2: ' 1', 3: ' -', 4: ' 3'}}
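For reference, rebuilding the frame from that dict shows the problem: every cell is a padded string, and strings always count as valid (non-null) values:

```python
import pandas as pd

# rebuild the frame from the to_dict() output above
df = pd.DataFrame({' overall_score': {0: ' 15', 1: ' 18', 2: ' - ', 3: ' - ', 4: ' 12'},
                   ' industry_score': {0: ' -', 1: ' 12', 2: ' 1', 3: ' -', 4: ' 3'}})

# no cell is None or NaN, so nothing registers as missing
print(df.isnull().sum().sum())  # 0
```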
Anton vBR
PEREZje

2 Answers


read_csv

Use the na_values argument to handle the problem while parsing the file.

pd.read_csv('test.csv', na_values=['-'], index_col=0, sep=r'\s*,\s*', engine='python')

   overall_score  industry_score
0           15.0             NaN
1           18.0            12.0
2            NaN             1.0
3            NaN             NaN
4           12.0             3.0
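Since the fields in the question are padded with spaces, the regex separator also strips that padding so na_values can match the bare '-' tokens. A self-contained sketch of the same call, reading from a string instead of a test.csv file:

```python
import io
import pandas as pd

csv = '''\
, overall_score, industry_score
0, 15, -
1, 18, 12
2, - , 1
3, - , -
4, 12, 3
'''

# the regex separator eats the whitespace around commas, and na_values
# then matches the bare '-' tokens left over
df = pd.read_csv(io.StringIO(csv), na_values=['-'], index_col=0,
                 sep=r'\s*,\s*', engine='python')
print(df.isnull().sum())
```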

mask

This can be useful if the columns are of object dtype and are intended to stay that way.

df.mask(lambda x: x == '-')

  overall_score industry_score
0            15            NaN
1            18             12
2           NaN              1
3           NaN            NaN
4            12              3
piRSquared
  • Yes this works. However it makes no sense to not convert columns to numeric as we are dealing with scores (numbers). – Anton vBR Jun 04 '18 at 15:18
  • The **`read_csv`** answer addresses how to solve the problem ahead of time. The **`mask`** answer addresses when columns are strings and `-` is intended to be a null value. `to_numeric` might be very useful but in this context is a club assuming everything is meant to be numeric. Even if OP intended it so, it doesn't mean that everyone searching for an answer will have the same situation. – piRSquared Jun 04 '18 at 15:21
  • Ok I buy that and I had the same intention in my comments. But it should be included in your answer as an explanatory text. Don't you agree? – Anton vBR Jun 04 '18 at 15:22
  • Sure, it would improve the answer. – piRSquared Jun 04 '18 at 15:23
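As the comments discuss, if the columns are ultimately meant to be numeric, a to_numeric pass can follow (or replace) the mask; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'overall_score': ['15', '18', '-', '-', '12'],
                   'industry_score': ['-', '12', '1', '-', '3']})

# mask turns '-' into NaN but leaves the columns as object dtype
masked = df.mask(df == '-')

# to_numeric coerces the remaining strings to floats (anything
# unparseable would also become NaN)
numeric = masked.apply(pd.to_numeric, errors='coerce')
print(numeric.dtypes)
```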

You say your data is scraped. But at some point it is read into a dataframe, and it would be more efficient to handle this in that reading step, e.g. by passing an appropriate dtype or na_values.

But let us assume you took over that dataframe as-is. In that case, use df.apply(pd.to_numeric, errors='coerce') to convert the values to numbers; in the process, invalid entries such as '-' are replaced with NaN.

Full example:

import pandas as pd
from io import StringIO  # pd.compat.StringIO was removed in pandas 1.0

data = '''\
overall_score,industry_score
15,-
18,12
-,1
-,-
12,3'''

df = pd.read_csv(StringIO(data), sep=',')
print(df.isnull().sum())

#overall_score     0
#industry_score    0
#dtype: int64

cols = ['overall_score', 'industry_score']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
print(df.isnull().sum())

#overall_score     2
#industry_score    2
#dtype: int64
Anton vBR