
I am trying to read a large number of .xls and .xlsx files of predominantly numeric data into Python using pd.read_excel. However, the files use an em-dash for missing values. I am trying to get Python to replace all these em-dashes with NaNs, but I can't seem to find a way to get Python to even recognize the character, let alone replace it. I tried the following, which did not work:

df['var'].apply(lambda x: re.sub(u'\2014','',x))

I also tried simply

df['var'].astype('float')

What would be the best way to convert all the em-dashes in a dataframe to NaNs, while keeping the numeric data as floats?

LauraF
  • Google a bit better mate, because I found [this](https://stackoverflow.com/questions/17097236/how-to-replace-values-with-none-in-pandas-data-frame-in-python) within 10 seconds. – dejoma May 29 '18 at 18:53
  • Duplicate of [this post](https://stackoverflow.com/questions/17097236/how-to-replace-values-with-none-in-pandas-data-frame-in-python). The solution is df.replace(). – dejoma May 29 '18 at 18:54
  • The issue is that I have a long dash, not a regular dash, so normal string find/replace functions aren't working, since Python is not matching '-' with '–'. I keep getting errors like this: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 0: ordinal not in range(128) – LauraF May 29 '18 at 19:20
  • `pd.read_excel` normally does not use ascii codecs. How exactly do you read the excel files? – DYZ May 29 '18 at 19:28
  • And simply copy-pasting that symbol in the same way? So not a hyphen but the symbol you're looking for? – dejoma Jun 06 '18 at 10:10

4 Answers

5

You should catch the error at an earlier stage. Tell pd.read_excel() to treat em-dashes as NaNs:

df = pd.read_excel(..., na_values=['–','—'])
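
As a minimal sketch of how this behaves (using pd.read_csv with an in-memory buffer for illustration, since na_values works the same way in pd.read_excel; the sample data is made up):

```python
import io

import pandas as pd

# Sample column where a missing value is marked with an en dash (U+2013).
# Passing both dash variants to na_values makes pandas parse them as NaN
# at read time, so the column comes out numeric.
buf = io.StringIO("var\n1.5\n\u2013\n3.0\n")
df = pd.read_csv(buf, na_values=["\u2013", "\u2014"])

print(df["var"].dtype)         # float64
print(df["var"].isna().sum())  # 1
```

Because the dashes are handled during parsing, no post-hoc string cleanup is needed and the column never becomes object dtype.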
DYZ
1

I think the most straightforward way to do this would be pd.to_numeric with the argument errors='coerce':

df['var'] = pd.to_numeric(df['var'], errors='coerce')

From the docs:

If ‘coerce’, then invalid parsing will be set as NaN
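
A short sketch of the coercion (the series values here are just illustrative):

```python
import pandas as pd

# Any string that fails to parse as a number (here the en dash)
# becomes NaN when errors='coerce' is set.
s = pd.Series(["1.5", "\u2013", "3.0"])
out = pd.to_numeric(s, errors="coerce")

print(out.dtype)         # float64
print(out.isna().sum())  # 1
```

This works regardless of which dash character (or any other placeholder) the files use, since anything unparseable is coerced to NaN.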

sacuL
0

Not sure exactly what was going on with those dashes (they showed up as u'\u2013' when I did df.get_value(0, 'var')), but I did find a solution that worked: it converted the dashes to NaNs and kept the numeric data as numbers.

import unicodedata

# Note: this is Python 2 (`unicode` built-in); in Python 3 use str instead
df['var'] = df['var'].map(unicode)
# Decompose to NFKD and drop non-ASCII characters, so the dashes become empty strings
df['var'] = df['var'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore'))
df['var'] = pd.to_numeric(df['var'])
LauraF
-1

df.replace({'-': None}) is what you are looking for. Found in another post on Stack Overflow.
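
Note that for this to help with the question above, the dict key has to be the literal en/em dash character rather than an ASCII hyphen '-'. A hedged sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"var": ["1.5", "\u2013", "3.0"]})

# The key must be the actual dash character ('\u2013'), not '-';
# replacing it with None makes pandas store a missing value there.
cleaned = df.replace({"\u2013": None})

print(cleaned["var"].isna().sum())  # 1
```

The column is still object dtype afterwards, so a follow-up pd.to_numeric or astype(float) is needed to get floats.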

dejoma