
I am trying to read a large number of .xls and .xlsx files of predominantly numeric data into Python using pd.read_excel. However, the files use an em-dash for missing values. I am trying to get Python to replace all these em-dashes with NaNs, but I can't seem to find a way to get Python to even recognize the character, let alone replace it. I tried the following, which did not work:

df['var'].apply(lambda x: re.sub(u'\2014','',x))

I also tried simply

df['var'].astype('float')

What would be the best way to convert all the em-dashes in a dataframe to NaNs, while keeping the numeric data as floats?

LauraF
  • Google a bit better mate, because I found [this](https://stackoverflow.com/questions/17097236/how-to-replace-values-with-none-in-pandas-data-frame-in-python) within 10 seconds. – dejoma May 29 '18 at 18:53
  • Duplicate of [this post](https://stackoverflow.com/questions/17097236/how-to-replace-values-with-none-in-pandas-data-frame-in-python). The solution is df.replace(). – dejoma May 29 '18 at 18:54
  • The issue is that I have a long dash, not a regular dash, so normal string find/replace functions aren't working, since Python is not matching '-' with '–'. I keep getting errors like this: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 0: ordinal not in range(128) – LauraF May 29 '18 at 19:20
  • `pd.read_excel` normally does not use ascii codecs. How exactly do you read the excel files? – DYZ May 29 '18 at 19:28
  • And simply copy-pasting that symbol in the same way? So not a hyphen but the symbol you're looking for? – dejoma Jun 06 '18 at 10:10

4 Answers

5

You should catch the error at an earlier stage. Tell pd.read_excel() to treat em-dashes as NaNs:

df = pd.read_excel(..., na_values=['–','—'])
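
As a minimal sketch of how this behaves (using pd.read_csv with an in-memory buffer for illustration, since na_values works the same way in pd.read_excel; the sample data is made up):

```python
import io

import pandas as pd

# Sample column where a missing value is marked with an en dash (U+2013).
# Passing both dash variants to na_values makes pandas parse them as NaN
# at read time, so the column comes out numeric.
buf = io.StringIO("var\n1.5\n\u2013\n3.0\n")
df = pd.read_csv(buf, na_values=["\u2013", "\u2014"])

print(df["var"].dtype)         # float64
print(df["var"].isna().sum())  # 1
```

Because the dashes are handled during parsing, no post-hoc string cleanup is needed and the column never becomes object dtype.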
DYZ
1

I think the most straightforward way to do this would be pd.to_numeric with the argument errors='coerce':

df['var'] = pd.to_numeric(df['var'], errors='coerce')

From the docs:

If ‘coerce’, then invalid parsing will be set as NaN
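
A short sketch of the coercion (the series values here are just illustrative):

```python
import pandas as pd

# Any string that fails to parse as a number (here the en dash)
# becomes NaN when errors='coerce' is set.
s = pd.Series(["1.5", "\u2013", "3.0"])
out = pd.to_numeric(s, errors="coerce")

print(out.dtype)         # float64
print(out.isna().sum())  # 1
```

This works regardless of which dash character (or any other placeholder) the files use, since anything unparseable is coerced to NaN.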

sacuL
0

Not sure exactly what was going on with those dashes (they showed up as u'\u2013' when I did df.get_value(0, 'var')), but I did find a solution that worked: it converted the dashes to NaNs and kept the numeric data as numbers.

import unicodedata

# Note: this is Python 2 (`unicode` built-in); in Python 3 use str instead
df['var'] = df['var'].map(unicode)
# Decompose to NFKD and drop non-ASCII characters, so the dashes become empty strings
df['var'] = df['var'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore'))
df['var'] = pd.to_numeric(df['var'])
LauraF
-1

df.replace({'-': None}) is what you are looking for. Found in another post on Stack Overflow.
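
Note that for this to help with the question above, the dict key has to be the literal en/em dash character rather than an ASCII hyphen '-'. A hedged sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"var": ["1.5", "\u2013", "3.0"]})

# The key must be the actual dash character ('\u2013'), not '-';
# replacing it with None makes pandas store a missing value there.
cleaned = df.replace({"\u2013": None})

print(cleaned["var"].isna().sum())  # 1
```

The column is still object dtype afterwards, so a follow-up pd.to_numeric or astype(float) is needed to get floats.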

dejoma