
I have a DataFrame that I read in as:

df = pd.read_csv(r'path\file.csv', encoding = "ISO-8859-1")

This is how it looks,

   Machine ID  Machine  June   July   August
0  100         ABC      10     12     nan
1  100         ABC      nan    15     15
2  101         CDQ      12            20
3  101         CDQ      15     32     11

And data types:

Machine ID  int64
Machine     object
June        float64
July        object
August      float64

When I try to groupby like this,

machine_group = df.groupby(['Machine ID', 'Machine'])[['June', 'July', 'August']].sum() \
                    .reset_index()

I only get June and August, because July contains an empty space/empty string:

   Machine ID  Machine  June   August
0  100         ABC      10     15
1  101         CDQ      27     31

Therefore, I tried the following:

df = df.apply(pd.to_numeric, errors = 'ignore')

This did not convert my July column to numeric/float64.

Next, I tried this,

df.replace(r'\s+', np.nan, regex=True)

This also did not work; I still have the empty space in my DataFrame. Not sure what to do.
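To sanity-check the pattern itself, I also tried the replace on a tiny mock frame (values made up, not my real data), this time assigning the result back since `replace` returns a new DataFrame, and anchoring the regex so machine names that contain spaces are left alone:

```python
import numpy as np
import pandas as pd

# Mock frame (values made up): July has a whitespace-only cell,
# and one Machine name contains an internal space.
df = pd.DataFrame({
    'Machine ID': [100, 100, 101, 101],
    'Machine': ['ABC', 'AB C', 'CDQ', 'CDQ'],
    'July': ['12', '15', ' ', '32'],
})

# Anchored regex: only whitespace-ONLY cells become NaN, and the
# result is assigned back because replace() returns a new frame.
df = df.replace(r'^\s*$', np.nan, regex=True)
print(df['July'].isna().sum())  # 1 -- the blank cell is now NaN
print(df.loc[1, 'Machine'])     # 'AB C' -- names with spaces survive
```

On this mock the blank does become NaN, so I suspect something about my real file differs.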

I was reading this post; it seems I have the inverse of the issue described there.

How can I make sure I get NaN instead of an empty string? That empty string makes the July column object dtype, so it is dropped from the aggregation in the groupby.

(I checked that exact line in the original .csv file; it looks like a normal empty cell just like the others, yet the other empty cells are read in as nan and this particular one is not.)

Any suggestions would be nice.

user9431057
  • df = df.replace(r'\s+', np.nan, regex=True) – BENY Aug 16 '18 at 16:02
  • try `df['July'] = pd.to_numeric(df.July, errors='coerce')` – ALollz Aug 16 '18 at 16:03
  • @Wen I did that and it still shows `July` as `object`. And when I do `group by` I still don't get `July` :( – user9431057 Aug 16 '18 at 16:09
  • Try to look at na_values in pd.read_csv – BENY Aug 16 '18 at 16:11
  • @Wen yes, I have been reading some [docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), but what's confusing me is that it says I have to pass a `dict` for `na_values`. – user9431057 Aug 16 '18 at 16:16
  • @ALollz when I tried what you suggested, it converted all the cells that are empty to `nan`. It even converted cells that contain just a space. Some of the names in my Machine column have a space in them, so it converted that column to `float64` too. Also, I don't get any results from my `groupby` clause; I get an empty DataFrame. I am confused now. – user9431057 Aug 16 '18 at 16:21
  • @Wen since you suggested `na_values` I found [this](https://stackoverflow.com/questions/16157939/pandas-read-csv-fills-empty-values-with-string-nan-instead-of-parsing-date), and @bdiamante's answer suggests reading with `na_values = ['nan', '']`, but that still did not work. I still have that empty spot. – user9431057 Aug 16 '18 at 16:48
  • Can you please provide the contents of your `.csv` file? – jeschwar Aug 16 '18 at 17:25

2 Answers


My initial thought was to drop the row that has the empty space in the July column. I did not want to, though, because other columns in that row might hold values needed for analysis.

However, for now, I found a workaround: July is object type only because of the empty space. Using the following,

df['July'] = pd.to_numeric(df['July'], errors='coerce')

I can manually convert it to float64, and then my groupby works.
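Putting it together on a small mock of the frame above (the mock values are illustrative, not my real data):

```python
import pandas as pd

# Illustrative mock; July starts as object dtype because of the blank cell.
df = pd.DataFrame({
    'Machine ID': [100, 100, 101, 101],
    'Machine': ['ABC', 'ABC', 'CDQ', 'CDQ'],
    'June': [10.0, None, 12.0, 15.0],
    'July': ['12', '15', ' ', '32'],
    'August': [None, 15.0, 20.0, 11.0],
})

# Coerce: the blank becomes NaN and the column dtype becomes float64.
df['July'] = pd.to_numeric(df['July'], errors='coerce')

machine_group = df.groupby(['Machine ID', 'Machine'])[['June', 'July', 'August']] \
                  .sum().reset_index()
print(machine_group)  # July now shows up in the aggregation
```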

However, it would be ideal to handle this while reading in the DataFrame, e.g. with na_values = ['nan', ''] or as @Nick Tallant suggested below. Unfortunately, neither worked for me.
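One possible explanation (an assumption, since I can only test against my own file): if the offending cell contains a literal space character rather than being truly empty, then `na_values=['nan', '']` won't match it; the space itself has to be listed. Note also that `na_values` accepts a scalar or a list as well as a dict. A sketch with a simulated CSV:

```python
import io
import pandas as pd

# Simulated CSV where the July field of the second row is a single space.
csv_text = "Machine ID,Machine,July\n100,ABC,12\n101,CDQ, \n"

# Listing ' ' in na_values makes the space-only field parse as NaN,
# so the column comes out numeric instead of object.
df = pd.read_csv(io.StringIO(csv_text), na_values=[' '])
print(df['July'].dtype)  # float64
```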

user9431057

You might try specifying the data types for the columns so that any empty spaces/strings become NaN, using either dtype or converters:

df = pd.read_csv(r'path\file.csv', encoding="ISO-8859-1",
                 dtype={'June': int, 'July': int, 'August': int})

df = pd.read_csv(r'path\file.csv', encoding="ISO-8859-1",
                 converters={'June': int, 'July': int, 'August': int})

Edit: You can also try the NumPy dtypes (https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html).
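One caveat worth flagging (untested against the asker's actual file): if blank cells are present, `dtype={'July': int}` will likely raise, because a plain int column cannot hold NaN. A converter that maps blanks to NaN, yielding a float column, is one sketch of a way around that:

```python
import io
import numpy as np
import pandas as pd

# Simulated CSV where the July field of the second row is a single space.
csv_text = "Machine ID,Machine,July\n100,ABC,12\n101,CDQ, \n"

def to_float(field):
    # Converters receive the raw string field; treat blanks as missing.
    return np.nan if field.strip() == '' else float(field)

df = pd.read_csv(io.StringIO(csv_text), converters={'July': to_float})
print(df['July'].dtype)  # float64
```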

Nick Tallant