
I have a DataFrame that I read in as:

df = pd.read_csv(r'path\file.csv', encoding = "ISO-8859-1")

This is how it looks,

   Machine ID  Machine  June   July   August
0  100         ABC      10     12     nan
1  100         ABC      nan    15     15
2  101         CDQ      12            20
3  101         CDQ      15     32     11

And data types:

Machine ID  int64
Machine     object
June        float64
July        object
August      float64

When I try to groupby like this,

machine_group = df.groupby(['Machine ID', 'Machine'])[['June', 'July', 'August']].sum() \
                    .reset_index()

I only get June and August, because July contains an empty space/empty string:

   Machine ID  Machine  June   August
0  100         ABC      10     15
1  101         CDQ      27     31

Therefore, I tried the following:

df = df.apply(pd.to_numeric, errors = 'ignore')

This did not convert my July column to numeric/float64.

Next, I tried this,

df.replace(r'\s+', np.nan, regex=True)

This also did not work; I still have the empty space in my DataFrame. Not sure what to do.
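To sanity-check the pattern itself, I also tried the replace on a tiny mock frame (values made up, not my real data), this time assigning the result back since `replace` returns a new DataFrame, and anchoring the regex so machine names that contain spaces are left alone:

```python
import numpy as np
import pandas as pd

# Mock frame (values made up): July has a whitespace-only cell,
# and one Machine name contains an internal space.
df = pd.DataFrame({
    'Machine ID': [100, 100, 101, 101],
    'Machine': ['ABC', 'AB C', 'CDQ', 'CDQ'],
    'July': ['12', '15', ' ', '32'],
})

# Anchored regex: only whitespace-ONLY cells become NaN, and the
# result is assigned back because replace() returns a new frame.
df = df.replace(r'^\s*$', np.nan, regex=True)
print(df['July'].isna().sum())  # 1 -- the blank cell is now NaN
print(df.loc[1, 'Machine'])     # 'AB C' -- names with spaces survive
```

On this mock the blank does become NaN, so I suspect something about my real file differs.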

I was reading this post; it seems I have the inverse of the issue described there.

How can I make sure I get NaN instead of an empty string? That empty string makes the July column object dtype, so it is dropped from the aggregation in the groupby.

(I checked that exact line in the original .csv file; it looks like a normal empty cell just like the others, yet the other empty cells are read in as nan and this particular one is not.)

Any suggestions would be nice.

user9431057
  • df = df.replace(r'\s+', np.nan, regex=True) – BENY Aug 16 '18 at 16:02
  • try `df['July'] = pd.to_numeric(df.July, errors='coerce')` – ALollz Aug 16 '18 at 16:03
  • @Wen I did that and it still shows `July` as `object`. And when I do `group by` I still don't get `July` :( – user9431057 Aug 16 '18 at 16:09
  • Try to look at na_values in pd.read_csv – BENY Aug 16 '18 at 16:11
  • @Wen yes, I have been reading some [docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), but what's confusing me is that it says I have to pass a `dict` for `na_values`. – user9431057 Aug 16 '18 at 16:16
  • @ALollz when I tried what you suggested, it converted all the cells that are empty to `nan`. It even converted cells that contain just a space. Some of the names in my Machine column have a space in them, so it converted that column to `float64` too. Also, I don't get any results from my `groupby` clause; I get an empty DataFrame. I am confused now. – user9431057 Aug 16 '18 at 16:21
  • @Wen since you suggested `na_values` I found [this](https://stackoverflow.com/questions/16157939/pandas-read-csv-fills-empty-values-with-string-nan-instead-of-parsing-date), and @bdiamante's answer suggests reading with `na_values = ['nan', '']`, but that still did not work. I still have that empty spot. – user9431057 Aug 16 '18 at 16:48
  • Can you please provide the contents of your `.csv` file? – jeschwar Aug 16 '18 at 17:25

2 Answers


My initial thought was to drop the row that has the empty space in the July column. I did not want to, though, because other columns in that row might hold values needed for analysis.

However, for now, I found a workaround: July is object type only because of the empty space. Using the following,

df['July'] = pd.to_numeric(df['July'], errors='coerce')

I can manually convert it to float64, and then my groupby works.
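Putting it together on a small mock of the frame above (the mock values are illustrative, not my real data):

```python
import pandas as pd

# Illustrative mock; July starts as object dtype because of the blank cell.
df = pd.DataFrame({
    'Machine ID': [100, 100, 101, 101],
    'Machine': ['ABC', 'ABC', 'CDQ', 'CDQ'],
    'June': [10.0, None, 12.0, 15.0],
    'July': ['12', '15', ' ', '32'],
    'August': [None, 15.0, 20.0, 11.0],
})

# Coerce: the blank becomes NaN and the column dtype becomes float64.
df['July'] = pd.to_numeric(df['July'], errors='coerce')

machine_group = df.groupby(['Machine ID', 'Machine'])[['June', 'July', 'August']] \
                  .sum().reset_index()
print(machine_group)  # July now shows up in the aggregation
```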

However, it would be ideal to handle this while reading in the DataFrame, e.g. with na_values = ['nan', ''] or as @Nick Tallant suggested below. Unfortunately, neither worked for me.
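One possible explanation (an assumption, since I can only test against my own file): if the offending cell contains a literal space character rather than being truly empty, then `na_values=['nan', '']` won't match it; the space itself has to be listed. Note also that `na_values` accepts a scalar or a list as well as a dict. A sketch with a simulated CSV:

```python
import io
import pandas as pd

# Simulated CSV where the July field of the second row is a single space.
csv_text = "Machine ID,Machine,July\n100,ABC,12\n101,CDQ, \n"

# Listing ' ' in na_values makes the space-only field parse as NaN,
# so the column comes out numeric instead of object.
df = pd.read_csv(io.StringIO(csv_text), na_values=[' '])
print(df['July'].dtype)  # float64
```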

user9431057

You might try specifying the data types for the columns so that any empty spaces/strings become NaN, using either dtype or converters:

df = pd.read_csv(r'path\file.csv', encoding="ISO-8859-1",
                 dtype={'June': int, 'July': int, 'August': int})

df = pd.read_csv(r'path\file.csv', encoding="ISO-8859-1",
                 converters={'June': int, 'July': int, 'August': int})

Edit: You can also try the NumPy dtypes (https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html).
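One caveat worth flagging (untested against the asker's actual file): if blank cells are present, `dtype={'July': int}` will likely raise, because a plain int column cannot hold NaN. A converter that maps blanks to NaN, yielding a float column, is one sketch of a way around that:

```python
import io
import numpy as np
import pandas as pd

# Simulated CSV where the July field of the second row is a single space.
csv_text = "Machine ID,Machine,July\n100,ABC,12\n101,CDQ, \n"

def to_float(field):
    # Converters receive the raw string field; treat blanks as missing.
    return np.nan if field.strip() == '' else float(field)

df = pd.read_csv(io.StringIO(csv_text), converters={'July': to_float})
print(df['July'].dtype)  # float64
```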

Nick Tallant