I would like to remove the extra index called service_type_id, which I did not include in my code but which appears anyway. I am using Python.

My code is

data_tr = data.groupby(['transaction_id', 'service_type']).sum().unstack().reset_index().fillna(0).set_index('transaction_id')

The output is a table with an extra index level (screenshot of the output table omitted).

I believe it has something to do with the groupby and unstack. Kindly highlight why there is an extra index and what my code should be.

The dataset: https://drive.google.com/file/d/1XZVfXbgpV0l3Oewgh09Vw5lVCchK0SEh/view?usp=sharing


2 Answers

It looks like you are trying to make a pivot table of transaction_id and service_type, using service_type_id as the value. The reason you are getting the extra index is that sum generates a sum for every (numerical) column, and those column names become the outer level of the columns after unstack.

For insight, try to execute just

data.groupby(['transaction_id', 'service_type']).sum()
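For example, with a small frame shaped like the one described (the toy values below are assumptions, not the real dataset), you can see exactly where the extra level comes from:

import pandas as pd

# Toy data mimicking the assumed columns of the real dataset
data = pd.DataFrame({
    'transaction_id': [1, 1, 2],
    'service_type': ['Yoga Lessons', 'Zumba Classes', 'Yoga Lessons'],
    'service_type_id': [10, 11, 10],
})

summed = data.groupby(['transaction_id', 'service_type']).sum()
print(summed.columns)            # Index(['service_type_id'], dtype='object')
print(summed.unstack().columns)  # MultiIndex: ('service_type_id', <service_type>)

The only column left after grouping is the numerical service_type_id, so after unstack its name becomes the outer level of the column MultiIndex.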

Since the data uses the label service_type_id, I assume the sum actually only serves the purpose of getting the id value out. A cleaner way to get the desired result is using a pivot:

data_tr = (
    data[['transaction_id', 'service_type', 'service_type_id']]
    .pivot(index='transaction_id',
           columns='service_type',
           values='service_type_id')
    .fillna(0)
)

Depending on how you like your data structured, you can follow up with a .reset_index().
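Note that pivot raises a ValueError when the same (transaction_id, service_type) pair occurs more than once. If duplicates are possible, pivot_table with an explicit aggregation function is a safer sketch (assuming summing the ids is really the intent here):

data_tr = data.pivot_table(index='transaction_id',
                           columns='service_type',
                           values='service_type_id',
                           aggfunc='sum',
                           fill_value=0)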

  • Thank you for your reply. However, the new index, service_type_id, is not available in the original dataset. I do not want it to appear, so I should not include it in my code, right? – Azul Aug 23 '21 at 09:33

I hope pandas.DataFrame.droplevel can do the job for your query:

import pandas as pd

df = pd.read_csv('Dataset - Transaction.csv')
# sum -> one numeric column (service_type_id); unstack -> MultiIndex columns;
# droplevel(0, axis=1) -> drop the leftover 'service_type_id' column level
data_tr = df.groupby(['transaction_id', 'service_type']).sum().unstack().reset_index().fillna(0).set_index('transaction_id').droplevel(0, axis=1)
data_tr.head(2)

Output (screenshot of the resulting table omitted)

df.groupby(['transaction_id', 'service_type']).sum() takes the sum of the numerical field service_type_id:

data_tr = df.groupby(['transaction_id', 'service_type']).sum().unstack()
print(data_tr.columns)

MultiIndex([('service_type_id',                 '3 Phase Wiring'),
            ('service_type_id',                   'AV Equipment'),
            ...
            ('service_type_id',                   'Yoga Lessons'),
            ('service_type_id',                  'Zumba Classes')],
           names=[None, 'service_type'], length=188)
#print(data_tr.info())

Initially there was only one column (service_type_id) and two index levels, transaction_id and service_type. After you unstack, service_type becomes tuple-like column labels (a MultiIndex), where each service type holds the value of service_type_id. droplevel(0, 1) drops the outer 'service_type_id' level along the column axis, converting your dataframe from a MultiIndex to a single Index as follows:

print(data_tr.columns)
Index(['3 Phase Wiring', ......,'Zumba Classes'],
      dtype='object', name='service_type', length=188)
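As a minimal self-contained sketch of what droplevel does to such columns (the toy labels below are made up for illustration):

import pandas as pd

# Columns shaped like the unstacked result: the outer level is the summed
# column's name, the inner level is the service type
cols = pd.MultiIndex.from_tuples(
    [('service_type_id', '3 Phase Wiring'), ('service_type_id', 'Yoga Lessons')],
    names=[None, 'service_type'],
)
toy = pd.DataFrame([[1, 0], [0, 2]], columns=cols)

print(toy.droplevel(0, axis=1).columns)
# Index(['3 Phase Wiring', 'Yoga Lessons'], dtype='object', name='service_type')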
  • Thank you for the help, @Rinshan Kolayil. The index service_type_id is not a column in the original dataset; I still do not understand where it comes from. Does unstacking the non-numerical column service_type generate a new column/index, and does this change the values when summing the number of service_type entries for each transaction_id (['transaction_id', 'service_type']) during the groupby process? – Azul Aug 23 '21 at 09:31
  • I can see there was a column `service_type_id` with numerical values in your dataframe. Please try `print(df.columns)` before the group by to check the columns. Unstack does not change the values; after unstack, `service_type` is converted to columns. – Rinshan Kolayil Aug 23 '21 at 09:38
  • Thanks for the detailed explanation. For the Edit part, the NaN values are filled with zero rather than removed during the groupby of transaction_id, am I right? May I know what the 0 refers to in the following code? df2.loc[0,"transaction_id"] = np.NAN – Azul Aug 23 '21 at 10:35
  • Yes, NaN values are filled with zeros by fillna(0), and df2.loc[0,"transaction_id"] refers to the transaction_id of the first row. In your case fillna works after the group by. Please do some experiments with the first 5 rows for better understanding. – Rinshan Kolayil Aug 23 '21 at 10:50
  • Thank you for the explanation. I have another question on Stack Overflow. Would you mind helping me with it as well? The link is: https://stackoverflow.com/questions/68891426/apriori-algorithm-not-showing-result – Azul Aug 23 '21 at 11:00