-3

Below is a snapshot of my data structure in pandas

  • I build the below structure in a for loop

  • I am using sortlevel to lexsort the dataframe

    df.sortlevel(inplace=True)

enter image description here

1) I need to get an efficient way to get and set specific rows as shown below. This is the formula i am using and it is not efficient.
a) Will i be able to set the values of the rows using assignment

df.loc[idx['AAA', slice(None),'LLL']].iloc[:,0:n]

df.loc[idx['AAA', slice(None),'LLL']].iloc[:,0:n] = another_df

enter image description here

2) How to Efficiently sum the columns for a result below

df.loc[idx['AAA', slice(None),'LLL']].iloc[:,0:n].sum(axis=1)

enter image description here

I am looking for an efficient way to slice the dataframe.

Thanks

Community
  • 1
  • 1
Uma Maheshwaraa
  • 563
  • 2
  • 9
  • 17

1 Answers1

0

Thanks for letting me know the right way to post questions for Pandas. Any case, below are my findings regarding this problem. Multindex is certainly powerful from the standpoint of organizing data and exporting to csv or excel. However, accessing and selecting data has been challenging to accomplish.

Best Practices for initializing multi index

  1. I found it easy to pre allocate the index rather than create them on the fly. Creating index on the fly is not efficient and you will be faced with lexsort warning.
  2. Sort the data frame index once the data frame is initialized.
  3. When accessing do not leave the row or column identifier empty. Use :

    for site_name in site_s:
    
    no_of_progs =  len(site_s[site_name])
    prog_name_in_sites = site_s[site_name].keys()
    prog_level_cols = ['A','B', 'C']
    prog_level_cols = ['A', 'C']
    
    site_level_cols = ['A PLAN', 'A TOTAL','A UP','A DOWN','A AVAILABLE' ]
    
    if counter == 0:
        pd_index_col = pd.MultiIndex.from_product([ [site_name], prog_name_in_sites,prog_level_cols],
                         names=['SITE', 'PROGRAM','TYPE'])
    else: 
        pd_index_col = pd_index_col.append(pd.MultiIndex.from_product([ [site_name], prog_name_in_sites,prog_level_cols],
                         names=['SITE', 'PROGRAM','TYPE']))
    if no_of_progs >1:
        pd_index_col = pd_index_col.append(pd.MultiIndex.from_product([ [site_name], ['LINES']  ,site_level_cols],
                         names=['SITE', 'PROGRAM','TYPE']))
    counter = counter+1
    
    df_A_site_level = pd.DataFrame(0,columns=arr_wk_num_wkly,index= pd_index_col, dtype=np.float64)
    
    df_A_site_level.sort_index(inplace=True)
    

For setting and getting below are the two methods I recommend

  • df.iloc - if you know the positional index of rows and/or columns
  • df.loc - if you want to access data based on labels

Accessing using loc - Use the below to set or get cell/row values

idx = pd.IndexSlice
df_A_site_level[idx[site_name, :,'C'], df_A_site_level[0:no]]

Accessing using iloc - Use the below to set or get cell/row values

df_A_site_level.iloc[no_1:no_2,no3:no_4]
Uma Maheshwaraa
  • 563
  • 2
  • 9
  • 17