I was wondering how to build a multi-level index for a dataframe, where the second level numbers the items of a list that was stored in one of its columns.

Since it is likely better to show by example, here is a script that displays what I have, and what I would want:

def ungroup_column(df, column, split_column = None):
    '''
    # Summary
        Takes a dataframe column that contains lists and spreads the items in the list over many rows
        Similar to pandas.melt(), but acts on lists within the column

    # Example

        input dataframe:

                farm_id animals
            0   1       [pig, sheep, dog]
            1   2       [duck]
            2   3       [pig, horse]
            3   4       [sheep, horse]


        output dataframe:

                farm_id animals
            0   1       pig
            0   1       sheep
            0   1       dog
            1   2       duck
            2   3       pig
            2   3       horse
            3   4       sheep
            3   4       horse

    # Arguments

        df: (pandas.DataFrame)
            dataframe to act upon

        column: (String)
            name of the column which contains lists to separate

        split_column: (String)
            column to be added to the dataframe containing the split items that were in the list
            If this is not given, the values will be written over the original column
    '''
    if split_column is None:
        split_column = column

    # split column into multiple columns (one col for each item in list) for every row
    # then transpose it to make the lists go down the rows
    list_split_matrix = df[column].apply(pd.Series).T

    # Now the columns of `list_split_matrix` (they're just integers)
    # are the indices of the rows in `df` - i.e. `df_row_idx`
    # so this melt concats each column on top of each other
    melted_df = pd.melt(list_split_matrix, var_name = 'df_row_idx', value_name = split_column).dropna().set_index('df_row_idx')

    # when overwriting, drop the original list column before joining
    if split_column == column:
        df = df.drop(column, axis = 1)
    return df.join(melted_df)

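import pandas as pd
import spacy
from IPython.display import display

# Assumption: the original post does not say which spaCy model `nlp` is;
# any English pipeline that provides sentence segmentation should work here.
nlp = spacy.load('en_core_web_sm')

# Earlier attempt on my real data (train_df is not defined in this example):
# sent_idx = train_df.groupby('pmid')['sentence'].apply(lambda row: range(0, len(list(row))))
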
doc_texts = ['Here is a sentence. And Another. Yet another sentence.',
            'Different Document here. With some other sentences.']
playing_df = pd.DataFrame({'doc':[nlp(doc) for doc in doc_texts],
                           'sentences':[[s for s in nlp(doc).sents] for doc in doc_texts]})
display(playing_df)
display(ungroup_column(playing_df, 'sentences'))

The output of this is as follows:

doc sentences
0   (Here, is, a, sentence, ., And, Another, ., Ye...   [(Here, is, a, sentence, .), (And, Another, .)...
1   (Different, Document, here, ., With, some, oth...   [(Different, Document, here, .), (With, some, ...
doc sentences
0   (Here, is, a, sentence, ., And, Another, ., Ye...   (Here, is, a, sentence, .)
0   (Here, is, a, sentence, ., And, Another, ., Ye...   (And, Another, .)
0   (Here, is, a, sentence, ., And, Another, ., Ye...   (Yet, another, sentence, .)
1   (Different, Document, here, ., With, some, oth...   (Different, Document, here, .)
1   (Different, Document, here, ., With, some, oth...   (With, some, other, sentences, .)

But I would really like to have an index for the 'sentences' column, such as this:

doc_idx   sent_idx     document                                           sentence
0         0            (Here, is, a, sentence, ., And, Another, ., Ye...   (Here, is, a, sentence, .)
          1            (Here, is, a, sentence, ., And, Another, ., Ye...   (And, Another, .)
          2            (Here, is, a, sentence, ., And, Another, ., Ye...   (Yet, another, sentence, .)
1         0            (Different, Document, here, ., With, some, oth...   (Different, Document, here, .)
          1            (Different, Document, here, ., With, some, oth...   (With, some, other, sentences, .)
chase

1 Answer

Starting from your second output, you can reset the index, add a per-document counter with cumcount on the old index, set both columns as the index, and then rename the index levels, i.e.

new_df = ungroup_column(playing_df, 'sentences').reset_index()
new_df['sent_idx'] = new_df.groupby('index').cumcount() 
new_df.set_index(['index','sent_idx']).rename_axis(['doc_idx','sent_idx'])

Output:

                                                               doc       sents
doc_idx sent_idx                                                      
0       0         [Here, is, a, sentence, ., And, Another, ., Ye...     Here is a sentence.
        1         [Here, is, a, sentence, ., And, Another, ., Ye...     And Another.  
        2         [Here, is, a, sentence, ., And, Another, ., Ye...     Yet another sentence.  
1       0         [Different, Document, here, ., With, some, oth...     Different Document here.
        1         [Different, Document, here, ., With, some, oth...     With some other sentences.  
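
The same reset_index / cumcount / set_index steps can be tried without spaCy or nltk. This is only a minimal sketch on a made-up frame: the column names `farm_id`, `animal`, `row_idx` and `item_idx` are illustrative assumptions, not part of the question's data.

```python
import pandas as pd

# Toy frame that already has one row per item and a repeated row label,
# similar in shape to what ungroup_column returns (illustrative data only).
flat = pd.DataFrame(
    {
        "farm_id": [1, 1, 1, 2, 3, 3],
        "animal": ["pig", "sheep", "dog", "duck", "pig", "horse"],
    },
    index=[0, 0, 0, 1, 2, 2],
)

# reset_index() moves the repeated label into a regular column named "index".
flat = flat.reset_index()

# cumcount() numbers the rows within each group 0, 1, 2, ...
flat["item_idx"] = flat.groupby("index").cumcount()

# Use both columns as index levels and give the levels readable names.
indexed = flat.set_index(["index", "item_idx"]).rename_axis(["row_idx", "item_idx"])
print(indexed)
```

A single item can then be looked up with a tuple, for example `indexed.loc[(0, 2), "animal"]`.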

Instead of applying pd.Series, you can use np.concatenate to expand the column. (I used nltk to tokenize the words and sentences.)

import nltk
import numpy as np
import pandas as pd

doc_texts = ['Here is a sentence. And Another. Yet another sentence.',
        'Different Document here. With some other sentences.']
playing_df = pd.DataFrame({'doc':[nltk.word_tokenize(doc) for doc in doc_texts],
                      'sents':[nltk.sent_tokenize(doc) for doc in doc_texts]})

s = playing_df['sents']
# repeat each row position once per sentence in that row
i = np.arange(len(playing_df)).repeat(s.str.len())

# take every column except the last ('sents'), then attach the flattened sentences
new_df = playing_df.iloc[i, :-1].assign(**{'sents': np.concatenate(s.values)}).reset_index()

new_df['sent_idx'] = new_df.groupby('index').cumcount()
new_df.set_index(['index','sent_idx']).rename_axis(['doc_idx','sent_idx'])
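
As a possible alternative that is not part of the original answer: pandas 0.25 and later provide `DataFrame.explode`, which spreads a list column over rows while repeating the original index. The sketch below assumes such a pandas version and uses placeholder strings (`"doc A"`, `"s1"`, ...) instead of tokenized text.

```python
import pandas as pd

# Placeholder data: one row per document, with a list of sentences per row.
docs = pd.DataFrame(
    {
        "doc": ["doc A", "doc B"],
        "sents": [["s1", "s2", "s3"], ["s4", "s5"]],
    }
)

# explode() yields one row per list item and keeps the original row label,
# so reset_index() exposes that label as a column named "index".
exploded = docs.explode("sents").reset_index()

# Number the sentences within each document, then build the two-level index.
exploded["sent_idx"] = exploded.groupby("index").cumcount()
result = exploded.set_index(["index", "sent_idx"]).rename_axis(["doc_idx", "sent_idx"])
print(result)
```

This replaces the manual `np.arange(...).repeat(...)` and `np.concatenate` steps; the cumcount and set_index steps are the same as above.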

Hope it helps.

Bharath M Shetty
  • Thank you so much! This works well. I was also wondering after looking at the [pandas multiindexing documentation](https://pandas.pydata.org/pandas-docs/stable/advanced.html), if you think there is a more appropriate way for dealing with the multiindex, since I noticed that the 'document' level is not repeated as it is after the `ungroup_column` function I have applied here. – chase Sep 12 '17 at 14:42
  • Glad to help @chase. – Bharath M Shetty Sep 12 '17 at 14:44