0

Im doing some text analysis and trying to loop through a dataframe consisting of a list of words in one column and some numeric values in other columns. I want to split out all words in the list column on to different rows and also bring with them the values that were on the same row. I want the code to be usable by others I share with so I've made the code so that they have to input needed columns only once earlier in the code.

I've managed to loop through the dataframe, split out the words and attribute the values when I specify the column names but when I try and make the loop dynamic I cant seem to get the syntax right:

TokensTable = pd.DataFrame({'Token': [], 'Value1': [],'Value2': [],'Value3': []})

counter = 0

for index, row in metricsByToken2.iterrows():           #for each row in the dataframe with values and the token lists

for index2, token in enumerate(row[0]):             #for each token in the list of tokens in each row

    if token.isalpha():                             #If the token doesnt contain punctuation then
        token = token.lower()                       #lowercase the token
        if token in stop_words:                     #if the token is a stop word then
            del token                               #delete the token
        else:
            TokensTable.loc[counter] = [row[0][index2]] + [row[1]] + [row[2]] + [row[3]]
            counter = counter + 1                   #increase counter to move to the next row in new df
    else:
        del token 

So if there was the list ['A','B','C'] with other columns 200,300,400 then I would want that split on to 3 separate rows for example: 'A',200,300,400 then 'B',200,300,400 and 'C',200,300,400.

This code above is what has worked for me so far but I have manually specified [Row[1] + [Row[2] etc. [row[0][index2]] will be there for every time I run the code so that has to stay but the number of other columns added on the same row will change each time. The number of columns needed will always be as far as len(TokensTable)-1 though so I need to loop somehow from 0 to len(TokensTable)-1 Im guessing but so far I've had no luck figuring this out so any help would be really appreciated

Example Input:

╔══════════════════╦════════╦════════╦════════╗
║       Text       ║ Value1 ║ Value2 ║ Value3 ║
╠══════════════════╬════════╬════════╬════════╣
║ ['A','B','C']    ║      1 ║      3 ║      7 ║
║ ['A1','B1','C1'] ║      2 ║      4 ║      8 ║
║ ['A2','B2','C2'] ║      3 ║      5 ║      9 ║
╚══════════════════╩════════╩════════╩════════╝

Example Output:

╔═══════╦════════╦════════╦════════╗
║ Token ║ Value1 ║ Value2 ║ Value3 ║
╠═══════╬════════╬════════╬════════╣
║ A     ║      1 ║      3 ║      7 ║
║ B     ║      1 ║      3 ║      7 ║
║ C     ║      1 ║      3 ║      7 ║
║ A1    ║      2 ║      4 ║      8 ║
║ B1    ║      2 ║      4 ║      8 ║
║ C1    ║      2 ║      4 ║      8 ║
║ A2    ║      3 ║      5 ║      9 ║
║ B2    ║      3 ║      5 ║      9 ║
║ C2    ║      3 ║      5 ║      9 ║
╚═══════╩════════╩════════╩════════╝
  • please provide a sample input and desired output – Julia May 17 '19 at 16:28
  • Hi @JuliaBogutskaya, I just added in the sample input and output there, thanks for looking! –  May 17 '19 at 17:01
  • 1
    Possible duplicate of [How to unnest (explode) a column in a pandas DataFrame?](https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe) – m13op22 May 17 '19 at 17:03

1 Answers1

0

Thanks for the link @HS-nebula it led me to the answer I needed. In the end I used a loop afterwards to clean the aggregated tokens but to un-nest them I used the following:

TokensTable = metricsByToken2.apply(lambda x: pd.Series(x['Token']),axis=1).stack().reset_index(level=1, drop=True)
TokensTable.name = 'Token'
TokensTable = metricsByToken2.drop('Token', axis=1).join(TokensTable)
TokensTable = TokensTable.reset_index(drop=True)