I have a dataframe that consists of two columns: ID and TEXT. Pretend data is below:
ID TEXT
1 The farmer plants grain. The fisher catches tuna.
2 The sky is blue.
2 The sun is bright.
3 I own a phone. I own a book.
I am performing cleansing on the TEXT column with nltk, so I need to convert the TEXT column to a list:
corpus = df['TEXT'].tolist()
After performing the cleansing (tokenization, removing special characters, and removing stopwords), the output is a nested "list of lists": one entry per original row, each holding one list of tokens per sentence. It looks like this:
[[['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']],
[['sky', 'blue']],
[['sun', 'bright']],
[['I', 'own', 'phone'], ['I', 'own', 'book']]]
I know how to get a flat list back into a pandas dataframe, but how do I get this list of lists back into a pandas dataframe with each sentence still assigned to its original ID? My desired output is:
ID TEXT
1 'farmer', 'plants', 'grain'
1 'fisher', 'catches', 'tuna'
2 'sky', 'blue'
2 'sun', 'bright'
3 'I', 'own', 'phone'
3 'I', 'own', 'book'
I'm assuming it is something simple related to conversion between Python data structures, but I'm not sure where to start with this. The specific work product here is less important than the concept of dataframe --> native Python data structure --> do something to native Python data structure --> dataframe with original attributes intact.
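To make the round trip concrete, here is a minimal sketch in plain Python of the pairing step. It assumes the cleaned list keeps the same order and length as the dataframe rows. The helper name `pair_ids_with_sentences` and the hard-coded `ids` list (standing in for `df['ID'].tolist()`) are made up for illustration.

```python
# IDs and cleaned output as plain Python lists, in the same row order as the dataframe.
ids = [1, 2, 2, 3]  # stand-in for df['ID'].tolist()
cleaned = [
    [['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']],
    [['sky', 'blue']],
    [['sun', 'bright']],
    [['I', 'own', 'phone'], ['I', 'own', 'book']],
]


def pair_ids_with_sentences(ids, cleaned):
    """Return one (ID, tokens) row per cleaned sentence, repeating the ID as needed."""
    if len(ids) != len(cleaned):
        raise ValueError("ids and cleaned must have one entry per original row")
    return [
        (row_id, tokens)
        for row_id, sentences in zip(ids, cleaned)  # walk the original rows in order
        for tokens in sentences                     # one output row per sentence
    ]


rows = pair_ids_with_sentences(ids, cleaned)
for row_id, tokens in rows:
    print(row_id, tokens)
```

With pandas imported as `pd`, passing these rows to `pd.DataFrame(rows, columns=['ID', 'TEXT'])` should rebuild a dataframe in the desired layout, with each TEXT cell holding a list of tokens.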
Any insight you all can provide is greatly appreciated! Please let me know if I can improve my question at all!