
I am trying to split the Comments column into multiple rows, one row per sentence. I used the following Stack Overflow thread as a reference, since it produces a similar result. Reference link: pandas: How do I split text in a column into multiple rows? Sample data of the dataframe is below.

Id  Team  Food_Text
1   X     Food is good. It is very well cooked. Delicious!
2   X     I hate the squid. Food is not cooked well. At all indeed.
3   X     Please do not good anytime over here
4   Y     I love the fish. Awesome delicacy.
5   Y     Good for desserts. Meat tastes bad

Each 'Food_Text' record can contain multiple sentences delimited by a full stop (period). I have used the following code:

import numpy as np
import pandas as pd

survey_data = pd.read_csv("Food_Dummy.csv")
survey_text = survey_data[['Id','Team','Food_Text']]

# Get s as a pandas Series: split on the full stop so each sentence becomes a new row
s = survey_text["Food_Text"].str.split('.').apply(pd.Series,1).stack()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'Food_Text' # needs a name to join

# There are blank or empty cell values after the above process. Removing them
s.replace('', np.nan, inplace=True)
s.dropna(inplace=True)
x=s.to_frame(name='Food_Text1')
x.head(10)

# Joining should ideally give me the proper output, but I am getting the original dataframe instead of the split one.
survey_text.join(x)
survey_text.head(10)

I am not sure why the join is not giving me a dataframe with more rows, with the other columns repeated based on the index of the split. For example, Id=1 has 3 sentences, so there should be 3 records with all other data the same and the Food_Text column holding one sentence each. Similarly for the other records.
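For illustration, the expected output for Id=1 (taken from the sample data above; the exact index and spacing are only illustrative) would be:

Id  Team  Food_Text
1   X     Food is good
1   X     It is very well cooked
1   X     Delicious!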

Thank You in advance for your help! Regards, Sohil Shah

2 Answers


In your example, the result of the join is only printed: survey_text.join(x) returns a new DataFrame rather than modifying survey_text in place. If you want to change the value of survey_text, assign the result back:

survey_text = survey_text.join(x)

Or, if you want to simplify your code, the following works as well:

import numpy as np
import pandas as pd

survey_data = pd.read_csv("Food_Dummy.csv")
survey_text = survey_data[['Id','Team','Food_Text']]

# Get s as a pandas Series: split on the full stop so each sentence becomes a new row
s = survey_text["Food_Text"].str.split('.').apply(pd.Series,1).stack()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'Food_Text' # needs a name to join

# There are blank or empty cell values after the above process. Removing them
s.replace('', np.nan, inplace=True)
s.dropna(inplace=True)

# Drop the original Food_Text column so the join does not create duplicate columns
del survey_text['Food_Text']
survey_text = survey_text.join(s)
survey_text.head(10)

This way you will not have multiple "Food_Text" columns in your DataFrame.
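As a side note, if you are on pandas 0.25 or newer, DataFrame.explode gives the same one-row-per-sentence result with less index bookkeeping. A minimal sketch, assuming the same Food_Dummy.csv layout as above:

import pandas as pd

survey_data = pd.read_csv("Food_Dummy.csv")
survey_text = survey_data[['Id','Team','Food_Text']]

# Split each comment into a list of sentences, then explode into one row per sentence
survey_text = survey_text.assign(Food_Text=survey_text['Food_Text'].str.split('.'))
survey_text = survey_text.explode('Food_Text')

# Strip surrounding whitespace and drop the empty strings left by trailing periods
survey_text['Food_Text'] = survey_text['Food_Text'].str.strip()
survey_text = survey_text[survey_text['Food_Text'] != '']
survey_text.head(10)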

Fernandoms
  • Thank You! Both techniques worked! Appreciate your help! – Sohil Feb 28 '19 at 04:49
  • Just 2 quick questions: 1. After splitting this way, I still have blank rows, probably containing 1 or 2 spaces. I tried using the regex '\s+' in replace but it did not work. Could you suggest how to get rid of the extra rows with just spaces? 2. Splitting on a period breaks up numbers in the text like 1.50 or 7.6$. I am okay if the numbers go, but could we avoid this and treat them as one entity? Splitting on '. ' gives only single words for each row, which is not what I want. – Sohil Mar 01 '19 at 16:12

Instead of

s = survey_text["Food_Text"].str.split('.').apply(pd.Series,1).stack()

A better way to split text into sentences is to use NLTK's sentence tokenizer:

from nltk.tokenize import sent_tokenize
s = survey_text["Food_Text"].apply(lambda x : sent_tokenize(x)).apply(pd.Series,1).stack()
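Note that sent_tokenize needs the NLTK punkt models downloaded once. A minimal sketch of the full flow with the tokenizer, assuming the same DataFrame as in the question; punkt generally keeps decimals such as 1.50 inside one sentence instead of splitting on them:

import nltk
import pandas as pd
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-time download of the sentence tokenizer models

survey_data = pd.read_csv("Food_Dummy.csv")
survey_text = survey_data[['Id','Team','Food_Text']]

# One row per sentence; stack() produces a MultiIndex Series
s = survey_text['Food_Text'].apply(sent_tokenize).apply(pd.Series).stack()
s.index = s.index.droplevel(-1)  # line up with the original index
s.name = 'Food_Text'             # needs a name to join

del survey_text['Food_Text']
survey_text = survey_text.join(s)
survey_text.head(10)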
Shakti
  • Can this be adapted to a parallel processing library like Dask? How can I apply a custom sentence-splitting function in something like Dask, such that a new "exploded" dataframe is created? – Chris Oct 12 '20 at 23:38