Using pandas to find the number of duplicated sentences in a DataFrame

Question

I'm trying to find out how many duplicated sentences my DataFrame has, where a duplicate is any exact-match sentence repeated more than once. I'm using DataFrame.duplicated, but it ignores the first occurrence of each sentence. Instead of printing every duplicated row, I want to print each duplicated sentence once, along with its number of occurrences.

The code I'm trying is:

wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)
data=wdata[wdata.duplicated()]
print(data)



#dataframe example
#hi how are you
#hello sam how are you doing
#hello sam how are you doing
#helll Alex how are you doing
#hello sam how are you doing
#let us go eat
#where is the dog
#let us go eat

I want my output to be something like

#hello sam how are you doing   3
#let us go eat                 2

With the duplicated function I get this output:

#hello sam how are you doing
#hello sam how are you doing
#let us go eat

This is the output I'm getting with the second answer:

wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)

data=wdata.groupby(['sentences']).size().reset_index(name='counts')


#                      sentences  counts
#0  hello Alex how are you doing       1
#1   hello sam how are you doing       3
#2                hi how are you       1
#3                 let us go eat       1
#4                let us go eat        1
#5              where is the dog       1

I want my output to be something like

#hello sam how are you doing   3
#let us go eat                 2

For my example it's printing back even the sentences that occur only once, which is why I asked it separately — programming freak, Feb 17 '20 at 09:16
@jezrael I don't want it to return the sentences that occur only once. Is there any solution for that? — programming freak, Feb 17 '20 at 09:19
Can you add your solution and explain why it's not working? Not sure if I understand — jezrael, Feb 17 '20 at 09:20
So you need `data=wdata.groupby(wdata['sentences'].str.strip()).size().reset_index(name='counts') data = data[data['counts'].gt(1)]` ? — jezrael, Feb 17 '20 at 09:25

score 2 · Accepted Answer · answered Feb 17 '20 at 09:29

Because some sentences contain surrounding whitespace, the solution is to remove it with Series.str.strip before counting with GroupBy.size:

data=wdata.groupby(wdata['sentences'].str.strip()).size().reset_index(name='counts')

Then filter with boolean indexing to keep only the rows whose count is greater than 1:

data = data[data['counts'].gt(1)]
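
To make the two steps above easier to try, here is a minimal self-contained sketch. It builds wdata directly from the sentences in the question instead of reading a file, and it assumes the second "let us go eat" carries a trailing space (that detail is a guess that would explain why the counts were split in the question's output).

```python
import pandas as pd

# Sample sentences from the question. The trailing space on the last
# "let us go eat " is an assumption about what the real file contains.
wdata = pd.DataFrame({
    "sentences": [
        "hi how are you",
        "hello sam how are you doing",
        "hello sam how are you doing",
        "helll Alex how are you doing",
        "hello sam how are you doing",
        "let us go eat",
        "where is the dog",
        "let us go eat ",
    ]
})

# Group on the stripped text so "let us go eat" and "let us go eat "
# fall into the same group, then count the rows in each group.
data = (
    wdata.groupby(wdata["sentences"].str.strip())
    .size()
    .reset_index(name="counts")
)

# Boolean indexing: keep only sentences that occur more than once.
data = data[data["counts"].gt(1)]

print(data.to_string(index=False))
```

With this sample, only "hello sam how are you doing" (3) and "let us go eat" (2) should remain.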

Another idea is to use Series.value_counts on the stripped Series, filter it, and finally convert the result to a two-column DataFrame:

s = wdata['sentences'].str.strip().value_counts()
data = s[s.gt(1)].rename_axis('sentences').reset_index(name='counts')
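
The value_counts variant can also be wrapped in a small function if it needs to be reused. This is only a sketch: repeated_sentences is a hypothetical helper name (not part of pandas), and the sample data again assumes one sentence has a trailing space.

```python
import pandas as pd


def repeated_sentences(series: pd.Series) -> pd.DataFrame:
    """Return sentences that occur more than once, with their counts.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Strip surrounding whitespace, then count each distinct sentence.
    s = series.str.strip().value_counts()
    # Keep counts above 1 and convert to a two-column DataFrame.
    return s[s.gt(1)].rename_axis("sentences").reset_index(name="counts")


# Sample sentences from the question (trailing space is an assumption).
sentences = pd.Series(
    [
        "hi how are you",
        "hello sam how are you doing",
        "hello sam how are you doing",
        "helll Alex how are you doing",
        "hello sam how are you doing",
        "let us go eat",
        "where is the dog",
        "let us go eat ",
    ],
    name="sentences",
)

data = repeated_sentences(sentences)
print(data.to_string(index=False))
```

Unlike the groupby version, value_counts orders the result by count, largest first, so the most repeated sentence comes out on top.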
