I'm trying to find out how many duplicated sentences my dataframe has, that is, any exactly matching sentence that is repeated more than once. I'm using DataFrame.duplicated, but it ignores the first occurrence of each sentence. Instead of printing every duplicated row, I want to print each duplicated sentence once, together with the number of times it occurs.

The code I'm trying is:

wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)
data=wdata[wdata.duplicated()]
print(data)



#dataframe example
#hi how are you
#hello sam how are you doing
#hello sam how are you doing
#hello Alex how are you doing
#hello sam how are you doing
#let us go eat
#where is the dog
#let us go eat 


I want my output to be something like

#hello sam how are you doing   3
#let us go eat                 2

With the duplicated function I get this output:

#hello sam how are you doing
#hello sam how are you doing
#let us go eat

This is the output I'm getting with the second answer:

wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)

data=wdata.groupby(['sentences']).size().reset_index(name='counts')


#                      sentences  counts
#0  hello Alex how are you doing       1
#1   hello sam how are you doing       3
#2                hi how are you       1
#3                 let us go eat       1
#4                let us go eat        1
#5              where is the dog       1

I want my output to be something like

#hello sam how are you doing   3
#let us go eat                 2

programming freak

1 Answer

Because some sentences contain extra whitespace, they are not counted as exact matches. The solution is to remove the whitespace with Series.str.strip and then count with GroupBy.size:

data=wdata.groupby(wdata['sentences'].str.strip()).size().reset_index(name='counts')

And then filter by boolean indexing:

data = data[data['counts'].gt(1)]
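
For reference, here is a minimal self-contained sketch of the two steps above. It assumes the sample sentences from the question are typed in by hand rather than read from the file, and that the last "let us go eat" carries a trailing space, as the question's output suggests:

```python
import pandas as pd

# Sample data retyped from the question (an assumption; the real data
# comes from pd.read_csv). The last row has a trailing space on purpose.
wdata = pd.DataFrame({'sentences': [
    'hi how are you',
    'hello sam how are you doing',
    'hello sam how are you doing',
    'hello Alex how are you doing',
    'hello sam how are you doing',
    'let us go eat',
    'where is the dog',
    'let us go eat ',
]})

# Group on the stripped sentences so 'let us go eat' and 'let us go eat '
# fall into the same group, then count the rows in each group.
counts = (wdata.groupby(wdata['sentences'].str.strip())
               .size()
               .reset_index(name='counts'))

# Keep only the sentences that occur more than once.
data = counts[counts['counts'].gt(1)]
print(data)
```

Only the sentences with a count above 1 remain, so the rows that occur once are dropped.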

Another idea is to use Series.value_counts on the stripped Series, filter the result, and finally convert it to a two-column DataFrame:

s = wdata['sentences'].str.strip().value_counts()
data = s[s.gt(1)].rename_axis('sentences').reset_index(name='counts')
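
The value_counts variant could also be wrapped in a small helper. This is only a sketch: the function name count_duplicates and its min_count parameter are made up for illustration and are not part of pandas, and the sample Series is retyped from the question:

```python
import pandas as pd

def count_duplicates(sentences: pd.Series, min_count: int = 2) -> pd.DataFrame:
    """Hypothetical helper: return each sentence occurring at least
    `min_count` times, with its count, ignoring surrounding whitespace."""
    # value_counts sorts by frequency, most common first.
    s = sentences.str.strip().value_counts()
    return (s[s.ge(min_count)]
            .rename_axis('sentences')
            .reset_index(name='counts'))

# Sample data retyped from the question (an assumption).
sample = pd.Series([
    'hi how are you',
    'hello sam how are you doing',
    'hello sam how are you doing',
    'hello Alex how are you doing',
    'hello sam how are you doing',
    'let us go eat',
    'where is the dog',
    'let us go eat ',  # trailing space, as in the question's data
])

result = count_duplicates(sample)
# Print without the index or header to resemble the desired output.
print(result.to_string(index=False, header=False))
```

Unlike the groupby version, which orders rows by sentence, value_counts orders them by count, so the most frequent sentence comes first.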
jezrael