How to count the number of repetitions of words, assign a number, and append it to a DataFrame

Question

I have a dataset of abstracts and the gender of each author. I want to count how often each word is repeated per gender, so that I can plot the number of word repetitions against gender.

import pandas as pd

data_path = '/content/digitalhumanities - forum-and-fiction.csv'
def change_table(data_path):
  # load the CSV and keep only the Gender and Abstract columns
  df = pd.read_csv(data_path)
  final = df.drop(["Title", "Author", "Season", "Year", "Keywords", "Issue No", "Volume"], axis=1)
  fin = final.set_index('Gender')
  return fin
change_table(data_path).T

This is the output I got:
| Gender   | None                                              | Female                                            | Male                                              | None       | None                                  | Male                                              ,Female                                            |None                                              | Male                                             ,Female                                            |
|:----------|---------------------------------------------------|---------------------------------------------------|---------------------------------------------------|------------|---------------------------------------|---------------------------------------------------|---------------------------------------------------|---------------------------------------------------|---------------------------------------------------|---------------------------------------------------:|
| Abstract | This article describes Virginia Woolf's preocc... | The Amazonian region occupies a singular place... | This article examines Kipling's 1901 novel Kim... | Pamela; or | Virtue Rewarded uses a literary fo... | This article examines Nuruddin Farah's 1979 no... | Ecological catastrophe has challenged the cont... | British political fiction was a satirical genr... | The Lydgates have bought too much furniture an...

Now, how can I get the number of repetitions of each word in the abstracts with respect to gender and append it to the DataFrame?

Expected output example:

| gender   | male | female | none |
|----------|------|--------|------|
| This     | 3    | 0      | 0    |
| occupies | 5    | 3      | 0    |
| examines | 6    | 0      | 0    |
| British  | 0    | 0      | 7    |

. . .

@jezrael I also have some columns with both female and male, separated by a comma. — Mar 17 '22 at 08:16

jezrael · Accepted Answer · 2022-03-17T09:14:10.970


Use pd.crosstab after reshaping with DataFrame.stack and splitting the abstracts into words:

#call without the .T transpose
df = change_table(data_path)

#reshape with split columns
df = (df.stack()
        .rename_axis(('Type','Gender'))
        .str.split(expand=True)
        .stack()
        .reset_index(name='Word'))

#explode Type by splitting on ','
df = df.assign(Type = df['Type'].str.split(',')).explode('Type')

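#remove stopwords (requires a one-time nltk.download('stopwords'))
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

df = df[~df['Word'].isin(stop_words)].copy()

#remove punctuation (regex=True is needed in pandas 2.0+, where str.replace matches literally by default)
df['Word'] = df['Word'].str.replace(r'[^\w\s]+', '', regex=True)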

#get counts per Gender, Word and Type
df1 = pd.crosstab([df['Gender'], df['Word']], df['Type']).reset_index()

#or get counts per Word and Type
df2 = pd.crosstab(df['Word'], df['Type'])
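
For reference, here is a minimal self-contained sketch of the same idea (split the gender labels, split the abstracts into words, clean, then cross-tabulate). It is only an illustration under stated assumptions: the toy rows and the tiny `STOP_WORDS` set are made up, and it lowercases words and explodes list columns directly instead of using `stack`, which is a variation on the steps above rather than the exact same code.

```python
import pandas as pd

# Toy data standing in for the real CSV (hypothetical abstracts and labels).
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male,Female", "None"],
    "Abstract": [
        "This article examines empire.",
        "The region occupies a singular place!",
        "This article examines fiction; fiction matters.",
        "British political fiction was satire.",
    ],
})

# Tiny illustrative stop-word list (an assumption); nltk's list can replace it.
STOP_WORDS = {"this", "the", "a", "was"}

words = (
    df.assign(
        # "Male,Female" -> ["Male", "Female"], so each label gets its own row
        Gender=df["Gender"].str.split(","),
        # lowercase, drop punctuation, then split on whitespace
        Word=df["Abstract"].str.lower()
                           .str.replace(r"[^\w\s]+", "", regex=True)
                           .str.split(),
    )
    .explode("Gender", ignore_index=True)
    .explode("Word", ignore_index=True)
)

# Tidy stray spaces around labels such as "Male ,Female", then drop stop words.
words["Gender"] = words["Gender"].str.strip()
words = words[~words["Word"].isin(STOP_WORDS)]

# Rows are words, columns are gender labels, values are occurrence counts.
counts = pd.crosstab(words["Word"], words["Gender"])
print(counts)
```

An abstract labelled with both genders contributes its words to both columns, which is what the asker requested in the comments.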

edited Mar 17 '22 at 09:14

answered Mar 17 '22 at 07:24

jezrael


Have you considered that some columns have both male and female, separated by a comma? I need to split them so that the word counts are added to both male and female. Also, when I check the data after splitting, I still get special characters and stop words. How can I remove them? @jezrael – Mar 17 '22 at 08:13
@daylightisminetocommand - I added exploding Type by `,` and stop-word removal to the answer. What do you mean by special characters? Can you explain more? – jezrael Mar 17 '22 at 08:25
?, " ", ., :, /, {, }, !, #, %, &. These characters are present in my text and I want to remove them. @jezrael – Mar 17 '22 at 08:29
I have added the image. Please take a look; you will see what I mean by special characters and why I want them removed. – Mar 17 '22 at 08:31
@daylightisminetocommand - to remove punctuation use `df['text'] = df['text'].str.replace(r'[^\w\s]+', '', regex=True)` - solution from [this](https://stackoverflow.com/questions/50444346/fast-punctuation-removal-with-pandas) – jezrael Mar 17 '22 at 09:13
