I have a dataframe df with a column "Content" that contains a list of articles extracted from the internet. I already have code that builds the expected output (two columns, one for the word and the other for its frequency). However, I would like to exclude some words (connectors, for instance) from the analysis. My code is below; what should I add to it?
Is it possible to use get_stop_words('fr') to do this more efficiently (since my articles are in French)?
Source Code
import csv
from collections import Counter
from collections import defaultdict
import pandas as pd
df = pd.read_excel('C:/.../df_clean.xlsx',
                   sheet_name='Articles Scraping')
df = df[df['Content'].notnull()]
d1 = dict()
for line in df[df.columns[6]]:
    words = line.split()
    # print(words)
    for word in words:
        if word in d1:
            d1[word] += 1
        else:
            d1[word] = 1
sort_words = sorted(d1.items(), key=lambda x: x[1], reverse=True)
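
To make the goal concrete, here is a rough sketch of the kind of filtering I have in mind. It is only an illustration under assumptions: FRENCH_STOP_WORDS is a tiny hand-written stand-in (with the stop-words package installed, I assume it could be replaced by set(get_stop_words('fr'))), count_words is a hypothetical helper name, sample_articles stands in for the "Content" column, and lowercasing the text is my own addition so that "Le" and "le" are treated alike.

```python
from collections import Counter

# Hypothetical stand-in for a real French stop-word list. Assumption: with the
# stop-words package installed, set(get_stop_words('fr')) could be used instead.
FRENCH_STOP_WORDS = {"le", "la", "les", "de", "des", "et", "un", "une", "que", "qui"}


def count_words(texts, stop_words):
    """Count word frequencies across texts, skipping any word in stop_words."""
    counts = Counter()
    for text in texts:
        # Lowercase first so capitalised stop words are also filtered out.
        for word in text.lower().split():
            if word not in stop_words:
                counts[word] += 1
    return counts


# Illustrative sample standing in for the articles in df['Content'].
sample_articles = ["Le chat et le chien", "Le chien mange une pomme"]

word_counts = count_words(sample_articles, FRENCH_STOP_WORDS)

# List of (word, frequency) pairs, most frequent first.
sort_words = word_counts.most_common()
```

Using a set for the stop words keeps each membership check cheap, and Counter replaces the manual if/else increment from my original loop.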