I have a dataframe `df` with a column "Content" that contains a list of articles extracted from the internet. I already have code that builds a dataframe with the expected output (two columns, one for the word and the other for its frequency). However, I would like to exclude some words (connectors, for instance) from the analysis. My code is below; what should I add to it?

Is it possible to use `get_stop_words('fr')` to do this more efficiently, since my articles are in French?

Source Code

    import csv
    from collections import Counter
    from collections import defaultdict

    import pandas as pd


    df = pd.read_excel('C:/.../df_clean.xlsx', 
                                sheet_name='Articles Scraping')
    df = df[df['Content'].notnull()]
    d1 = dict()

    for line in df[df.columns[6]]:
        words = line.split()
        # print(words)
        for word in words:
            if word in d1:
                d1[word] += 1
            else:
                d1[word] = 1

    sort_words = sorted(d1.items(), key=lambda x: x[1], reverse=True)
  • [How to filter Pandas dataframe using 'in' and 'not in' like in SQL](https://stackoverflow.com/q/19960077/16509954) – the_strange Oct 25 '22 at 11:51
  • Create a list of words and then use `str.contains` to return a dataframe where you apply the NOT condition: `~df.C.str.contains()`. – DarknessPlusPlus Oct 25 '22 at 11:52
  • Create a list of "connectors" that you want to avoid, then apply an "include" check. You could use the Python `in` operator or the equivalent operation from the `pandas` library; it is up to you. The check should be written between the `for` loop and the first `if` conditional. – Franco Gil Oct 25 '22 at 12:04
  • Thank you for your answer. However, it's not that clear to me where and how exactly to add this "include" operator. Would it be possible for you to include it in my code? Thanks in advance! – Nico Larrea Avila Oct 25 '22 at 12:19

1 Answer

There are a few ways you can achieve this. You can either use the `isin()` method with a list comprehension:

    import pandas as pd

    data = {'test': ['x', 'NaN', 'y', 'z', 'gamma']}
    df = pd.DataFrame(data)

    words = ['x', 'y', 'NaN']

    # Keep only the rows whose value is not in `words`
    df = df[~df.test.isin([word for word in words])]

Or you can negate `str.contains()` and build the pattern with a join:

    df = df[~df.test.str.contains('|'.join(words))]
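Note that the two filters above are not interchangeable: `isin()` compares whole cell values, while `str.contains()` treats the pattern as a regular expression and matches substrings. Here is a minimal sketch of the difference, using a made-up column (the sample values, including `taxi`, are assumptions chosen only to illustrate the point):

```python
import re

import pandas as pd

# Hypothetical sample data; "taxi" contains the letter "x" as a substring.
df = pd.DataFrame({"test": ["x", "y", "z", "gamma", "taxi"]})
words = ["x", "y"]

# isin() drops a row only when the whole cell equals one of the words.
exact = df[~df["test"].isin(words)]

# str.contains() matches substrings, so "taxi" is dropped as well.
# re.escape() protects against words that contain regex metacharacters.
pattern = "|".join(re.escape(w) for w in words)
substring = df[~df["test"].str.contains(pattern)]

# Word boundaries restrict the match to whole words inside longer text.
whole_word = df[~df["test"].str.contains(rf"\b(?:{pattern})\b")]

print(exact["test"].tolist())
print(substring["test"].tolist())
print(whole_word["test"].tolist())
```

Both approaches remove entire rows. If each row holds a full article, dropping rows is probably not what you want for a word-frequency count, so filtering individual words is likely the better fit.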

If you want to utilize the stop words package for French, you can also do that, but you must preprocess all of your texts before you start doing any frequency analysis.

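    # Assumption: `stopwords` is the stopwordsiso package, which exposes stopwords(lang)
    import stopwordsiso as stopwords

    french_stopwords = set(stopwords.stopwords("fr"))

    # Convert to a list so you can add your own words
    STOPWORDS = list(french_stopwords)
    STOPWORDS.extend(['add', 'new', 'words', 'here'])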

I think the extend() will help you tremendously.
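To connect this to the counting loop in the question, one option is to skip any word that appears in the stop-word collection before counting it. The sketch below is only an illustration under a few assumptions: `articles` is a hypothetical stand-in for the "Content" column, and `stop_words` is a tiny hand-written set rather than a real French stop-word list. It also lowercases tokens and strips surrounding punctuation, which the original loop does not do.

```python
from collections import Counter

# Hypothetical sample standing in for the "Content" column of the dataframe.
articles = [
    "Le chat et le chien sont dans la maison.",
    "La maison est grande et le jardin est petit.",
]

# Hand-written stop words for illustration only. In practice this could be
# set(get_stop_words('fr')) from the stop-words package, plus your own words.
stop_words = {"le", "la", "les", "et", "est", "sont", "dans", "de", "un", "une"}

counts = Counter()
for line in articles:
    for word in line.split():
        # Normalise case and trim punctuation so "Le" and "maison." match.
        token = word.lower().strip(".,;:!?\"'()«»")
        if token and token not in stop_words:
            counts[token] += 1

# List of (word, frequency) pairs, most frequent first.
sort_words = counts.most_common()
print(sort_words)
```

Using a set for the stop words keeps each membership test cheap, and the resulting pairs can be passed to `pd.DataFrame(sort_words, columns=[...])` with whatever column names you prefer.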

DarknessPlusPlus
  • Thanks! But when I run this command I get "'DataFrame' object has no attribute 'test'" as a result. I think it is because I am analyzing a column in a dataframe and not a dictionary ('data' in your example). – Nico Larrea Avila Oct 25 '22 at 12:14
  • The `test` in `df[~df.test.isin([word for word in words])]` is the name of the column, so just replace it with your column name. I am calling that on my dataframe (`df`), not on `data`. – DarknessPlusPlus Oct 25 '22 at 12:30