I can't use nltk because of download issues at work, so I wanted to create a function that removes Dutch stopwords. I have a text file with Dutch stopwords that I want to read in and use to find stopwords in a pandas dataframe. I saved the data as a .txt file, but I get duplicates. Could someone help me with this issue? I wrote the code below.

import pandas as pd 
import numpy as np
import re 
from nltk.tokenize import word_tokenize

dictionary = {'í':'i','á':'a','ö': 'o','ë':'e'}
pd.set_option('display.max_colwidth', None)
df = pd.read_csv('Map1.csv', error_bad_lines=False, encoding='latin1')
df.replace(dictionary, regex=True, inplace=True)
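# I want to remove the stopwords from df['omschrijving skill']
stopwords = ['de', 'van', 'ik', 'te', 'dat', 'die', 'in', 'een', 'hij', 'het', 'niet', 'zijn', 'is', 'was', 'of', 'aan']
# `query` should hold the text to clean; it is not defined anywhere in this snippet
querywords = query.split()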

resultwords  = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)

print(result)
  • Where is the pandas data frame in your code? – Mykola Zotko Jan 28 '21 at 13:25
  • Does this answer your question? [Removing list of words from a string](https://stackoverflow.com/questions/25346058/removing-list-of-words-from-a-string) – Mykola Zotko Jan 28 '21 at 13:27
  • @MykolaZotko I wrote my pandas dataframe to a txt file – Leyla Elkhamlichi Jan 29 '21 at 08:52
  • @MykolaZotko I saw that question, but it is not working for me. I want to delete the stopwords in a column of a DataFrame, but somehow it's not working. Do you have any advice on how I can handle this problem? – Leyla Elkhamlichi Jan 29 '21 at 08:55
  • You don't need to save your pandas dataframe to text file. You can remove stop words directly from the dataframe. – Mykola Zotko Jan 29 '21 at 09:20
  • @MykolaZotko I'm stuck at removing the words from the DataFrame. I have one column where I want to remove them, but I don't understand how to do that using the above code. Could you give me some advice? I updated my code again. – Leyla Elkhamlichi Feb 09 '21 at 15:15

1 Answer

Perhaps use something like this:

from nltk.tokenize import word_tokenize
# you could get those from here https://raw.githubusercontent.com/stopwords-iso/stopwords-nl/master/stopwords-nl.txt
stopwords_to_remove = ['aan',
'aangaande',
'aangezien',
'achte',
'achter',
'achterna']

text = "Nick likes achter to play football, aangezien however he is achter not too fond of tennis."
#text_tokens = word_tokenize(text)
# split on whitespace instead, so nltk's tokenizer is not needed
text_tokens = text.split()

tokens_without_sw = [word for word in text_tokens if word not in stopwords_to_remove]

print(tokens_without_sw)
['Nick', 'likes', 'to', 'play', 'football,', 'however', 'he', 'is', 'not', 'too', 'fond', 'of', 'tennis.']
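
Note that splitting on spaces keeps punctuation attached to the words ('football,') and the comparison is case-sensitive. A minimal sketch that handles both, assuming a hypothetical helper name `remove_stopwords` and the same small stopword sample as above (a regex word pattern is swapped in for the plain split, so a word like "zo'n" would be cut at the apostrophe):

```python
import re

# Same illustrative sample as above; a set makes membership checks fast
STOPWORDS = {"aan", "aangaande", "aangezien", "achte", "achter", "achterna"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Return text with stopwords dropped, matching case-insensitively."""
    # \w+ picks out runs of word characters, so commas and periods are discarded
    tokens = re.findall(r"\w+", text)
    return " ".join(t for t in tokens if t.lower() not in stopwords)

print(remove_stopwords("Nick likes Achter to play football, aangezien he is achter."))
# Nick likes to play football he is
```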

Adapting the above for a dataframe:

import pandas as pd
import string
# you could get those from here https://raw.githubusercontent.com/stopwords-iso/stopwords-nl/master/stopwords-nl.txt
stopwords_to_remove = ['aan',
'aangaande',
'aangezien',
'achte',
'achter',
'achterna']
df = pd.DataFrame(['aangezien however he is achter not too '  , 'achter to play football'])
def a_tokenizer(x):
    # to remove punctuation
    x = str(x).translate(str.maketrans('', '', string.punctuation))
    # to lower case and create tokens; split() without arguments
    # drops the empty token that a trailing space would otherwise produce
    text_tokens = [word.lower() for word in x.split()]
    # to remove stopwords
    tokens_without_sw = [word for word in text_tokens if word not in stopwords_to_remove]
    return tokens_without_sw

df[0].apply(a_tokenizer)
0    [however, he, is, not, too]
1           [to, play, football]
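
To use this on one named column with the stopwords read from a text file, something along these lines might work. It is only a sketch: the helper names `load_stopwords` and `strip_stopwords` are hypothetical, the sample rows are made up, and an in-memory `io.StringIO` stands in for the real stopword file (assumed to hold one word per line) so that the example runs on its own.

```python
import io
import pandas as pd

def load_stopwords(handle):
    """Read one stopword per line into a lowercase set; a set also drops duplicates."""
    return {line.strip().lower() for line in handle if line.strip()}

def strip_stopwords(text, stopwords):
    """Return the text with stopwords removed, joined back into a single string."""
    return " ".join(w for w in str(text).split() if w.lower() not in stopwords)

# Stand-in for open('stopwords-nl.txt', encoding='utf-8'); the file name is an assumption
stopword_file = io.StringIO("de\nhet\neen\nvan\nde\n")
stopwords = load_stopwords(stopword_file)

# Made-up rows; the column name is taken from the question
df = pd.DataFrame({"omschrijving skill": ["De kennis van het vak", "een goede planner"]})

# Overwrite only this column; missing values are treated as empty strings
df["omschrijving skill"] = (
    df["omschrijving skill"].fillna("").apply(lambda x: strip_stopwords(x, stopwords))
)
print(df["omschrijving skill"].tolist())
# ['kennis vak', 'goede planner']
```

Returning a joined string rather than a list of tokens keeps the column as plain text, which is convenient if the dataframe is written back to a file afterwards.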
Rafael Valero
  • @rafeal this could work, only I don't know how to use it for a specific column in my dataframe; that is where I'm stuck. Do you have any advice? – Leyla Elkhamlichi Feb 09 '21 at 15:19