
I have a column in a pandas dataframe with millions of rows. Many of the words are non-English (e.g. words from other languages, or strings that do not mean anything, like "**5hjh"). I thought of using WordNet as a comprehensive English dictionary to help me clean up this column, which comprises lists of words. Ideally, the output should be a new column with English words only.

I have tried the following code, which I got from Stack Overflow, but it does not seem to be working, as it returns an empty column with no words whatsoever:

from nltk.corpus import wordnet

def check_for_word(s):
    return ' '.join(w for w in str(s).split(',') if len(wordnet.synsets(w)) > 0)

df["new_column"] = df["original_column"].apply(check_for_word)
feroli
  • You need to provide more information in order for us to help. What does the data in the column look like? Is there only one word per row? If so, why split a string that is already a single word? – DPM Jun 07 '22 at 14:44
  • Hello! Thanks for your question. The column comprises strings with several words separated by commas. For instance: first row: [mr, ugo, sacchetti, october, jack, d]; second row: [36200, itt, world, communications, inc]. I would like only the English words to be saved as strings separated by commas in the new column – feroli Jun 07 '22 at 14:46
  • Sorry, actually the rows are lists. – feroli Jun 07 '22 at 15:00
  • What are you trying to do? What is the real business problem you want to solve? There are ways to solve the actual problem. NLTK can determine the language of a phrase, but that's not needed to store Unicode text to a file or database - just store the text as UTF8 or UTF16. – Panagiotis Kanavos Jun 07 '22 at 15:00
  • [This similar question](https://stackoverflow.com/questions/3182268/nltk-and-language-detection) has a lot of answers that use packages like langdetect, langid or NLTK (see the sketch after these comments). – Panagiotis Kanavos Jun 07 '22 at 15:03
  • @PanagiotisKanavos, thanks. My column comprises lists of words that are quite polluted by non-English words that mean nothing (these are documents that have been OCRed). As part of the pre-processing stage, I need to delete them from my database so only the English words remain. My initial idea was to use Counter to create a file with the frequency of all words, then manually select the meaningless ones (which usually only appear once) and use this file in a function to eliminate these words from my database. But this will demand a lot of time. I will continue in the next comment... – feroli Jun 07 '22 at 15:27
  • So, I found this question (https://stackoverflow.com/questions/50533070/how-to-quickly-check-strings-for-correct-english-words-python) and thought of trying its approach. – feroli Jun 07 '22 at 15:30
  • @PanagiotisKanavos, I have used from nltk.corpus import wordnet rather than from nltk.corpus import words, because wordnet contains a more comprehensive word list – feroli Jun 07 '22 at 15:37
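
For reference, here is a minimal sketch of the language-detection route mentioned in the comments above, using the langdetect package; note that it works on whole phrases or documents, so whether it fits a per-word clean-up is a judgement call:

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic without a fixed seed

# Detection is more reliable on longer text than on single words.
print(detect("itt world communications inc"))     # likely 'en'
print(detect("ceci est une phrase en français"))  # likely 'fr'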

1 Answer


The expression str(s).split(',') creates a list of strings in which every word except the first starts with a whitespace character (assuming str(s) worked as expected). When you then call wordnet.synsets(w), you are looking up a w that has that leading whitespace, which is not in WordNet, so all the synset lists will have length 0.

E.g. len(wordnet.synsets(' october')) will be zero.

I recommend debugging to

  1. check that str(s) really creates a proper string, and
  2. make sure your w's are actually the words (e.g. that they do not start with whitespace). A simple solution could be to use the .strip() method if the only issue is the whitespace (see the sketch below this list).
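
For illustration, here is a minimal sketch of the corrected filter, assuming the column holds either comma-separated strings or lists of already-split words (both cases came up in the comments); the sample dataframe is made up:

import pandas as pd
from nltk.corpus import wordnet  # requires nltk.download('wordnet')

def check_for_word(s):
    # If the cell is already a list, use it as-is; otherwise split on commas.
    tokens = s if isinstance(s, list) else str(s).split(',')
    # Strip surrounding whitespace before the WordNet lookup,
    # so ' october' becomes 'october' and is actually found.
    return ' '.join(w.strip() for w in tokens if len(wordnet.synsets(w.strip())) > 0)

# Hypothetical example data, mirroring the rows described in the comments
df = pd.DataFrame({"original_column": [["mr", "ugo", "sacchetti", "october", "jack", "d"],
                                       ["36200", "itt", "world", "communications", "inc"]]})
df["new_column"] = df["original_column"].apply(check_for_word)
print(df["new_column"])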

If you provide a df and a screenshot of your output for that df, it would be easier to pinpoint the issue.

Update: additional points based on your comments. Thank you, Fernanda. I've read your comments above (in the main thread). Here are a few more items you might find relevant:

  • wordnet contains only a few adverbs, so in your approach, you might be losing some adverbs
  • the synset counting is a bit slow. I'd use the if wordnet.synsets(word): syntax instead. Maybe it will be faster

  • be careful with the idea of filtering by word occurrence counts, as a large proportion of totally valid words are rare (they appear only once in the corpus, even for large corpora). This is related to Zipf's law.
  • consider a regular-expression-based method to filter out words which contain unusual characters (see the sketch below)
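
As an illustration of that last point, here is a minimal sketch; it assumes "unusual" simply means anything other than ASCII letters, and both the pattern and the keep_clean_words name are made up, so adjust them to your data:

import re

# Keep only tokens made purely of ASCII letters, dropping OCR noise
# such as '**5hjh' or '36200'.
WORD_RE = re.compile(r'^[A-Za-z]+$')

def keep_clean_words(tokens):
    return [w for w in tokens if WORD_RE.match(w.strip())]

print(keep_clean_words(["mr", "ugo", "36200", "**5hjh", "world"]))  # ['mr', 'ugo', 'world']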
  • Hello, Ivan! Thanks a lot for your response. I will try to apply it and will let you know if it worked! – feroli Jun 21 '22 at 14:15
  • Thank you Fernanda. I've read your comments above (in the main thread). Here are a few more items you might find relevant: - wordnet contains only a few adverbs, so in your approach you might be losing some adverbs – Ivan Gordeli Jun 23 '22 at 09:28
  • Hello Ivan! This is actually a really important consideration. The thing is: after OCRing my docs, many words came out wrong or were misspelled. I am trying to save the words that I can with a reduce-lengthening function and pyspellchecker library. Still, many 'useless' words are left, which will create noise in my analysis. My idea was to use Wordnet to keep English words only, but I do need to keep verbs, adverbs, nouns and adjectives for sure. Would you maybe have a better suggestion of a dictionary/approach that I could use that would save as much data as possible? Thank you very much! – feroli Jun 24 '22 at 10:23
  • Hi Fernanda, I've added an update based on all your comments to my answer. Not sure if you saw it. Bottom line, I do not know of a dictionary-based method that would work perfectly here. Maybe throwing out some adverbs is a reasonable compromise. Alternatively, I would try to identify what kind of non-words you have in your data and filter them out using regular expressions. For instance, you could easily filter out all words which contain special characters or numbers – Ivan Gordeli Jun 27 '22 at 17:18
  • Actually, you could use a lemmatizer and then check if the word is in an English dictionary. Just do not use a lemmatizer based on wordnet, and use a full dictionary (rather than a Wordnet dictionary). E.g. you could use rule-based lemmatizers https://spacy.io/usage/linguistic-features#lemmatization (see the sketch after these comments) – Ivan Gordeli Jun 28 '22 at 11:14
  • Hello Ivan, thanks a lot for your advice. In the end, I have decided not to use the pyspellchecker library, because it was inserting more errors than it was correcting misspelled words. With regards to the dictionary, I will try a different approach by using fasttext to detect the language of the documents and then select those that are written in English. Let's see what comes out of it. In case this does not work and I am still left with many errors that create noise, I will then use a regular dictionary to select English words only. – feroli Jul 06 '22 at 13:48
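
To make the lemmatizer suggestion in the comments concrete, here is a minimal sketch; it assumes spaCy's en_core_web_sm pipeline and the nltk words corpus as the dictionary, both of which are illustrative choices rather than anything prescribed above:

import spacy
from nltk.corpus import words  # requires nltk.download('words')

# Assumed model; any spaCy English pipeline with a lemmatizer would do.
nlp = spacy.load("en_core_web_sm")
english_vocab = set(w.lower() for w in words.words())

def keep_english(tokens):
    # Lemmatize the tokens, then keep those whose lemma is in the dictionary,
    # so inflected forms like 'communications' match via 'communication'.
    doc = nlp(" ".join(tokens))
    return [t.text for t in doc if t.lemma_.lower() in english_vocab]

print(keep_english(["mr", "ugo", "communications", "36200", "world"]))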