
Having trouble figuring out how to lemmatize words from a txt file. I've gotten as far as listing the words, but I'm not sure how to lemmatize them after the fact.

Here's what I have:

import nltk, re
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

def lemfile():
    f = open('1865-Lincoln.txt', 'r')
    text = f.read().lower()
    f.close()
    text = re.sub(r"[^a-z' ]+", " ", text)
    words = text.split()
    return words
ArchivistG
3 Answers

4

Initialise a WordNetLemmatizer object and lemmatize each word in your lines. You can perform in-place file I/O using the fileinput module.

# https://stackoverflow.com/a/5463419/4909087
import fileinput
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for line in fileinput.input('1865-Lincoln.txt', inplace=True, backup='.bak'):
    line = ' '.join(
        [lemmatizer.lemmatize(w) for w in line.rstrip().split()]
    )
    # overwrites current `line` in file
    print(line)

fileinput.input redirects stdout to the open file when it is in use.
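To see that redirection in action, here is a minimal self-contained sketch; the file path and the uppercase transformation are made up purely for the demo (substitute your lemmatization step):

```python
import fileinput
import os
import tempfile

# Create a throwaway demo file (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('four score and seven\nyears ago\n')

# While the loop runs, print() writes into demo.txt, not the console.
for line in fileinput.input(path, inplace=True, backup='.bak'):
    print(line.rstrip().upper())

# After the loop, stdout is restored and the file holds the new text;
# the original is preserved in demo.txt.bak.
with open(path) as f:
    print(f.read())
```

The `backup='.bak'` argument is optional but useful: if your transformation mangles the text, the original file is still recoverable.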

cs95
  • Does that mean that I don't have to first list out the words like I have? – ArchivistG Mar 17 '18 at 20:56
  • @ArchivistG You might still need to clean your sentences (using regex, I've omitted this step for simplicity). There is nothing to list, just lemmatise words and write to your file. – cs95 Mar 17 '18 at 20:57
  • So, I tried it, and it didn't QUITE lemmatize anything. It removed the occasional "s" or "ss" , but that's it. It turned "less" into "le", for instance. – ArchivistG Mar 17 '18 at 21:13
  • @ArchivistG Maybe you want to use a stemmer instead? Stemmers are rule based, lemmatisers... it depends. – cs95 Mar 17 '18 at 21:15
  • Well, I want the word origin. So "cats" becomes "cat", "cacti" becomes "cactus" etc. – ArchivistG Mar 17 '18 at 21:19
  • @ArchivistG Works for me when I try it. What issues are you facing? – cs95 Mar 17 '18 at 21:22
  • What I mentioned earlier. All words ending in "ing" still ended in "ing" and all it did was remove a few random "s" at the end of words that didn't need it to be removed to still be the root, such as "less" – ArchivistG Mar 17 '18 at 21:24
  • 1
    @ArchivistG If you look at the docs, `lemmatize` accepts a second argument which is the Part Of Speech (noun, verb, etc). All words are nouns by default, so verbs with -ing are not lemmatized unless you set `pos='v'`. You may instead use `lemmatizer.lemmatize(lemmatizer.lemmatize(w), pos='v')` but beware... it's slow. – cs95 Mar 17 '18 at 21:28
  • I'm going to give that a try when I get home. Leaving a coffee shop now. Thanks! – ArchivistG Mar 17 '18 at 21:30
  • 1
    @ArchivistG Good luck! Feel free to ping if you need any more help (although I'm not sure what I could do beyond this point if this doesn't work for you :p) – cs95 Mar 17 '18 at 21:32
  • So I ran it as a program with the pos='v' and it DID lemmatize the verbs, which is awesome. But it's still turning "less" to "le" (but leaves "address" alone oddly), "was" to "wa", and occasionally it decides not to even singularize words. I'm looking at the word "slave" next to another instance of "slaves." Both were "slaves" before the lemmatization. (It's a Lincoln speech) – ArchivistG Mar 18 '18 at 01:46
0

You can also try a wrapper around NLTK's WordNetLemmatizer in the pywsd package, specifically https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L129

Install:

pip install -U nltk
python -m nltk.downloader popular
pip install -U pywsd

Code:

>>> from pywsd.utils import lemmatize_sentence
>>> lemmatize_sentence('These are foo bar sentences.')
['these', 'be', 'foo', 'bar', 'sentence', '.']
>>> lemmatize_sentence('These are foo bar sentences running.')
['these', 'be', 'foo', 'bar', 'sentence', 'run', '.']

Applied to your question:

from __future__ import print_function
from pywsd.utils import lemmatize_sentence

with open('file.txt') as fin, open('outputfile.txt', 'w') as fout:
    for line in fin:
        print(' '.join(lemmatize_sentence(line.strip())), file=fout)
alvas
0

Lemmatizing a txt file and replacing only the lemmatized words can be done as follows:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from pywsd.utils import lemmatize_sentence

nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()

with open('/home/rahul/Desktop/align.txt', 'r') as f:
    text = f.read()

# Stop word sets, in case you want to filter those words out later
en_stops = set(stopwords.words('english'))
hu_stops = set(stopwords.words('hungarian'))

# If lemmatization of a single string is required, use e.g.
# text = 'this is coming rahul schooling met happiness making'

for line in text.splitlines():
    new_data = ' '.join(lemmatize_sentence(line))
    print(new_data)

PS: Adjust the indentation to your needs. Hope this helps!