I have a textfile with the following format:
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
... with 1 million items
But some of the word_forms contain an apostrophe ('), others do not, so I would like to count them as instances of the same word, that's to say I would like to merge lines like these two:
cup'board cup blabla 12
cupboard cup blabla2 10
into this one (frequencies added):
cupboard cup blabla2 22
I am searching a solution in Python 2.7 to do that, my first idea was to read the textfile, store in two different dictionaries the words with apostrophe and the words without, then go over the dictionary of words with apostrophe, test if these words are already in the dictionary without apostrophe, if they are actualise the frequency, if not simply add this line with apostrophe removed. Here is my code:
class Lemma:
"""Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
def __init__(self,lop):
self.word_form = lop[0]
self.root = lop[1]
self.morph = lop[2]
self.freq = int(lop[3])
def Reader(filename):
"""Keeps the lines of a file in memory for a single reading, memory efficient"""
with open(filename) as f:
for line in f:
yield line
def get_word_dict(filename):
'''Separates the word list into two dictionaries, one for words with apostrophe and one for words with apostrophe'''
'''Works in a reasonable time'''
'''This step can be done writing line by line, avoiding all storage in memory'''
word_dict = {}
word_dict_striped = {}
# We store the lemmas in two dictionaries, word_dict for words without apostrophe, word_dict_striped for words with apostrophe
with open('word_dict.txt', 'wb') as f:
with open('word_dict_striped.txt', 'wb') as g:
reader = Reader(filename)
for line in reader:
items = line.split("\t")
word_form = items[0]
if "'" in word_form:
# we remove the apostrophe in the word form and morphological analysis and add the lemma to the dictionary word_dict_striped
items[0] = word_form.replace("'","")
items[2] = items[2].replace("\+Apos", "")
g.write( "%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
word_dict_striped({items[0] : Lemma(items)})
else:
# we just add the lemma to the dictionary word_dict
f.write( "%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
word_dict.update({items[0] : Lemma(items)})
return word_dict, word_dict_striped
def merge_word_dict(word_dict, word_dict_striped):
'''Takes two dictionaries and merge them by adding the count of their frequencies if there is a common key'''
''' Does not run in reasonable time on the whole list '''
with open('word_compiled_dict.txt', 'wb') as f:
for word in word_dict_striped.keys():
if word in word_dict.keys():
word_dict[word].freq += word_dict_striped[word].freq
f.write( "%s\t%s\t%s\t%s" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
else:
word_dict.update(word_dict_striped[word])
print "Number of words: ",
print(len(word_dict))
for x in word_dict:
print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq
return word_dict
This solution works in a reasonable time till the storage of the two dictionaries, whether I write in two textfiles line by line to avoid any storage or I store them as dict objects in the program. But the merging of the two dictionaries never ends!
The function 'update' for dictionaries would work but override one frequency count instead of adding the two. I saw some solutions of merging dictionaries with addition with Counter: Python: Elegantly merge dictionaries with sum() of values Merge and sum of two dictionaries How to sum dict elements How to merge two Python dictionaries in a single expression? Is there any pythonic way to combine two dicts (adding values for keys that appear in both)? but they seem to work only when the dictionaries are of the form (word, count) whereas I want to carry the other fields in the dictionary as well.
I am open to all your ideas or reframing of the problem, since my goal is to have this program running once only to obtain this merged list in a text file, thank you in advance!