
I have a textfile with the following format:

word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency

... with 1 million items

Some of the word_forms contain an apostrophe (') and others do not, and I would like to count them as instances of the same word. That is, I would like to merge lines like these two:

cup'board   cup     blabla  12
cupboard    cup     blabla2 10

into this one (frequencies added):

cupboard    cup     blabla2  22

I am looking for a solution in Python 2.7. My first idea was to read the text file and store the words with an apostrophe and the words without in two different dictionaries, then go over the dictionary of words with an apostrophe and test whether each word is already in the dictionary of words without: if it is, update the frequency; if not, simply add the line with the apostrophe removed. Here is my code:

class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
    def __init__(self,lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])

def Reader(filename):
    """Yields the lines of a file one at a time, so the whole file is never held in memory"""
    with open(filename) as f:
        for line in f:
            yield line

def get_word_dict(filename):
    '''Separates the word list into two dictionaries, one for words without apostrophe and one for words with apostrophe.
    Works in a reasonable time.
    This step can be done writing line by line, avoiding all storage in memory.'''
    word_dict = {}
    word_dict_striped = {}

    # We store the lemmas in two dictionaries, word_dict for words without apostrophe, word_dict_striped for words with apostrophe   
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:

            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and morphological analysis and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'","")
                    # str.replace matches literally (no regex), so the tag is written without a backslash
                    items[2] = items[2].replace("+Apos", "")

                    g.write( "%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0] : Lemma(items)})
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write( "%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0] : Lemma(items)})

    return word_dict, word_dict_striped

def merge_word_dict(word_dict, word_dict_striped):
    '''Takes two dictionaries and merge them by adding the count of their frequencies if there is a common key'''
    ''' Does not run in reasonable time on the whole list '''

    with open('word_compiled_dict.txt', 'wb') as f:

        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                f.write( "%s\t%s\t%s\t%s\n" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                # no version without apostrophe exists, so add this entry as it is
                word_dict[word] = word_dict_striped[word]

    print "Number of words: ",
    print(len(word_dict))

    for x in word_dict:
        print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq

    return word_dict

This solution works in a reasonable time till the storage of the two dictionaries, whether I write in two textfiles line by line to avoid any storage or I store them as dict objects in the program. But the merging of the two dictionaries never ends!

The dictionary method 'update' would work, but it overwrites one frequency count instead of adding the two. I saw some solutions for merging dictionaries with addition using Counter ("Python: Elegantly merge dictionaries with sum() of values", "Merge and sum of two dictionaries", "How to sum dict elements", "How to merge two Python dictionaries in a single expression?", "Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?"), but they seem to work only when the dictionaries are of the form (word, count), whereas I want to carry the other fields in the dictionary as well.
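To illustrate what I mean, here is a minimal sketch with made-up entries (the words and counts are only for illustration). Counter adds the counts for a shared key, but there is no place in it for the root or the morphological form:

```python
from collections import Counter

# Made-up entries: each Counter maps a word form to its frequency only.
with_apostrophe = Counter({"cupboard": 12})
without_apostrophe = Counter({"cupboard": 10, "table": 3})

# Adding two Counters sums the values of keys that appear in both.
merged = with_apostrophe + without_apostrophe

print(merged["cupboard"])
print(merged["table"])
# The root and morphological fields cannot be carried along in this structure.
```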

I am open to all your ideas or reframing of the problem, since my goal is to have this program running once only to obtain this merged list in a text file, thank you in advance!

hajoki
  • Can't you simply replace all the apostrophes by an empty string to remove them? Like so: `word_form = items[0].replace("'", "")` – Sven Rusch Dec 05 '16 at 12:20
  • But then I'll have two lines with the same word and these frequencies won't be added, right? – hajoki Dec 05 '16 at 12:40
  • Are there at most two lines that may be combined for a given word, or possibly more? Are the ones that need to be combined necessarily next to each other? If two lines are to be combined, is everything else (besides the counts) guaranteed to be the same? – Iluvatar Dec 05 '16 at 12:53
  • Yes there are at most two lines that may be combined for a given word, only a version with apostrophe, and a version without. But no, the ones to be combined are not necessarily next to each other. And no, if two lines are combined, the 3rd column is actually different but ideally the one from the line without apostrophe should be conserved (as shown in the example) – hajoki Dec 05 '16 at 13:22
  • Oh, one more thing, are there any apostrophes in places other than the first word? (i.e. would it be okay to start by just replace them all with the empty string as Sven said) – Iluvatar Dec 05 '16 at 13:42
  • No, apostrophes are only in the first column. Thanks for your interest in the problem – hajoki Dec 05 '16 at 13:50
  • I assume you're not particularly tied to Python, and that this is a one time thing. If this next part works right, I'll post an answer to finish it, but I want to try removing apostrophes and then sorting the file to make things easier. First do `sed "s/'//" filename >newfile`, then `sort newfile >newfile2`. newfile2 contains the sorted words (you can remove newfile), and hopefully it doesn't take too long to finish :) – Iluvatar Dec 05 '16 at 13:56
  • Sorry for the stupid question, but you mean executing these commands in the console? – hajoki Dec 05 '16 at 13:59
  • Right... sorry idk why I assumed you were using bash. If you are, then yes in console/terminal/whatever. If you're on a PC then... hold on a sec. – Iluvatar Dec 05 '16 at 14:00

1 Answer


Here's something that does more or less what you want. Just change the file names at the top. It doesn't modify the original file.

input_file_name = "input.txt"
output_file_name = "output.txt"

def custom_comp(s1, s2):
    word1 = s1.split()[0]
    word2 = s2.split()[0]
    stripped1 = word1.translate(None, "'")
    stripped2 = word2.translate(None, "'")

    if stripped1 > stripped2:
        return 1
    elif stripped1 < stripped2:
        return -1
    else:
        if "'" in word1:
            return -1
        else:
            return 1

def get_word(line):
    return line.split()[0].translate(None, "'")

def get_num(line):
    return int(line.split()[-1])

print "Reading file and sorting..."

lines = []
with open(input_file_name, 'r') as f:
    for line in sorted(f, cmp=custom_comp):
        lines.append(line)

print "File read and sorted"

combined_lines = []

print "Combining entries..."

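i = 0
while i < len(lines) - 1:
    if get_word(lines[i]) == get_word(lines[i+1]):
        # Same word: keep the second line (the one without apostrophe) and sum the counts
        total = get_num(lines[i]) + get_num(lines[i+1])
        new_parts = lines[i+1].split()
        new_parts[-1] = str(total)
        combined_lines.append(" ".join(new_parts))
        i += 2
    else:
        combined_lines.append(lines[i].strip())
        i += 1

# The loop stops one short of the end, so keep the last line if it was not merged
if i == len(lines) - 1:
    combined_lines.append(lines[i].strip())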

print "Entries combined"
print "Writing to file..."

with open(output_file_name, 'w+') as f:
    for line in combined_lines:
        f.write(line + "\n")

print "Finished"

It sorts the words and messes up the spacing a bit. If that's important, let me know and it can be adjusted.
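On the spacing, here is a minimal sketch of one possible adjustment, assuming the columns are tab-separated as in the question's code. `with_new_count` is a hypothetical helper, not part of the script above; it could stand in for the `split()` and `" ".join(...)` pair so the tabs survive:

```python
def with_new_count(line, total):
    """Return the line with its last tab-separated field replaced by total."""
    parts = line.rstrip("\n").split("\t")
    parts[-1] = str(total)
    # Joining on tabs keeps the original column separator.
    return "\t".join(parts)

print(with_new_count("cupboard\tcup\tblabla2\t10\n", 22))
```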

Another thing is that it sorts the whole thing. For only a million lines, that probably won't take overly long, but again, let me know if that's an issue.
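If the sort does turn out to be an issue, a single pass with one dictionary keyed on the apostrophe-free word avoids it. As a side note on the question's code, in Python 2 `word in word_dict.keys()` builds a list and scans it on every lookup, whereas `word in word_dict` is a hash lookup. The following is only a sketch under assumptions: tab-separated columns, exactly four fields per line, and the fields from the line without apostrophe preferred when both exist. `merge_lines` is a hypothetical name, and the code is written so that it should run on both Python 2.7 and 3:

```python
def merge_lines(lines):
    """Merge entries whose word forms differ only by apostrophes.

    Assumes tab-separated columns: word_form, root, morph, frequency.
    Returns a dict mapping the stripped word form to
    [root, morph, freq, plain], where plain is True once a line
    without an apostrophe has been seen for that word.
    """
    merged = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, root, morph, freq = line.split("\t")
        key = word.replace("'", "")
        is_plain = (key == word)
        entry = merged.get(key)
        if entry is None:
            merged[key] = [root, morph, int(freq), is_plain]
        else:
            entry[2] += int(freq)
            # Prefer the fields of the apostrophe-free line when one shows up.
            if is_plain and not entry[3]:
                entry[0], entry[1], entry[3] = root, morph, True
    return merged

# Made-up sample lines in the question's format.
sample = [
    "cup'board\tcup\tblabla\t12\n",
    "cupboard\tcup\tblabla2\t10\n",
    "ta'ble\ttable\tfoo\t3\n",
]
result = merge_lines(sample)
for key in sorted(result):
    root, morph, freq, _ = result[key]
    print("\t".join([key, root, morph, str(freq)]))
```

Because every line is folded into its key as it is read, this handles any number of lines per word and also writes unmatched apostrophe entries without the apostrophe. To use it on a file, pass the open file object instead of `sample`.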

Iluvatar
  • Thanks a lot for your answer, which runs in less than one minute! I modified it a bit so that an entry is also written without its apostrophe even when there is no other entry to merge it with, and I realised I have to run the program several times because in some cases there are more than two lines to merge (my bad, I didn't know there were), but having a program that finishes changes everything! – hajoki Dec 06 '16 at 15:46