Group strings with values in Python

Question

I'm working on Twitter hashtags and I've already counted the number of times each one appears in my CSV file. My CSV file looks like this:

GilletsJaunes, 100
Macron, 50
gilletsjaune, 20
tax, 10

Now I would like to group together two terms that are close, such as "GilletsJaunes" and "gilletsjaune", using the fuzzywuzzy library. If the similarity score between the two terms is greater than 80, their values should be added together under one of the two terms and the other term deleted. This would give:

GilletsJaunes, 120
Macron, 50
tax, 10

To use fuzzywuzzy:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzz.ratio("GiletsJaunes", "giletsjaune")
82 #output
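
To get a feel for how such a score reacts to letter case, here is a minimal sketch that uses only the standard library. The `similarity` helper is a hypothetical stand-in for `fuzz.ratio`, built on `difflib.SequenceMatcher`; its scores may not match fuzzywuzzy's exactly, so treat the numbers as illustrative.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Hypothetical stand-in for fuzz.ratio: a 0-100 similarity score."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# The comparison is case-sensitive: 'G' != 'g' and 'J' != 'j'.
print(similarity("GilletsJaunes", "gilletsjaune"))                  # 80
# Lowercasing both sides first leaves only the trailing 's' as a difference.
print(similarity("GilletsJaunes".lower(), "gilletsjaune".lower()))  # 96
# Strings that differ only in case share no characters at all.
print(similarity("HELLO", "hello"))                                 # 0
```

With this stand-in, the raw pair scores exactly 80, which would not pass a strict "greater than 80" test, so lowercasing before comparing looks worthwhile.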

What have you tried so far? Please show your attempt so that we can help you correct it. — Soviut, Mar 07 '19 at 22:28

Wok · Answer 1 · 2019-04-27T10:58:08.557

First, copy these two functions to be able to compute the argmax:

# given an iterable of pairs return the key corresponding to the greatest value
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]


# given an iterable of values return the index of the greatest value
def argmax_index(values):
    return argmax(enumerate(values))
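
As a quick check of what the two helpers return, here is a small usage sketch. The sample scores are made up for illustration.

```python
# given an iterable of pairs return the key corresponding to the greatest value
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]


# given an iterable of values return the index of the greatest value
def argmax_index(values):
    return argmax(enumerate(values))


print(argmax([('a', 3), ('b', 7)]))  # b
print(argmax_index([21, 96, 25]))    # 1
print(argmax_index([5, 9, 9]))       # 1 (on ties, max keeps the first maximum)
```

Note that `max` raises `ValueError` on an empty iterable, so these helpers should only be called when there is at least one score.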

Second, load the content of your CSV into a Python dictionary and proceed as follows:

from fuzzywuzzy import fuzz

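# named "counts" rather than "input" to avoid shadowing the built-in input()
counts = {
    'GilletsJaunes': 100,
    'Macron': 50,
    'gilletsjaune': 20,
    'tax': 10,
}

threshold = 50  # note: lower than the 80 used in the question

output = dict()
for query in counts:
    references = list(output.keys()) # important: this is output.keys(), not counts.keys()!
    scores = [fuzz.ratio(query, ref) for ref in references]
    if any(s > threshold for s in scores):
        # fold the count into the most similar tag that was already kept
        best_reference = references[argmax_index(scores)]
        output[best_reference] += counts[query]
    else:
        # no close match so far: keep this tag as a new reference
        output[query] = counts[query]

print(output)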

{'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}
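
The same greedy idea can be wrapped in a function. The sketch below is one possible variant, not the code above: it visits tags from most to least frequent, so that the most common spelling is the one that is kept, and it compares in lowercase. `similarity` and `merge_similar` are hypothetical names, and `similarity` is a `difflib`-based stand-in for `fuzz.ratio` so the example runs without extra packages.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for fuzz.ratio (assumption): case-insensitive 0-100 score.
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


def merge_similar(counts, threshold=80):
    """Greedily fold each tag into the most similar tag already kept."""
    merged = {}
    # Visit the most frequent tags first so they become the kept spelling.
    for tag, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        best = max(merged, key=lambda ref: similarity(tag, ref), default=None)
        if best is not None and similarity(tag, best) > threshold:
            merged[best] += count
        else:
            merged[tag] = count
    return merged


counts = {'gilletsjaune': 20, 'tax': 10, 'GilletsJaunes': 100, 'Macron': 50}
print(merge_similar(counts))
# {'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}
```

Sorting by count makes the result independent of the row order in the file, at the cost of one extra sort.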

score 0 · Answer 2 · answered Mar 07 '19 at 22:51

This should solve your problem. You can reduce your input sample by first converting your tags to lowercase. As far as I can tell, fuzz.ratio is case-sensitive, so "HeLlO", "hello" and "HELLO" only score as the same word once they are lowercased, even though they clearly represent the same word.

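import csv
from fuzzywuzzy import fuzz

data = dict()
output = dict()

# read "tag, count" rows; the counts must be converted from str to int
with open('file.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile, skipinitialspace=True)
    for row in csvReader:
        data[row[0]] = int(row[1])

merged = set()  # tags whose count has already been added to some tag
for tag in data:
    if tag in merged:
        continue
    output[tag] = 0
    for key in data:
        # compare in lowercase so that case differences do not lower the score
        if key not in merged and fuzz.ratio(tag.lower(), key.lower()) > 80:
            output[tag] += data[key]
            merged.add(key)

print(output)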
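
For the loading step on its own, here is a minimal sketch that reads the counts as integers. It assumes the "tag, count" layout from the question and uses an in-memory string in place of the real file; `read_counts` is a hypothetical helper name.

```python
import csv
import io

# Stand-in for the CSV file from the question (assumption: "tag, count" rows).
raw = "GilletsJaunes, 100\nMacron, 50\ngilletsjaune, 20\ntax, 10\n"


def read_counts(handle):
    """Return {tag: count} with the counts as integers."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(handle, skipinitialspace=True)
    counts = {}
    for row in reader:
        if row:  # skip blank lines
            # add up the counts if a tag appears on several rows
            counts[row[0]] = counts.get(row[0], 0) + int(row[1])
    return counts


counts = read_counts(io.StringIO(raw))
print(counts)
# {'GilletsJaunes': 100, 'Macron': 50, 'gilletsjaune': 20, 'tax': 10}
```

With a real file, the same function can be called as `read_counts(open('file.csv', newline=''))`, preferably inside a `with` block.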
