2

I am trying to work on string matching with python. I have .csv data like:

Item_ID»Item_Name
101»tomato
102»tomatos
103»tomatoes
104»tomato

(The »s represent tab \t characters.)

I want output to be like :

Item_ID»Item_ID1»Similarity
101»102»92
101»103»73
101»104»100
102»101»92
102»103»85
102»104»92
104»101»100
104»102»92
104»103»73

The Item_Name can be same but it has different ID, The similarity should be based on Item_name but to be calculated with each Item_ID. Can you help me with the python code for this?

jpp
  • 159,742
  • 34
  • 281
  • 339
  • That is just for an example. Sorry for not mentioning that. – Rishab Oberoi Mar 31 '18 at 18:15
  • Similarity should be based on the value of Item_name to Item_name. Eg: "Tomato" (101) can be 92% similar to the string "Tomatoes" (102) – Rishab Oberoi Mar 31 '18 at 18:19
  • OK, but how are we meant to *calculate* the value 92%. If we cannot calculate the value, we cannot create a column containing that value. Please update the desired output to something that we can actually calculate through *some replicable logic*. – jpp Mar 31 '18 at 18:21
  • somewhat like this : https://stackoverflow.com/questions/36802453/comparing-two-columns-of-a-csv-and-outputting-string-similarity-ratio-in-another – Rishab Oberoi Mar 31 '18 at 18:22
  • So my solution below uses `difflib.SequenceMatcher`. Hope it helps. – jpp Mar 31 '18 at 18:37

2 Answers2

1

You can read the input into a list of dicts, as in my example.

Then, you can use itertools.combinations(data, 2) to get all the pairs.

import itertools
import random

def get_similarity(obj1, obj2):
    # your own get_similarity function using obj1['name'], obj2['id'], etc
    # here I'm just returning random number
    return random.randint(1, 100)

data = [
    {'id': 101, 'name': 'tomato'},
    {'id': 102, 'name': 'tomatos'},
    {'id': 103, 'name': 'tomatoes'},
    {'id': 104, 'name': 'tomato'},
]

print('Item_ID1', '\t', 'Item_ID2', '\t', 'Similarity')
for obj1, obj2 in itertools.combinations(data, 2):
    similarity = get_similarity(obj1, obj2)
    print(obj1['id'], '\t', obj2['id'], '\t', similarity)

This outputs

Item_ID1     Item_ID2    Similarity
101      102     89
101      103     83
101      104     75
102      103     9
102      104     3
103      104     86

In your sample output, you are repeating the same pair twice (for example, (101, 104) and (104, 101).

If this was intended, you can simply print the same pair twice with the order of objects swapped:

for obj1, obj2 in itertools.combinations(data, 2):
    similarity = get_similarity(obj1, obj2)
    print(obj1['id'], '\t', obj2['id'], '\t', similarity)
    print(obj2['id'], '\t', obj1['id'], '\t', similarity)
Eric
  • 2,635
  • 6
  • 26
  • 66
  • Thanks, Can it be done on .CSV file as well? – Rishab Oberoi Mar 31 '18 at 18:25
  • Also, is it creating random integer or is actually calculating string similarity with any algorithm? – Rishab Oberoi Mar 31 '18 at 18:27
  • @RishabOberoi Yes! You can do it by creating a list, constructing a `dict` with fields 'id' and 'name' for CSV each row, and push the `dict` into the list when you are reading the csv file. I think it would be a great exercise if you started learning Python. I just used random integers, but you should replace it with your similarity calculation logic. – Eric Mar 31 '18 at 18:30
  • 1
    Okay, Thanks for the help Eric. I will try this solution suggested by you. – Rishab Oberoi Mar 31 '18 at 18:32
1

Here is how I would implement your logic via pandas.

import pandas as pd
from difflib import SequenceMatcher
from io import StringIO
from itertools import combinations

mystr = StringIO("""Item_ID Item_Name
101 tomato
102 tomatos
103 tomatoes
104 tomato""")

# replace mystr with 'input.csv'
d = pd.read_csv(mystr, delim_whitespace=True).set_index('Item_ID')['Item_Name'].to_dict()

# create dictionary of results
d_out = {idx: [i, j, SequenceMatcher(None, d[i], d[j]).ratio()] \
              for idx, (i, j) in enumerate(combinations(d, 2))}

# create dataframe from dictionary
res = pd.DataFrame.from_dict(d_out, orient='index')

# rename columns
res.columns = ['Item_ID', 'Item_ID1', 'Similarity']

# output to csv
res.to_csv('result.csv', index=False)

Result:

   Item_ID  Item_ID1  Similarity
0      101       102    0.923077
1      101       103    0.857143
2      101       104    1.000000
3      102       103    0.933333
4      102       104    0.923077
5      103       104    0.857143
jpp
  • 159,742
  • 34
  • 281
  • 339