Compare each element of CSV file to every element of a different CSV file, and find the most similar elements

Question

I have two CSV files which I need to compare. The first one is called SAP.csv, and the second is SAPH.csv.

SAP.csv has these cells:

Notification    Description
5000000001      Detailed Inspection of Masts (2100mm) (3
5000000002      Ceremonial Awnings-Survey and Load Test
5000000003      HPA-Carry out 4000 hour service routine
5000000004      UxE 8 in Number Temperature Probs for C
5000000005      Overhaul valves

...while, SAPH.csv has these cells:

Notification   Description
4000000015     Detailed Inspection of Masts (2100mm) (3
4000000016     Ceremonial Awnings-Survey and Load Test
4000000017     HPA-Carry out 8000 hour service routine
4000000018     UxE 8 in Number Temperature Probs for C
4000000019     Represerve valves
4000000020     STW System

They are similar, but some lines, like the fourth, (HPA-Carry out 4000 hour service routine vs. HPA-Carry out 8000 hour service routine), are slightly different.

I want to compare each value of SAP.csv against every value of SAPH.csv, and, using cosine similarity, find the most similar lines, so that the output would look something like this (the similarity percentages here are just examples, not what they would actually be):

Description
Detailed Inspection of Masts (2100mm) (3 - 100%
Ceremonial Awnings-Survey and Load Test  - 100%
HPA-Carry out 4000 hour service routine  - 85%
UxE 8 in Number Temperature Probs for C  - 90%
Overhaul valves                          - 0%

Post answer edit

runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')
Traceback (most recent call last):
  File "", line 1, in
    runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 31, in
    similarity_score = similar(job, description) # Get their similarity
  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 14, in similar
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 173, in distance
    return self.maximum(*sequences) - self.similarity(*sequences)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 176, in similarity
    return self(*sequences)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\token_based.py", line 175, in __call__
    return intersection / pow(prod, 1.0 / len(sequences))
ZeroDivisionError: float division by zero

2nd edit, following the solution to the above

The original request had just two outputs: Description and Similarity score. Description comes from SAP; Similarity comes from the textdistance calculation.

Can the solution be amended to output the following columns?

Notification (the 10-digit number from the SAP file)
Description (as it currently is)
Similarity (as it currently is)
Notification (the number from the SAPH file, for the row that produced the similarity score)

So an example output row would look like this:

80000115360 Additional Materials FWD Rope Guard 86.24% 7123456789

These would be columns A, B, C and D: A and B come from SAP, C is calculated, and D comes from SAPH.

Edit 3

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in
    SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'})
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 490, in pandas._libs.parsers.TextReader.__cinit__
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))
TypeError: data type 'string' not understood

Post edit 4 - 25/10/20

Hi, so getting the same error as before I think


  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in
    SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2421, in read
    data = self._convert_data(data)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2487, in _convert_data
    clean_conv, clean_dtypes)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1705, in _convert_to_ndarrays
    cvals = self._cast_types(cvals, cast_type, c)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1808, in _cast_types
    copy=True, skipna=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 623, in astype_nansafe
    dtype = pandas_dtype(dtype)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))
TypeError: data type 'string' not understood

I picked up on your bit about the delimiter so I uploaded a csv file to repl.it and it looks as though "," is the delimiter.

Therefore have altered the code to suit. When I did that on repl.it it worked.

This is the code I am using

import textdistance
import pandas as pd

def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
  similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
  return similarity * 100

# Read the CSVs
SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")
SAPH = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP_History.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")

# Create a pandas dataframe to store the output. The column 'Description' is populated with the values of SAP['Description']
scores = pd.DataFrame(SAP['Description'], columns = ['Notification (SAP)','Description', 'Similarity', 'Notification (SAPH)'])

# Temporary variable to store the highest similarity score
highest_score = 0
desc = 0

# Iterate through SAP['Description']
for job in SAP['Description']:
  highest_score = 0 # Reset highest_score in each iteration
  for description in SAPH['Description']: # Iterate through SAPH['Description']
    similarity_score = similar(job, description) # Get their similarity

    if(similarity_score > highest_score): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values
      highest_score = similarity_score
      desc = str(description)
    if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.
      break
  # Update the dataframe 'scores' with highest_score and other values
  print(SAPH['Description'][SAPH['Description'] == desc])
  scores['Notification (SAP)'][scores['Description'] == job] = SAP['Notification'][SAP['Description'] == job]
  scores['Similarity'][scores['Description'] == job] = f'{highest_score}%'
  scores['Notification (SAPH)'][scores['Description'] == job] = SAPH['Notification'][SAPH['Description'] == desc]

print(scores)

# Output it to Scores.csv without the index column
with open('./Scores.csv', 'w') as file:
  file.write(scores.__repr__())

Which is being run on Spyder (Python 3.7)

Please provide sample data for each csv so we can test. Thanks. — Mike67, Sep 11 '20 at 19:47
Just update your post and paste in 10 sample rows from each file (including headers) — Mike67, Sep 12 '20 at 05:57
Apologies Mike, when I go to paste it adds a picture which the server rejects. Obviously something really simple I am missing — Andy Stillwell, Sep 12 '20 at 08:10
Done, but I have had to press return between rows, as when I put them straight underneath one another they show as a continuous string — Andy Stillwell, Sep 12 '20 at 17:41
No problem. Paste without extra returns. You can click the code format button `{}` or just paste the data and let someone else do the format. — Mike67, Sep 12 '20 at 17:43
@Mike67 _ugh_, why would you encourage OP to _let someone else do the format_ and not do it yourself instead of giving OP the [link to the help?](/help/formatting) — Pranav Hosangadi, Sep 12 '20 at 18:59
How did you get the numbers you get? `HPA-Carry out 4000 hour service routine` and `UxE 8 in Number Temperature Probs for C` both have a 100% match in SAPH.csv if I've understood correctly that you want to match lines for similarity. — Pranav Hosangadi, Sep 12 '20 at 19:06
Hi @PranavHosangadi, these are essentially work orders. I am trying to eventually come up with a list of which work orders we have carried out on previous projects and how similar they are. So the two you mentioned are 100% done before, so we can look up actuals on them. I am trying a very small sample set, as the full version will have 1,000s of lines. I am then looking to tabulate the cosine percentages in 5% steps (0-5%, 6-10%, 11-15%, etc.) with counts of how similar the work is — Andy Stillwell, Sep 12 '20 at 19:26
I have made a c&p error the 4000 hour service should have been 8000 in SAPH which I have now changed — Andy Stillwell, Sep 12 '20 at 19:27
@Pranav Hosangadi - Yes - I normally would ask the OP to format the answer, but if it's a problem, I figured I can easily do it. When I said 'someone else', I meant me (unless someone gets to it first) — Mike67, Sep 12 '20 at 20:30

marsnebulasoup · Accepted Answer · 2020-10-23T16:31:43.027

2

@George_Pipas's answer to this question demonstrates an example using the library textdistance (I'm paraphrasing part of his answer here):

A solution is to work with the textdistance library. I will provide an example of Cosine Similarity
import textdistance
1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')
and we get:
0.5

So, we can create a similarity finding function:

def similar(a, b):
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b)     
    return similarity

This outputs a number between 0 and 1: closer to 1 when a and b are more similar, and closer to 0 when they are less similar. So if a == b, the output will be 1, but if a != b, the output will generally be less than 1.
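To make the number less of a black box, here is a minimal standard-library sketch of a bigram cosine similarity. It is my reading of what `textdistance.Cosine(qval=2)` computes (shared two-character chunks divided by the square root of the product of the chunk counts), not the library's actual code; the names `bigrams` and `bigram_cosine` are made up for illustration. It does reproduce the 0.5 from the Apple/Appel example above.

```python
from collections import Counter
from math import sqrt


def bigrams(text):
    """Return a multiset of overlapping two-character substrings."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def bigram_cosine(a, b):
    """Cosine similarity over character bigrams, in the range 0..1."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    # Multiset intersection: a shared bigram counts as often as it
    # appears in both strings.
    shared = sum((grams_a & grams_b).values())
    total_a, total_b = sum(grams_a.values()), sum(grams_b.values())
    if total_a == 0 or total_b == 0:
        # Strings shorter than two characters have no bigrams at all.
        return 0.0
    return shared / sqrt(total_a * total_b)


print(bigram_cosine('Apple', 'Appel'))  # 2 shared bigrams / sqrt(4 * 4)
```

Returning 0.0 for strings without bigrams is a choice made in this sketch only; it sidesteps a division by zero for inputs that are too short to compare.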

To get percentages, you just need to multiply the output by 100. Like this:

def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b) 
    return similarity * 100

CSV files can be read pretty easily with pandas:

import pandas as pd

# Read the CSVs
SAP = pd.read_csv('SAP.csv')
SAPH = pd.read_csv('SAPH.csv')

We create another pandas dataframe to store the results we'll compute in:

# Create a pandas dataframe to store the output. The column 'SAP' is populated with the values of SAP['Description']
scores = pd.DataFrame({'SAP': SAP['Description']}, columns = ['SAP', 'SAPH', 'Similarity'])

Now, we iterate through SAP['Description'] and SAPH['Description'], compare each element against each other element, compute their similarity, and save the highest to scores.

# Temporary variable to store both the highest similarity score, and the 'SAPH' value the score was computed with
highest_score = {"score": 0, "description": ""}

# Iterate through SAP['Description']
for job in SAP['Description']:
  highest_score = {"score": 0, "description": ""} # Reset highest_score at each iteration
  for description in SAPH['Description']: # Iterate through SAPH['Description']
    similarity_score = similar(job, description) # Get their similarity

    if(similarity_score > highest_score['score']): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values
      highest_score['score'] = similarity_score
      highest_score['description'] = description
    if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.
      break
  # Update the dataframe 'scores' with highest_score. Using .loc avoids chained assignment, which pandas warns about and which may not write through.
  scores.loc[scores['SAP'] == job, 'SAPH'] = highest_score['description']
  scores.loc[scores['SAP'] == job, 'Similarity'] = highest_score['score']

Here's a breakdown:

A temporary variable, highest_score, is created to store the highest computed score and the SAPH description it was computed with.
We iterate through SAP['Description'] and, within that loop, iterate through SAPH['Description']. This lets us compare each value of SAP['Description'] (job) to every value of SAPH['Description'] (description).
While iterating through SAPH['Description'], we:
1. Compute the similarity score of job and description.
2. If it's higher than the saved score in highest_score, update highest_score accordingly; otherwise continue.
3. If similarity_score is equal to 100, we know that it's a perfect match and don't have to keep looking, so we break out of the loop.
Outside of the SAPH['Description'] loop, now that we've compared job to each element of SAPH['Description'] (or found a perfect match), we save the values to scores.

This repeats for every element of SAP['Description'].
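The search described above can also be written as a small reusable function. This is only a sketch: `best_match` and `ratio` are hypothetical names, and `ratio` uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer so the example runs without extra packages. It is not cosine similarity; any function that returns 1.0 for a perfect match can be passed in instead.

```python
from difflib import SequenceMatcher


def best_match(job, candidates, score):
    """Return (best_candidate, best_score) for job; stop early on a perfect 1.0."""
    best, best_score = None, -1.0
    for candidate in candidates:
        current = score(job, candidate)
        if current > best_score:
            # Strictly greater, so the first of several equal scores wins.
            best, best_score = candidate, current
        if current == 1.0:
            # Perfect match: nothing later in the list can beat it.
            break
    return best, best_score


def ratio(a, b):
    # Stand-in scorer from the standard library, NOT cosine similarity.
    return SequenceMatcher(None, a, b).ratio()


saph = ['Represerve valves', 'STW System', 'Overhaul valves']
print(best_match('Overhaul valves', saph, ratio))
```

Unlike the loop above, this version starts from a score of -1.0, so a job whose best score is 0 still reports the first candidate rather than an empty description. Whether that is preferable depends on how zero-similarity rows should appear in the output.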

Here's what scores looks like when it's finished:

                                        SAP                                      SAPH Similarity
0  Detailed Inspection of Masts (2100mm) (3  Detailed Inspection of Masts (2100mm) (3        100
1   Ceremonial Awnings-Survey and Load Test   Ceremonial Awnings-Survey and Load Test        100
2   HPA-Carry out 4000 hour service routine   HPA-Carry out 8000 hour service routine    94.7368
3   UxE 8 in Number Temperature Probs for C   UxE 8 in Number Temperature Probs for C        100
4                           Overhaul valves                         Represerve valves    53.4522

And after outputting it to a CSV file with this:

# Output it to Scores.csv without the index column (0, 1, 2, 3... far left in scores above). Remove index=False if you want to keep the index column.
scores.to_csv('Scores.csv', index=False)

...Scores.csv looks like this:

SAP,SAPH,Similarity
Detailed Inspection of Masts (2100mm) (3,Detailed Inspection of Masts (2100mm) (3,100
Ceremonial Awnings-Survey and Load Test,Ceremonial Awnings-Survey and Load Test,100
HPA-Carry out 4000 hour service routine,HPA-Carry out 8000 hour service routine,94.73684210526315
UxE 8 in Number Temperature Probs for C,UxE 8 in Number Temperature Probs for C,100
Overhaul valves,Represerve valves,53.45224838248488
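The question's second edit asks for four columns: the SAP notification, the description, the similarity, and the notification of the best-matching SAPH row. Below is a minimal standard-library sketch of that layout, assuming comma-delimited files with the headers Notification and Description. The in-memory CSV text, the helper names, and the two-decimal formatting are all illustrative assumptions, and the scorer is a hand-written bigram cosine rather than a call into textdistance.

```python
import csv
import io
from collections import Counter
from math import sqrt

# Hypothetical in-memory stand-ins for SAP.csv and SAPH.csv (comma-delimited).
SAP_CSV = """Notification,Description
5000000003,HPA-Carry out 4000 hour service routine
5000000005,Overhaul valves
"""
SAPH_CSV = """Notification,Description
4000000017,HPA-Carry out 8000 hour service routine
4000000019,Represerve valves
4000000020,STW System
"""


def bigrams(text):
    """Multiset of overlapping two-character substrings."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(grams_a, grams_b):
    total = sum(grams_a.values()) * sum(grams_b.values())
    if total == 0:  # a description with fewer than two characters
        return 0.0
    return sum((grams_a & grams_b).values()) / sqrt(total)


def match_rows(sap_file, saph_file):
    """Yield (SAP notification, description, similarity %, SAPH notification)."""
    # csv.DictReader returns every field as str, so the notification
    # numbers stay text and are never converted to numbers.
    history = [(row['Notification'], bigrams(row['Description']))
               for row in csv.DictReader(saph_file)]
    for row in csv.DictReader(sap_file):
        grams = bigrams(row['Description'])
        # Assumes the history file has at least one data row.
        notification, score = max(
            ((number, cosine(grams, other)) for number, other in history),
            key=lambda pair: pair[1],
        )
        yield (row['Notification'], row['Description'],
               f'{score * 100:.2f}%', notification)


rows = list(match_rows(io.StringIO(SAP_CSV), io.StringIO(SAPH_CSV)))
out = io.StringIO()  # replace with open('Scores.csv', 'w', newline='') for a real file
writer = csv.writer(out)
writer.writerow(['Notification (SAP)', 'Description', 'Similarity', 'Notification (SAPH)'])
writer.writerows(rows)
print(out.getvalue())
```

The bigrams of every SAPH description are computed once, before the outer loop, rather than once per comparison. For inputs in the tens of thousands of rows that avoids a lot of repeated work, although the number of comparisons is still the product of the two row counts.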

View the full code, and run and edit it online

Note that textdistance and pandas are required libraries for this. Install them, if you don't have them already, with:

pip install textdistance pandas

Notes:

You can round the percent by replacing f'{highest_score}%' with this: f'{round(highest_score, NUMBER_OF_PLACES_TO_ROUND_TO)}%'
Here's a formatted version, and here's the code

EDIT: (for the problems encountered that are mentioned in the comments)

Here is an error-catching version of the similarity function:

def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
  try: 
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b) 
    return similarity * 100
  except ZeroDivisionError:
    print('There was an error. Here are the values of a and b that were passed')
    print(f'a: {repr(a)}')
    print(f'b: {repr(b)}')
    exit()
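According to the comments, the rows that triggered the error had "3" and "." as their descriptions. A plausible explanation, based on the traceback, is that a one-character string contains no two-character chunks, so the denominator of the cosine formula becomes zero. Under that assumption, the offending rows can be listed up front instead of stopping at the first failure. The function name `unscorable` is hypothetical.

```python
def unscorable(descriptions, qval=2):
    """Return (row_index, value) pairs too short to contain a qval-gram."""
    return [
        (index, value)
        for index, value in enumerate(descriptions)
        if len(str(value)) < qval
    ]


sample = ['Overhaul valves', '3', 'STW System', '.', '']
print(unscorable(sample))
```

The same check could be run on both SAP['Description'] and SAPH['Description'] before the main loop, and the flagged rows either cleaned up in the source files or skipped.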

edited Oct 23 '20 at 16:31

answered Sep 15 '20 at 00:53

marsnebulasoup


Hi marsnebula, is there any way to get the output CSV to be in two columns, Description and Similarity? Thanks – Andy Stillwell Sep 18 '20 at 14:02
[The last link I provided](https://repl.it/@marsnebulasoup/JubilantExcitableApplicationserver-3#main.py) does just that. – marsnebulasoup Sep 18 '20 at 14:07
Do you want it to be separated by commas, not formatted? Because this returns a formatted version. – marsnebulasoup Sep 18 '20 at 14:07
@Andy_Stillwell - Like [this](https://repl.it/@marsnebulasoup/JubilantExcitableApplicationserver-4#Scores.csv)? [Code](https://repl.it/@marsnebulasoup/JubilantExcitableApplicationserver-4#main.py) – marsnebulasoup Sep 18 '20 at 14:12
Hi, when its exported to csv all the data is contained in column A. Is there any way for Column A to have the description and Column B to have the similarity score? – Andy Stillwell Sep 18 '20 at 14:36
But there are two columns outputted by the code in the link above: Description and Similarity. – marsnebulasoup Sep 18 '20 at 14:46
I know, however when I load up the csv file, the data is just on one string under column A – Andy Stillwell Sep 18 '20 at 14:53
It *is* valid so there shouldn't be a problem – marsnebulasoup Sep 18 '20 at 14:55
I am just loading the csv file up in excel and its all in column A. Is there another way how I should be reading it? – Andy Stillwell Sep 18 '20 at 14:56
Let me try that. Give me a min – marsnebulasoup Sep 18 '20 at 14:57
Looks [fine to me](https://i.gyazo.com/476251a95b36a28bb659b6163c163a25.png). Are you copying and pasting the data into excel? If you download the [CSV file](https://repl.it/@marsnebulasoup/JubilantExcitableApplicationserver-4#Scores.csv), and open it with excel it seems to work fine. – marsnebulasoup Sep 18 '20 at 15:00
The program creates a csv file? I am just loading it up and its all in column A? – Andy Stillwell Sep 18 '20 at 15:03
Yes it creates a CSV file, but if you want an excel output, it can create an excel file instead, like [this](https://repl.it/@marsnebulasoup/JubilantExcitableApplicationserver-5#main.py) although you will have to download it to view it. – marsnebulasoup Sep 18 '20 at 15:06
Hi @marsnebulasoup, its me again. Finally got access through the work firewall and downloaded textdistance into python. It worked fine on a 2,300 row data set, however I have just tried to put 13,000 rows through it and an error has come up. – Andy Stillwell Oct 23 '20 at 10:09
File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 176, in similarity return self(*sequences) File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\token_based.py", line 175, in __call__ return intersection / pow(prod, 1.0 / len(sequences)) ZeroDivisionError: float division by zero – Andy Stillwell Oct 23 '20 at 10:12
If it works fine for one set of CSV files, but not the other, it'd seem like the problem is in the CSVs' content. Perhaps there are some empty rows or something in the CSV data that is screwing the similarity comparison up...are you able to share the CSV files? – marsnebulasoup Oct 23 '20 at 12:42
Also, maybe it could be a formatting issue. Both SAP and SAPH need to have one column with the header 'Description' for pandas to read it properly. One more thing...it doesn't seem like what you've posted is the full stack trace; it doesn't say on what line *in your program* does this fail. Can you edit the question to include the full message or send a link to a pastebin or something? Comments are limited to 500 or so characters, I believe, so there might not be enough room for the full message in a comment. – marsnebulasoup Oct 23 '20 at 12:46
Hi, I have included the full error trace on the original post. I cannot get the csv files out but both are formatted the same way. Notification and Description being the headers. – Andy Stillwell Oct 23 '20 at 16:06
So I am thinking that this could be a problem with the headers. If you change the `Notification` header in the CSV file to `Description`, does it work? That probably isn't the problem, though, because it would most likely throw a KeyError, not a ZeroDivisionError. You need to find out what is being passed to the similarity function that raises this error. I've edited the above answer that I posted to include a similarity function with error catching. If you replace your similarity function with that one, you can see what is being passed to the function that throws the error... – marsnebulasoup Oct 23 '20 at 16:23
...i.e, what from the CSV file is messing the code up. Alternatively, I would think that a solution would be to make sure the values passed to `textdistance` are strings, so you can replace this part of your similarity function: `.distance(a, b)` with `.distance(str(a), str(b))` – marsnebulasoup Oct 23 '20 at 16:26
Hi, that error check has worked, thank you very much. There were two rows that had 3 and . as the description, which it kept falling down on. It took my work computer around 5 hours to do a 17k vs 22k row comparison. Following that, I think I need two more bits of info, if you can help? On the Scores file I need the Notification number to sit alongside the Description, and, if I am interpreting how it works correctly, when it goes through the descriptions contained in SAPH it must go through the entire list and then select the one with the highest %? Could it also return in the scores sheet that – Andy Stillwell Oct 24 '20 at 12:12
notification number? So on the scores workbook there would be 4 columns: Notification (SAP), Description (SAP), Similarity (Calc), Notification (SAPH). Does that make sense? Thanks Andrew – Andy Stillwell Oct 24 '20 at 12:15
I am not sure what you mean by notification number. Do you mean the indexes (line numbers) of the lines compared? Eg. So if line 3 in SAP and line 5 in SAPH were a match, it would look like this in Scores.csv: `3, Line from SAP, 98%, 5` Is this what you are looking for? If not could you possibly edit your question to include an an example of your desired output - just a few lines of Scores.csv, and explain what each column is supposed to be? – marsnebulasoup Oct 24 '20 at 17:01
Have amended the original post. Hope that makes sense? – Andy Stillwell Oct 24 '20 at 17:30
@AndyStillwell - Something like [this](https://repl.it/@marsnebulasoup/JubilantExcitableApplicationserver-7#Scores.csv)? Here's the [code](https://repl.it/@marsnebulasoup/JubilantExcitableApplicationserver-7#main.py). It honestly took way longer than I expected but anyways...it should work for you, although I don't know how you want the output to be formatted. – marsnebulasoup Oct 24 '20 at 19:10
Hi, it comes up with an error. I have put the trace in under edit 3 above – Andy Stillwell Oct 24 '20 at 19:42
Okay, this is probably something to do with SAP.csv and SAPH.csv. Are they formatted like SAP.csv and SAPH.csv in the code I linked in the comment above? If not can you provide a sample of the first bit of the file so I can see where it differs from what I have in the code? – marsnebulasoup Oct 24 '20 at 22:14
The notification numbers are just in as a General format on the .csv files – Andy Stillwell Oct 25 '20 at 08:26
What do you mean by "general format"? I don't understand that. What I'm trying to do is tell if SAP.csv and SAPH.csv have the exact same structure as SAP.csv and SAPH.csv in [the repl.it link in my comment above](https://repl.it/@marsnebulasoup/JubilantExcitableApplicationserver-7#main.py). If they have a different structure, then the code will fail, so can you edit your post to include the first 10 or so lines of SAP.csv and SAPH.csv so I can change the code accordingly? Otherwise I'm just guessing what it looks like. – marsnebulasoup Oct 25 '20 at 16:01
Hi, I think I have updated the original post to include notifications, but when I paste it into Excel it doesn't fill two columns. – Andy Stillwell Oct 25 '20 at 16:16
It seems like each value of SAP and SAPH are separated by six spaces, so I changed the delimiter from the default (comma) to that. However, the headers in SAP and SAPH seem to be separated by three spaces only, so I copied the six whitespaces from a line below, and replaced those three spaces with the six, so all the fields are delimited identically. Alternatively, you could do a find and replace and replace those six spaces with commas, and it should work with the code you already have. But here is the modified code (and CSV files) that work: – marsnebulasoup Oct 25 '20 at 17:10
https://repl.it/@marsnebulasoup/JubilantExcitableApplicationserver-7#SAP.csv – marsnebulasoup Oct 25 '20 at 17:12
So, as I understand it, it runs fine on repl.it but not locally? Is it possible that you didn't copy the code correctly from repl.it to your computer, so that's why it's not working? I don't see why it can't work locally but works online...there's probably some differences between the code online and local. Other than that, you can try changing `dtype={'Notification':'string'}` in SAP and SAPH to `dtype={'Notification':'object'}`, but I'm not sure if that'll make a difference. – marsnebulasoup Oct 25 '20 at 23:05
Also when you edit your post, you can format code by surrounding it with triple backticks (```), or highlighting it, then hitting Ctrl-K or Command-K depending on your OS – marsnebulasoup Oct 25 '20 at 23:07
Hi, have spent a few hours trying to solve it but no luck. Could you try having two csv files with the data and trying to run it? I am still getting the d type error – Andy Stillwell Oct 26 '20 at 14:14
Okay I have tried it [with the space-delimited CSV files](https://repl.it/@marsnebulasoup/JubilantExcitableApplicationserver-7#main.py) and [the comma-delimited ones](https://repl.it/@marsnebulasoup/JubilantExcitableApplicationserver-8#main.py), and they work each time. Since we are using the same code, the only reason it can't be working is that there's something in the CSV files that is screwing things up. Perhaps it has something to do with that line that threw the ZeroDivisionError? Did you remove that line? I don't know how much I can actually help without seeing the CSV files... – marsnebulasoup Oct 26 '20 at 14:30
...but I can write a function that will check if the delimiter changes throughout the CSV files, because that could be a reason it's not working. – marsnebulasoup Oct 26 '20 at 14:31
