How to check whether strings from two files match, faster and more efficiently

Question

I have a simple task. I have two Excel files with company names. The files are fairly large (about 170k rows each). My task is to take each company name in one file and print every identical name from the other. For example, given Table A:

id	company name
0	Born SA
1	ToBeBorn SA
2	Ice SA
3	Icey SA

and table B:

id	company name
0	Born SA
1	ToBeBornInEU SA
2	IceCake SA
3	Icey SA

I want to find the names from Table A that also appear in Table B, so the result here would be: Born SA, Icey SA.
This is a simple task. My code looks like this:

import pandas as pd

clients_a = pd.read_excel("excel_file_number1")
clients_b = pd.read_excel("excel_file_number2")

for clientA in clients_a["Clients"]:
    for clientB in clients_b["Clients"]:
        if clientA.lower() == clientB.lower():
            print(clientA)

I use lower() because the same company may be entered differently in each file. In Table A it may be Ice SA but in Table B it's ICE SA, and it's still the same company. My question is: how can I make this faster and more efficient? The nested loops compare every name in A with every name in B (roughly 170k × 170k comparisons), so it takes a lot of time, and I don't have any idea how to speed it up. It's a simple task, so there must be a way. Any help would be greatly appreciated!

You can implement a trie, or build digests of strings into a hash backed map and then do O(1) lookups. — Oyster773, Feb 04 '22 at 00:24
Also https://stackoverflow.com/questions/53645882/pandas-merging-101 — BigBen, Feb 04 '22 at 00:25
@BigBen almost, but my problem is that I have different column names. — neekitit, Feb 04 '22 at 00:31
The second link clearly covers the case of different column names. — BigBen, Feb 04 '22 at 00:32
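
For reference, the merge approach from the comments might look roughly like the sketch below. The frames are small in-memory stand-ins for the two Excel files, and the column names "Clients" and "Company" as well as the helper columns "key" and "key_b" are assumptions chosen to show the different-column-names case via left_on / right_on.

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel files; column names are assumed.
table_a = pd.DataFrame({"Clients": ["Born SA", "ToBeBorn SA", "Ice SA", "Icey SA"]})
table_b = pd.DataFrame({"Company": ["BORN SA", "ToBeBornInEU SA", "IceCake SA", "Icey SA"]})

# Lowercased helper columns so the join is case-insensitive.
table_a["key"] = table_a["Clients"].str.lower()
table_b["key_b"] = table_b["Company"].str.lower()

# Inner join on the differently named key columns; dropping duplicate keys
# in B ensures each row of A matches at most once.
merged = table_a.merge(
    table_b.drop_duplicates("key_b"),
    left_on="key",
    right_on="key_b",
    how="inner",
)

common = merged["Clients"].tolist()
print(common)
```

pandas joins on hashed keys here, so this avoids comparing every name in A against every name in B.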

score 0 · Answer 1 · answered Feb 04 '22 at 00:30

Use set(). If you fill two sets with the lowercased names, their intersection gives the names that appear in both files.
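
A minimal sketch of the set idea, assuming the names have already been read into plain Python lists (the sample values below are hypothetical). It builds one set from Table B and filters Table A against it, which is equivalent to an intersection but keeps Table A's original spelling and order.

```python
# Hypothetical sample data standing in for the "Clients" columns of the two files.
names_a = ["Born SA", "ToBeBorn SA", "Ice SA", "Icey SA"]
names_b = ["born sa", "ToBeBornInEU SA", "IceCake SA", "ICEY SA"]

# Build the set of lowercased names from B once; membership tests on a set
# take O(1) time on average, instead of a full scan of B per name.
lowered_b = {name.lower() for name in names_b}

# Keep every name from A whose lowercased form also appears in B.
matches = [name for name in names_a if name.lower() in lowered_b]

print(matches)
```

With the real data, the lists could come from clients_a["Clients"] and clients_b["Clients"]; the total work is then roughly proportional to the number of rows in both files rather than to their product.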

Ed Behn

score 0 · Answer 2 · answered Feb 04 '22 at 00:34

This should be faster although I haven't tested it:

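# A Series has no .lower() method; the string methods live under the .str accessor
clients_a["lower"] = clients_a["Clients"].str.lower()
clients_b["lower"] = clients_b["Clients"].str.lower()

# Boolean mask: True where a lowercased name from A also appears in B
mask = clients_a["lower"].apply(lambda x: (clients_b["lower"] == x).any())
print(clients_a.loc[mask, "Clients"])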
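
Note that the apply call above still scans all of Table B once per row of Table A. As a possible alternative (not part of the original answer), Series.isin does the membership test with a hash-based lookup. The sketch below uses small hypothetical frames in place of the Excel files and assumes both use the "Clients" column.

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel files.
clients_a = pd.DataFrame({"Clients": ["Born SA", "ToBeBorn SA", "Ice SA", "Icey SA"]})
clients_b = pd.DataFrame({"Clients": ["BORN SA", "ToBeBornInEU SA", "IceCake SA", "Icey SA"]})

# Lowercase both columns through the .str accessor.
lower_a = clients_a["Clients"].str.lower()
lower_b = clients_b["Clients"].str.lower()

# True for each row of A whose lowercased name occurs anywhere in B.
mask = lower_a.isin(lower_b)

# Select the original (non-lowercased) names from A.
matches = clients_a.loc[mask, "Clients"]
print(matches.tolist())
```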

nnsk