2

I have 2 lists from a csv file.

column1 = ZIP codes
column2 = City Name

I have a string; that contains some random text, and possibly zip codes and cities names.

I want to check for each i if column1[i] & colmun2[i] are in the string.

I used this solution but that only checks if words from 1 list are in string. And doesn't returns position from the original list, so can't match on that after for column2.

I ended up using :

for i in range(39000):
if all(map(lambda w: w in text, (column1[i], column2[i]))):
    print(column1[i], column2[i])

But for two 39000 words list, i'm @ around 0.30 seconds, isn't there any process that could go faster. This solution is twice faster (0.13 to 0.17sec), but only search for 1 word...

Any ideas ? Thanks

Edit reprocuctible example :

CSV file

import pandas as pd

column_names = ["code_commune_insee","nom_de_la_commune", "code_postal","ligne_5","libelle_d_acheminement","coordonnees_gps"]

df = pd.read_csv("laposte_hexasmal.csv", names=column_names, delimiter=';')

column1  = df.code_postal.to_list()
column2 = df.nom_de_la_commune.to_list()

column1_short_version_example = ['48283', '43288', '84389', '403294', '84384', '88439']
column2_short_version_example = ['PARIS', 'Paris', 'London', 'Amsterdam', 'Dublin', 'Manchester'] 
text = 'Hello, yes your order is indeed confirmed for the 14th in our hotel neer PARIS 12, the zip code is 43288 or 75012, if you want to book a night in London two weeks after, we still have room avaible, the postal code for the hotel address is 45 road street 84389'

for i in range(len(column1)):
    if all(map(lambda w: w in text, (column1[i], column2[i]))):
        print(column1[i], column2[i])

The wanted result for short lists version is :

43288 Paris
84389 London

The wanted result for csv guiven file lists version is :

75012 PARIS 12
Darkmoon
  • 25
  • 4
  • 1
    please provide a minimal reproducible example – mozway Sep 22 '22 at 08:54
  • Please provide example input for column1, column2 and the random text, and what is the expected output? – Mortz Sep 22 '22 at 09:00
  • Hi Mortz and mozway, thanks for the feedback, I just added a reproducible example. I hope it helps to better understand my newbee problem. Since zip codes can be the same for several small villages, and cities can have several zip codes for different districts, I try to check and get only the right combination of zip code and city. – Darkmoon Sep 22 '22 at 09:08
  • So your inputs are lists? not pandas DataFrame/Series? – mozway Sep 22 '22 at 09:09
  • Inputs are transformed from df to list yes. using df.ZIP_CODE.to_list(). But I can stay in df is easier and faster, raw data is in CSV. The csv database is from french postal group LaPoste https://www.data.gouv.fr/fr/datasets/base-officielle-des-codes-postaux/ – Darkmoon Sep 22 '22 at 09:14

2 Answers2

1

You can loop directly over the items instead of the indices and use the built-in zip function to loop over both the lists simultaneously -

def op(): # this is your solution
    collect = [] # I am collecting into a list instead of print for benchmark
    for i in range(len(column1)):
        if all(map(lambda w: w in text, (column1[i], column2[i]))):
            collect.append((column1[i], column2[i]))
    return collect

def zip_based(): # this is what I am proposing
    collect = [] # I am collecting into a list instead of print for benchmark
    for zipcode, city in zip(column1, column2):
        if zipcode in text and city in text:
            collect.append((zipcode, city))
    return collect

Output YMMV, but I am seeing a ~3X speed-up -

%timeit op()
# 9.93 µs ± 618 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit zip_based()
# 3.01 µs ± 489 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Mortz
  • 4,654
  • 1
  • 19
  • 35
  • WOW. That's exactly what i was looking for, back to less than 0.15sec for the total lookup. Thanks a lot – Darkmoon Sep 22 '22 at 10:01
0

you should try to iterate through the rows of your dataframe

#Suppose df is your dataframe with columns 'ZIP' and 'City'
#Suppose text is your string "that contains some random text, and possibly zip codes and cities names"
for index, row in df.iterrows():
  if(any(row['ZIP'] in text, row['City'] in text)):
   print(f"Row {index} : {row['ZIP']} and {row['City']}")
Clément Perroud
  • 483
  • 3
  • 12
  • Hello Clément, thanks for the answer, but, doesn't any guives you true if ZIP OR CITY is present in the text ? I want TRUE only if both ZIP code and CITY couple on same index are present in text – Darkmoon Sep 22 '22 at 09:19