  • In the crq_data file I have cities and states from a user-uploaded *.csv file
  • In the cityCoordinates.csv file I have a library of American cities and states along with their coordinates. I would like this to be a sort of "look-up tool" to compare against an uploaded .csv file, find the coordinates, and map them in Folium

Right now, it reads line by line and appends the coordinates one at a time (roughly one row per second). I would like it to run much faster, so that if there are 6000 lines the user doesn't have to wait 6000 seconds.

Here is part of my code:

crq_file = askopenfilename(filetypes=[('CSV Files', '*.csv')])
crq_data = pd.read_csv(crq_file, encoding="utf8")
coords = pd.read_csv("cityCoordinates.csv")

for crq in range(len(crq_data)):
    task_city = crq_data.iloc[crq]["TaskCity"]
    task_state = crq_data.iloc[crq]["TaskState"]

    for coordinates in range(len(coords)):
        cityCoord = coords.iloc[coordinates]["City"]
        stateCoord = coords.iloc[coordinates]["State"]
        latCoord = coords.iloc[coordinates]["Latitude"]
        lngCoord = coords.iloc[coordinates]["Longitude"]

        if task_city == cityCoord and task_state == stateCoord:
            crq_data.loc[crq, "CRQ Latitude"] = latCoord
            crq_data.loc[crq, "CRQ Longitude"] = lngCoord

            print(cityCoord, stateCoord, latCoord, lngCoord)

This is an example of the current terminal output (screenshot omitted).

Example of the uploaded .csv file (screenshot omitted).

Anna M
  • It looks like you're trying to do a spatial join by aligning data or geometries between two dataframes. I'm not exactly sure how your merge is happening, but likely either [`pd.merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) or [`geopandas.sjoin`](https://geopandas.org/en/stable/docs/reference/api/geopandas.sjoin.html) will do the trick, and will probably be a couple orders of magnitude faster than the double for loop implementation. – Michael Delgado Jul 05 '22 at 21:30
  • Please have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and [edit] your question to include a sample of your input dataframes and your expected output as text in the body of the question, not as images or links, to make a [mcve] so that we can better understand how to help – G. Anderson Jul 05 '22 at 22:58

1 Answer


I see this not as a problem of optimizing Pandas, but of finding a good data structure for fast lookups: and a good data structure for fast lookups is the dict. The dict costs memory, though; you'll need to evaluate that trade-off for yourself.

I mocked up what your cityCoordinates CSV could look like:

| City     | State | Latitude   | Longitude   |
|----------|-------|------------|-------------|
| Portland | OR    | 45°31′12″N | 122°40′55″W |
| Dallas   | TX    | 32°46′45″N | 96°48′32″W  |
| Portland | ME    | 43°39′36″N | 70°15′18″W  |

import csv
import pprint


def cs_key(city_name: str, state_name: str) -> str:
    """Make a normalized City-State key."""
    return city_name.strip().lower() + "--" + state_name.strip().lower()


# A dict of { "City_name-State_name": (latitude, longitude), ... }
coords_lookup = {}

with open("cityCoordinates.csv", newline="") as f:
    reader = csv.DictReader(f)  # your coords file appears to have a header
    for row in reader:
        city = row["City"]
        state = row["State"]
        lat = row["Latitude"]
        lon = row["Longitude"]

        key = cs_key(city, state)
        coords_lookup[key] = (lat, lon)


pprint.pprint(coords_lookup, sort_dicts=False)

When I run that, I get:

{'portland--or': ('45°31′12″N', '122°40′55″W'),
 'dallas--tx':   ('32°46′45″N', '96°48′32″W'),
 'portland--me': ('43°39′36″N', '70°15′18″W')}

Now, iterating the task data looks pretty much the same: we take a pair of City and State, make a normalized key out of them, then try to look up that key for known coordinates.

I mocked up some task data:

| TaskCity   | TaskState |
|------------|-----------|
| Portland   | OR        |
| Fort Worth | TX        |
| Dallas     | TX        |
| Boston     | MA        |
| Portland   | ME        |

and when I run this:

with open("crq_data.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        city = row["TaskCity"]
        state = row["TaskState"]

        key = cs_key(city, state)
        coords = coords_lookup.get(key, (None, None))
        if coords != (None, None):
            print(city, state, coords[0], coords[1])

I get:

Portland OR 45°31′12″N 122°40′55″W
Dallas TX 32°46′45″N 96°48′32″W
Portland ME 43°39′36″N 70°15′18″W
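The loop above only prints matches. If you also want the coordinates written back into the pandas DataFrame (the CRQ Latitude / CRQ Longitude columns from the question), a minimal sketch using the same normalized-key dict with `Series.map` (the data here is mocked, not the real files):

```python
import pandas as pd

# Same shape of lookup the answer builds: normalized "city--state" -> (lat, lon)
coords_lookup = {
    "portland--or": ("45°31′12″N", "122°40′55″W"),
    "dallas--tx": ("32°46′45″N", "96°48′32″W"),
}

# Stand-in for the uploaded crq_data
crq_data = pd.DataFrame({
    "TaskCity": ["Portland", "Dallas", "Boston"],
    "TaskState": ["OR", "TX", "MA"],
})

# Build the normalized key for every row at once, then one dict lookup per row
keys = (crq_data["TaskCity"].str.strip().str.lower()
        + "--"
        + crq_data["TaskState"].str.strip().str.lower())
found = keys.map(coords_lookup)  # NaN where the key is unknown

# Split the (lat, lon) tuples into the two result columns
crq_data["CRQ Latitude"] = found.str[0]
crq_data["CRQ Longitude"] = found.str[1]
```

Rows with no match (Boston, MA here) simply end up with NaN coordinates, which Folium code can filter out before plotting.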

This solution is going to be much faster in principle because you're no longer running a quadratic cityCoordinates-rows × taskData-rows loop. In practice, Pandas also suffers when iterating rows; I'm not sure if the same holds for indexing (iloc), but in general Pandas is for manipulating columns of data, and I would say it is not for row-oriented problems/solutions.
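Alternatively, as the comments suggest, a single vectorized `pd.merge` keeps everything column-oriented. A hedged sketch with stand-in DataFrames (not the real uploaded files):

```python
import pandas as pd

# Hypothetical stand-ins for the two CSVs in the question
crq_data = pd.DataFrame({
    "TaskCity": ["Portland", "Fort Worth", "Dallas"],
    "TaskState": ["OR", "TX", "TX"],
})
coords = pd.DataFrame({
    "City": ["Portland", "Dallas", "Portland"],
    "State": ["OR", "TX", "ME"],
    "Latitude": ["45°31′12″N", "32°46′45″N", "43°39′36″N"],
    "Longitude": ["122°40′55″W", "96°48′32″W", "70°15′18″W"],
})

# One left join replaces the double loop; rows with no match get NaN
merged = crq_data.merge(
    coords,
    how="left",
    left_on=["TaskCity", "TaskState"],
    right_on=["City", "State"],
).rename(columns={"Latitude": "CRQ Latitude", "Longitude": "CRQ Longitude"})
```

Note this matches on exact strings; if the uploaded file has stray whitespace or mixed case, you'd want to normalize both sides (e.g. `.str.strip().str.lower()`) before merging, just as the dict approach does.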

Zach Young