
I have attached a sample of my dataset. I have minimal pandas experience, so I'm struggling to formulate the problem.


What I'm trying to do is populate the 'dist' column with the Cartesian distance between p1 = (lat1, long1) and p2 = (lat2, long2) for each index, based on the state and the county.

Each county may have multiple p1's. We use the one nearest to p2 when computing the distance. When a county doesn't have a p1 value, we simply use the next one that comes in the sequence.

How do I set up this problem concisely? I can imagine running an iterator over the county/state, but I can't get beyond that.

[EDIT] Here is the data frame head as suggested below. (Ignore the mismatch from the picture)

   lat1 long1 state           county   lat2  long2
0     .     .    AK   Aleutians West   11.0   23.0
1     .     .    AK     Wade Hampton   33.0   11.0
2     .     .    AK      North Slope   55.0   11.0
3     .     .    AK  Kenai Peninsula   44.0   11.0
4     .     .    AK        Anchorage   11.0   11.0
5     1     2    AK        Anchorage    NaN    NaN
6     .     .    AK        Anchorage   55.0   44.0
7     3     4    AK        Anchorage    NaN    NaN
8     .     .    AK        Anchorage    3.0    2.0
9     .     .    AK        Anchorage    5.0   11.0
10    .     .    AK        Anchorage   42.0   22.0
11    .     .    AK        Anchorage   11.0    2.0
12    .     .    AK        Anchorage  444.0    1.0
13    .     .    AK        Anchorage    1.0    2.0
14    0     2    AK        Anchorage    NaN    NaN
15    .     .    AK        Anchorage    1.0    1.0
16    .     .    AK        Anchorage  111.0   11.0
srkdb
  • Please no images of data, post `df.head()` or `df.head(15)` or whatever, but no pictures. – SpghttCd Nov 19 '18 at 01:54
  • @SpghttCd: The data is deliberately scrambled. It doesn't reflect actual values. Thanks for the note, though. Also may I ask why? – srkdb Nov 19 '18 at 01:58
  • Because people who try to help almost always want to play with the data in question first. So copy/pasteable code and data are just nice for the people being asked for help. – SpghttCd Nov 19 '18 at 02:05
  • Oh I see. I've added the data head now. Thanks. – srkdb Nov 19 '18 at 02:10

1 Answer


Here's how I would do it using Shapely, the geometry engine underlying GeoPandas. I'm going to use randomized data.

from shapely.geometry import LineString
import pandas as pd
import random


def gen_random():
    # 20 random integer coordinates between 1 and 100
    return [random.randint(1, 100) for _ in range(20)]

j = {"x1": gen_random(), "y1": gen_random(),
     "x2": gen_random(), "y2": gen_random()}
df = pd.DataFrame(j)


def get_distance(k):
    # Length of the straight segment from (x1, y1) to (x2, y2)
    lstr = LineString([(k.x1, k.y1), (k.x2, k.y2)])
    return lstr.length

df["Dist"] = df.apply(get_distance, axis=1)
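
For plain Cartesian distance, the same column can also be computed without Shapely. A minimal sketch, assuming NumPy is available and reusing the `x1`, `y1`, `x2`, `y2` column names from above; the small fixed frame is made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data only; column names follow the example above.
df = pd.DataFrame({"x1": [0, 1, 2], "y1": [0, 1, 2],
                   "x2": [3, 1, 5], "y2": [4, 1, 6]})

# Euclidean distance for every row at once, no per-row apply needed.
df["Dist"] = np.hypot(df["x2"] - df["x1"], df["y2"] - df["y1"])
```

On large frames this tends to be faster than `df.apply(..., axis=1)`, since the arithmetic runs on whole columns.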

Shapely: http://toblerity.org/shapely/manual.html#introduction
Geopandas: http://geopandas.org/
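
The "nearest p1 within the county" rule from the question could be handled roughly as follows. This is only a sketch under several assumptions: the `.` placeholders mean "missing", "the next one that comes in the sequence" is read as the first p1 row appearing later in the frame, the helper name `nearest_p1_distance` is hypothetical, and the rows are made-up values in the question's layout.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same layout as the question's data frame head.
df = pd.DataFrame({
    "lat1":  [".", ".", "1", ".", "3", "."],
    "long1": [".", ".", "2", ".", "4", "."],
    "state": ["AK"] * 6,
    "county": ["North Slope", "Anchorage", "Anchorage",
               "Anchorage", "Anchorage", "Anchorage"],
    "lat2":  [55.0, 11.0, np.nan, 55.0, np.nan, 3.0],
    "long2": [11.0, 11.0, np.nan, 44.0, np.nan, 2.0],
})

# Assumption: "." marks a missing p1 coordinate, so coerce it to NaN.
for col in ["lat1", "long1"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# All rows that actually carry a p1 point.
p1 = df.dropna(subset=["lat1", "long1"])


def nearest_p1_distance(row):
    """Cartesian distance from this row's p2 to the nearest p1 in its county."""
    if pd.isna(row["lat2"]) or pd.isna(row["long2"]):
        return np.nan  # rows without a p2 get no distance
    same = p1[(p1["state"] == row["state"]) & (p1["county"] == row["county"])]
    if same.empty:
        # Assumed fallback: the first p1 row that comes later in the frame.
        later = p1[p1.index > row.name]
        if later.empty:
            return np.nan
        same = later.iloc[[0]]
    d = np.hypot(same["lat1"] - row["lat2"], same["long1"] - row["long2"])
    return d.min()


df["dist"] = df.apply(nearest_p1_distance, axis=1)
```

The row-wise `apply` keeps the logic easy to follow. If the fallback rule means something else in the real data, only the `same.empty` branch needs to change.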

Charles Landau
  • I understand this helps with the high-level idea. Would it be able to incorporate the other nuances of the problem, like checking for the nearest one? Also, what if I am computing the Haversine distance instead of the Cartesian one? – srkdb Nov 19 '18 at 02:20
  • If you work at it, you can use the `length` attribute of a `LineString` object for any analysis that requires Cartesian length. But that's beyond the scope of your original question, and if the response addresses your original question then please mark it as answered – Charles Landau Nov 19 '18 at 02:22
  • Haversine requires more data than length, but nearest should be doable. Again, those are questions beyond the scope of the original question, so please mark the response as answered if it addresses your post – Charles Landau Nov 19 '18 at 02:31
  • @db18 refer to this thread for more on haversine https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points – Charles Landau Nov 19 '18 at 02:33
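
For the Haversine variant raised in the comments, a minimal sketch of the standard great-circle formula is given below. It assumes coordinates in decimal degrees and a mean Earth radius of 6371 km; the function name `haversine_km` is hypothetical and not taken from the linked thread.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or NumPy/pandas arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# A quarter of the way around the equator.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
```

Because it uses only NumPy operations, the function can be called with whole columns, for example `haversine_km(df["lat1"], df["long1"], df["lat2"], df["long2"])`.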