2

I have a long dataframe of (about 400) people who participate in an event. You see the first 9 rows below. This event is about cooking at home. As a host, people come to your place and you will cook either appetizer, main dish or dessert for your guests. The location of each host is given by the latitude, longitude and the floor:

d = {'latitude': [1.1, 1.1,1.2,1.3,1.3,1.4,1.4,1.4,1.4],
     'longitude': [1.1, 1.1, 2, 1.3, 1.3, 1.8, 1.8, 1.8, 1.8],
     'floor': [2,3,1,4,4,3,3,3,1],
     'host': ['appetizer', 'appetizer', 'appetizer', 'main', 'main', 'main', 'dessert', 'dessert', 'dessert']}
df = pd.DataFrame(data=d)

The length of the df is always a multiple of 9. For example 27. The dataframe columns are already sorted by size, meaning that people who live close together (share same spatial coordinates) are closer in index value.

enter image description here

Goal:

  • I want to divide the df in clusters of 9 people. (This is why length of df is multiple of 9).
  • In each cluster I want 3 people to prepare the appetizer, main dish and dessert, respectively.

This itself is simple, as you can already see in my dataframe. The challenge comes with the constraint.

Constraint:

  • People who share the same dish (appetizer, main, dessert) should ideally not live at the same spatial coordinates (lat., lon., floor). Of course this is not always possible, then this constraint can be neglected.

In the example index 0, 1, 3 are correct since all live at different places. However, index 3 and 4 live together as well as index 6 and 7. Therefore, the choice of setting the hosts values like it is, was not smart.

Any idea how to set the column "host" so that the constraint is fulfilled?

EDIT 3:

import pandas as pd
from math import radians, sin, cos, atan2, sqrt
from random import random

def distance(lat1, lon1, lat2, lon2):
    R = 6373.0 # approximate radius of earth in km
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = (R * c)
    return distance # in unit 'kilometers'

d = {
        'latitude': [1.1,1.2,1.1, 1.3,1.3,1.4,1.4,1.4,1.4],
        'longitude': [1.1, 2, 1.1, 1.3, 1.3, 1.8, 1.8, 1.8, 1.8],
        'floor': [3,1,2,2,4,3,2,1,4],
        'id': [1,2,3,4,5,6,7,8,9]
    }

df = pd.DataFrame(data=d)
df.set_index('id', drop=False, inplace=True)

df['host'] = 'None'
df['group'] = 'None'

max_distance = 85

# emptyGroupList = QUERY: "SELECT people WHERE group = None ORDER BY lat,long, floor"
emptyGroupList = df.loc[df['host'] == 'None'].sort_values(['latitude', 'longitude', 'floor'], ascending=True)

print(emptyGroupList)

group = []
groupNumber = 0

for index, person in df.iterrows():
  if len(group) == 3:
    for group_person in group:
      df.at[group_person['id'], 'group'] = groupNumber
    groupNumber += 1
    group = []

  if len(group) == 0:
    group.append(person)
  else:
    for group_person in group:
      #print(group_person['floor'])
      dist = distance(person['latitude'], person['longitude'], group_person['latitude'], group_person['longitude'])
      if group_person['floor'] == 1:
        dist=dist+1
      elif group_person['floor'] == 2:
        dist=dist+2
      elif group_person['floor'] == 3:
        dist=dist+3
      elif group_person['floor'] == 4:
        dist=dist+4
      else:
        dist=dist+5

      if 0 < dist <= max_distance:
        group.append(person)


print(df)
  • Note, that I added the line group=[ ] to your code-example!

This is how the data looks at the start:

enter image description here

After applying the algorithm the first time:

enter image description here

  1. This looks wrong since index 4 and 5 share same location. It is strange, because we actually say in the code, that the distance should be larger 0.
  2. Applying the for-loop multiple times does not effect all person in the group. That means, at the end, people are without a group. My goal is, that every person gets a group (appertizer, main, dessert).
  3. Maybe it is better to wrapp everything into a function and apply it multiple times until condition is satisfied? But before doing that, issue (1) and (2) need to be solved.
  4. You said there are 2 ways to include a new person to a group. I do not have any preference. The simples way is the best way.
PParker
  • 1,419
  • 2
  • 10
  • 25
  • 1
    Currently the `distance` function doesn't take into account the `floor`. You can either add `floor` to the `distance` as a 3rd coordinate OR check the `floor` manually in the line `if 0 < dist <= max_distance:`. – VictorDDT Dec 30 '19 at 00:33
  • Thanks Victor! It's not that simple. First I changed the 'dist-function'. Then I checked 'floor' manually (see edited version above). It always runs into an endless loop! I debugged it, but still it is not obvious to me what happens. However, I do not see, how all this solves issue number (2). --> I have slightly changed the input data. Now, every person shoul find / get one group. – PParker Jan 02 '20 at 15:11

1 Answers1

2

I would think about a function which groups people by 3 not by 9 into one group. People in the group shall meet the following requirements:

  1. they don't belong to any group yet
  2. their coordinates are close enough
  3. their coordinates are not the same (people are not in the same place)

here is a pseudo code:

FUNC assignGroup()

    emptyGroupList = QUERY: "SELECT people WHERE group = None ORDER BY lat, long, floor"
    group = []
    groupNumber = 0

    FOR EACH person in emptyGroupList LOOP
        IF group.size == 3 THEN
            QUERY: SET group = groupNumber FOR persons in group
            groupNumber += 1
            group = []
        END IF

        IF group.size == 0 THEN
            group.append(person)
        ELSE
            IF ("person" meets requirements 2 and 3 with people in the "group") THEN
                group.append(person)
        END IF
    END FOR

END FUNC

Probably you will need to run the function several times to cover all persons. Then you can assign any roles (hosts) inside each group by groupNumber as you wish.

Using such approach will give you more control how people are divided into groups and you can put more complex logic there.

Hopefully it makes sense.

EDIT 1

Here is the sample code:

import pandas as pd
from math import radians, sin, cos, atan2, sqrt

def distance(lat1, lon1, lat2, lon2):
    R = 6373.0 # approximate radius of earth in km
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c 
    return distance # in unit 'kilometers'

d = {
        'latitude': [1.1,1.2,1.1, 1.3,1.3,1.4,1.4,1.4,1.4],
        'longitude': [1.1, 2, 1.1, 1.3, 1.3, 1.8, 1.8, 1.8, 1.8],
        'floor': [3,1,2,4,4,3,3,3,1],
        'id': [1,2,3,4,5,6,7,8,9]
    }

df = pd.DataFrame(data=d)
df.set_index('id', drop=False, inplace=True)

df['host'] = 'None'
df['group'] = 'None'

max_distance = 85

# emptyGroupList = QUERY: "SELECT people WHERE group = None ORDER BY lat,long, floor"
emptyGroupList = df.loc[df['host'] == 'None'].sort_values(['latitude', 'longitude', 'floor'], ascending=True)

print(emptyGroupList)

group = []
groupNumber = 0

for index, person in df.iterrows():
    if len(group) == 3:
        for group_person in group:
            df.at[group_person['id'], 'group'] = groupNumber
        groupNumber += 1

    if len(group) == 0:
        group.append(person)
    else: 
        for group_person in group:
            dist = distance(person['latitude'], person['longitude'], group_person['latitude'], group_person['longitude'])
            if 0 < dist <= max_distance:
                group.append(person)


print(df)

Don't forget consider that there are 2 ways to include a new person to a group. You can count distance with the nearest person or with all persons in the group. Refer to the picture below. If person 1 and 2 are in the group. So person "A" and person "B" will have different conditions to be included to the same group.

enter image description here

VictorDDT
  • 583
  • 9
  • 26
  • Awesome and thanks a lot. I have edited my question according to what you suggested. I do not know how to implement the step 2. Maybe you can help me with that? I have prepared a (working?) example. Not sure if my code works so far... – PParker Dec 26 '19 at 21:44
  • 1
    I would consider `(lat, long, floor)` as a 3D coordinates of people. So you just need to [calculate a distance in 3D space](https://www.engineeringtoolbox.com/distance-relationship-between-two-points-d_1854.html). [Here](https://stackoverflow.com/questions/1369512/converting-longitude-latitude-to-x-y-on-a-map-with-calibration-points) is a suggestion from @Gubatron how to convert `(lat, long)` to flat coordinates. Of course, you can use any other approach here if it doesn't meet your goal. – VictorDDT Dec 26 '19 at 22:00
  • Thanks Victor! In my particular case, the distance give by the function 'distance' works fine. I do not need a 3D distance. The problem that I have, is to transform your pseudo-code into real code. As you can see, I struggle with constrain number (2). According to you, in this step, I have to compare the distance of each person in the group with every other person. It is not clear to me, to put this into real code... It is all about this: – PParker Dec 26 '19 at 22:18
  • 1
    @PParker look at the EDIT1 in my answer – VictorDDT Dec 27 '19 at 06:49
  • Awesome! Thank you very much for your help Victor! I have added some comments in my first post (EDIT 2). Sorry, for being so picky with that. This algorithm is the most crucial part of my work therefore I have to get this right! Maybe you can comment on my issues above? Thanks a lot in advance! – PParker Dec 27 '19 at 18:23