
I have a dataframe as follows:

df

         KEY    NAME      ID_LOCATION                                                    _GEOM
0       61196   name1    [(u'-88.121429', u'41.887726')]                            [[[lon00,lat00],[lon01, lat01]]]        
1       61197   name2    [(u'-75.161934', u'38.725163')]                            [[[lon10,lat10], [lon11,lat11],...]]
2       61199   name3    [(u'-88.121429', u'41.887726'), (-77.681931, 37.548851)]   [[[lon20, lat20],[lon21, lat21]]]

where ID_LOCATION is a list of tuples. How can I group by ID_LOCATION so that if two rows share a matching (lon, lat) pair, those rows are merged and the other columns are joined with commas?

expected_output_df

      KEY             NAME             ID_LOCATION                                                   _GEOM
0    61196,61199   name1,name3    [(u'-88.121429', u'41.887726'), (-77.681931, 37.548851)]      [[[lon00, lat00],[lon01, lat01],[lon20, lat20],[lon21, lat21]]]
1    61197         name2          [(u'-75.161934', u'38.725163')]                                [[[lon10,lat10], [lon11,lat11],...]]

I tried the following but with no success; it gives me an "unhashable type: 'list'" error:

def f(x):
    return pd.Series(dict(KEY='{%s}' % ', '.join(x['KEY']),
                          NAME='{%s}' % ', '.join(x['NAME']),
                          ID_LOCATION='{%s}' % ', '.join(x['ID_LOCATION']),
                          _GEOM='{%s}' % ', '.join(x['_GEOM']))
                     )
df = df.groupby('ID_LOCATION').apply(f) 
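
For reference, the error comes from the fact that lists are not hashable, so they cannot be used as a groupby key directly. A minimal sketch of what I mean (converting the column to tuples makes it hashable, but then only rows with an identical coordinate list are grouped, not rows that merely share one pair):

# Lists are unhashable, so this raises: TypeError: unhashable type: 'list'
# df.groupby('ID_LOCATION')

# Converting to tuples makes the key hashable, but it only groups exact matches,
# not rows that share just one (lon, lat) pair, which is what I actually need.
df.groupby(df['ID_LOCATION'].apply(tuple)).size()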
Atihska
  • just to be clear: your `id_loc` has two different data types: `unicode` and `float`? Why don't you convert them into one unified type? – Quang Hoang Apr 22 '19 at 17:59
  • https://stackoverflow.com/questions/39703165/join-dataframes-by-column-values-pandas – Tokci Apr 22 '19 at 18:04
  • @QuangHoang How do you do that? – Atihska Apr 22 '19 at 18:05
  • @Tokci I think that's different. – Atihska Apr 22 '19 at 18:05
  • @Atihska did your question get answered? – Antoine Zambelli May 07 '19 at 01:00
  • @AntoineZambelli Not really. I am not sure if your code is merging any duplicates or not. The size of the data remained the same, so I have to check whether my data contained no duplicates at all, which is unlikely. My data is huge. I haven't looked at it in a while but am now getting back to it and verifying. Will update this week. – Atihska May 07 '19 at 23:21

2 Answers


I think this should work.

First, convert everything into lists of the same type (so that sum will concatenate them).

import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[['61196'], ['name1'], [('-88.121429', '41.887726')]], [['61197'], ['name2'], [('-75.161934', '38.725163')]], [['61199'], ['name3'], [('-88.121429', '41.887726'), ('-77.681931', '37.548851')]]],
    columns=['KEY', 'NAME', 'id_loc']
)
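
As a quick sanity check (my own addition, not part of the original answer), object columns holding lists concatenate under sum, which is what the merge step below relies on:

print(pd.Series([['61196'], ['61199']]).sum())  # expected: ['61196', '61199']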

Then get pairwise combinations of rows (for id_loc), i.e. pairs of rows to merge together.

# Loop through all pairwise combinations of rows (we need the index, so loop over range() instead of the raw values).
to_merge = []  # list of index-tuples, rows to merge together.
for i, j in itertools.combinations(range(len(df['id_loc'].values)), 2):
    a = df['id_loc'].values[i]
    b = df['id_loc'].values[j]

    # Check for shared elements.
    if not set(a).isdisjoint(b):
        # Shared elements found.
        to_merge.append([i, j])
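
For the sample data above, only rows 0 and 2 share a coordinate pair, so I would expect a single index pair here:

print(to_merge)  # expected: [[0, 2]] -- rows 0 and 2 share ('-88.121429', '41.887726')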

Now handle the case where there are 3 or more rows to merge, i.e. to_merge = [[1, 2], [2, 3]] should become to_merge = [[1, 2, 3]].

def find_intersection(m_list):
    # Recursively merge any sets in m_list that share elements.
    for i, v in enumerate(m_list):
        for j, k in enumerate(m_list[i + 1:], i + 1):
            if v & k:
                m_list[i] = v.union(m_list.pop(j))
                return find_intersection(m_list)
    return m_list

to_merge = [set(i) for i in to_merge if i]
to_merge = find_intersection(to_merge)
to_merge = [list(x) for x in to_merge]

(found from this answer)
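
To illustrate the chaining on a made-up input (my example, not from the linked answer):

print(find_intersection([{1, 2}, {2, 3}, {4, 5}]))  # expected: [{1, 2, 3}, {4, 5}]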

Go through and sum all the rows that need to be merged (and drop the pre-merge rows):

for idx_list in to_merge:
    df.iloc[idx_list[0], :] = df.iloc[idx_list, :].sum()
    df.iloc[idx_list[1:], :] = np.nan

df = df.dropna()
df['id_loc'] = df['id_loc'].apply(lambda x: list(set(x)))  # shared coords would be duped.
print(df)
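
On the sample frame this should leave two rows, roughly as sketched below (the order inside id_loc is not guaranteed because of the set() round-trip; this is an expectation, not captured output):

              KEY            NAME                                                      id_loc
0  [61196, 61199]  [name1, name3]  [('-88.121429', '41.887726'), ('-77.681931', '37.548851')]
1         [61197]         [name2]                              [('-75.161934', '38.725163')]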
Antoine Zambelli

Antoine Zambelli's answer is very good; as an exercise, and in the hope it helps anyway, I want to share my personal approach to the subject. It's not fully tested, but it should work.

# Merge the values of `column` across each group of row indexes (relies on the module-level df).
def merge_elements(ensemble, column):
    upper_list = []
    for index in ensemble:
        element_list = []
        for item in index:
            if not isinstance(df.loc[item, column], list):
                if not df.loc[item, column] in element_list:
                    element_list.append(df.loc[item, column])
            else:
                for obj in df.loc[item, column]:
                    if not obj in element_list:
                        element_list.append(obj)
        upper_list.append([element_list, index])
    return upper_list

# Write the merged values back into the dataframe (as strings).
def put_in_df(df, piped, column):
    for elem in piped:
        for i in range(len(elem[1])):
            if column == "NAME" or column == "_GEOM":
                df.loc[elem[1][i], column] = str(elem[0]).replace("'", "")
            else:
                df.loc[elem[1][i], column] = str(elem[0])
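
# Example of the data flow I would expect on the sample df (a hedged illustration, not from the original answer):
# with ensemble = ([0, 2],), merging the NAME column gives
#   merge_elements(([0, 2],), "NAME")  ->  [[['name1', 'name3'], [0, 2]]]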


# get list from df
list_of_locations = df.ID_LOCATION.tolist()

# get list of rows that need to be merged (no itertools needed)
# the dictionary used here is overkill; a plain list would work just as well
rows = {}
for i, item in enumerate(list_of_locations):
    if isinstance(item, list):
        for j in range(0, len(item)):
            if item[j] in rows:
                rows[item[j]] = [rows[item[j]], i]
            else:
                rows[item[j]] = i
    else:
        if item in rows:
            rows[item] = [rows[item], i]
        else:
            rows[item] = i
ensemble = []

# as noted above, the dictionary is not strictly needed, so this step could be simplified
for item in rows.values():
    if isinstance(item, list):
        ensemble.append(item)
# conversion to tuple is optional
ensemble = tuple(ensemble)

# merge list of tuples according to indexes retrieved
put_in_df(df, merge_elements(ensemble, "ID_LOCATION"), "ID_LOCATION")
put_in_df(df, merge_elements(ensemble, "NAME"), "NAME")
put_in_df(df, merge_elements(ensemble, "KEYS"), "KEYS")
put_in_df(df, merge_elements(ensemble, "_GEOM"), "_GEOM")

# special thanks to: https://stackoverflow.com/questions/43855462/pandas-drop-duplicates-method-not-working?rq=1
df = df.iloc[df.astype(str).drop_duplicates().index]

As also noted in the code comments, thanks to Pandas drop_duplicates method not working for the trick of dropping duplicates even in the presence of lists.
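
For what it's worth, a quick sanity check I would expect on the sample data (my own expectation, not output from the original answer): the only shared coordinate pair links rows 0 and 2, so ensemble should come out as ([0, 2],) and two rows should remain after the duplicate-dropping step.

print(ensemble)  # expected: ([0, 2],) -- rows 0 and 2 share a (lon, lat) pair
print(len(df))   # expected: 2 once the merged duplicate row is dropped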

Michele Rava