4

I am having problems to use the proper pandas function to drop rows in dataframe of duplicate value of a key inside a dict in one of its column lugar. enter image description here

This is the data of the dataframe:

{'calculado': {73683: '2021-05-27 00:43:46',
  83767: '2021-05-27 00:43:46',
  103395: '2021-05-27 00:43:46',
  105314: '2021-05-27 00:43:46',
  116555: '2021-05-27 00:43:46',
  120764: '2021-05-27 00:43:46',
  120892: '2021-05-27 00:43:46',
  122760: '2021-05-27 00:43:46',
  124269: '2021-05-27 00:43:46',
  125707: '2021-05-27 00:43:46'},
 'geom': {73683: '17.649999999999995,-93.65',
  83767: '15.55,-93.25',
  103395: '11.45,-98.45',
  105314: '11.049999999999997,-98.55',
  116555: '8.75,-78.45',
  120764: '7.849999999999997,-89.54999999999998',
  120892: '7.849999999999997,-76.75',
  122760: '7.449999999999998,-81.95',
  124269: '7.149999999999999,-75.04999999999998',
  125707: '6.849999999999998,-75.25'},
 'lat': {73683: 17.649999999999995,
  83767: 15.55,
  103395: 11.45,
  105314: 11.049999999999997,
  116555: 8.75,
  120764: 7.849999999999997,
  120892: 7.849999999999997,
  122760: 7.449999999999998,
  124269: 7.149999999999999,
  125707: 6.849999999999998},
 'lon': {73683: -93.65,
  83767: -93.25,
  103395: -98.45,
  105314: -98.55,
  116555: -78.45,
  120764: -89.54999999999998,
  120892: -76.75,
  122760: -81.95,
  124269: -75.04999999999998,
  125707: -75.25},
 'lugar': {73683: {'distancia': 12.55,
   'mensaje': '13 kms. de Huimanguillo, Tabasco, México',
   'nombre': 'Huimanguillo, Tabasco, México',
   'pais': 'mx'},
  83767: {'distancia': 16.74,
   'mensaje': '17 kms. de Pijijiapan, Chiapas, 30540, México',
   'nombre': 'Pijijiapan, Chiapas, 30540, México',
   'pais': 'mx'},
  103395: 'Mar abierto',
  105314: 'Mar abierto',
  116555: {'distancia': 6.7,
   'mensaje': '7 kms. de Río Congo Arriba, Distrito Santa Fe, Darién, Panamá',
   'nombre': 'Río Congo Arriba, Distrito Santa Fe, Darién, Panamá',
   'pais': 'pa'},
  120764: 'Mar abierto',
  120892: {'distancia': 5.83,
   'mensaje': '6 kms. de Veraguas, Panamá',
   'nombre': 'Veraguas, Panamá',
   'pais': 'co'},
  122760: {'distancia': 100.26,
   'mensaje': '100 kms. de Veraguas, Panamá',
   'nombre': 'Veraguas, Panamá',
   'pais': 'pa'},
  124269: {'distancia': 12.09,
   'mensaje': '12 kms. de Anorí, Nordeste, Antioquia, Región Andina, 052857, Colombia',
   'nombre': 'Anorí, Nordeste, Antioquia, Región Andina, 052857, Colombia',
   'pais': 'co'},
  125707: {'distancia': 4.03,
   'mensaje': '4 kms. de Guadalupe, Norte, Antioquia, Región Andina, Colombia',
   'nombre': 'Guadalupe, Norte, Antioquia, Región Andina, Colombia',
   'pais': 'co'}},
 'valor': {73683: 198,
  83767: 198,
  103395: 197,
  105314: 198,
  116555: 198,
  120764: 198,
  120892: 198,
  122760: 198,
  124269: 196,
  125707: 198},
 'variable': {73683: 'T',
  83767: 'T',
  103395: 'T',
  105314: 'T',
  116555: 'T',
  120764: 'T',
  120892: 'T',
  122760: 'T',
  124269: 'T',
  125707: 'T'}}

As you can see, the lugar column has a dictionary and one of the keys is nombre in this case the value: Veraguas, Panamá is duplicated, I will like to drop duplicates rows of dataframe and keep one row only per name for the rows that has the dict and key in lugar column.

One approach I have tried is to create a new column with the value of the key and then run drop_duplicates but I am unable to get the value inside the column. but I am able to get it for the first row like this

df_asc['lugar'].iloc[0]['nombre'] -> Huimanguillo, Tabasco, México

Is there a way to do this without looping the df doing it manually? I am really new to Python and Pandas.

EDITED: Expected result I converted to csv to be able to delete in spreadsheet as I am unable to do it with pandas... enter image description here

neavilag
  • 609
  • 1
  • 8
  • 20
  • 2
    +1 on the sample dataframe. easy to reproduce. Kindly share the expected output as well; just to be sure that solutions proferred will match what you are after – sammywemmy May 27 '21 at 01:17

3 Answers3

2

An option via loc + duplicated + isna:

s = df['lugar'].str['nombre']
df.loc[~s.duplicated() | s.isna()]
                  calculado  ... variable
73683   2021-05-27 00:43:46  ...        T
83767   2021-05-27 00:43:46  ...        T
103395  2021-05-27 00:43:46  ...        T
105314  2021-05-27 00:43:46  ...        T
116555  2021-05-27 00:43:46  ...        T
120764  2021-05-27 00:43:46  ...        T
120892  2021-05-27 00:43:46  ...        T
124269  2021-05-27 00:43:46  ...        T
125707  2021-05-27 00:43:46  ...        T

[9 rows x 7 columns]
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • Excellent on line solution, other answers great too as got to know how to create new column from the dict. – neavilag May 27 '21 at 01:59
1

I suggest you should extract new column from lugar column like below code.

import pandas as pd

## data is dictionary data you uploaded ##
df = pd.DataFrame(data)

def extract_lugar_nombre(lugar) :
    try :
        return lugar["nombre"]
    except :
        ## if lugar column doesn't have nombre key or is not dict type ##
        return None
    
df["lugar_nombre"] = df["lugar"].apply(extract_lugar_nombre)

If you run this code, DataFrame has lugar_nombre column having lugar_nombre. enter image description here

And you can filtered duplicated rows.

df[df.lugar_nombre.duplicated()==False]
G.Young
  • 121
  • 2
1

Let us try

df['new'] = df['lugar'].str.get('nombre')
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Thanks it worked to then drop duplicates just got this warning: /usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead – neavilag May 27 '21 at 01:52
  • @neavilag https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas – BENY May 27 '21 at 02:00