0

I am currently these datasets from the Kiva Kaggle Competition: https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding/data

I want to link a float 'MPI' value (a 'Multidimensional Poverty Index') to their corresponding geographical regions for each micro loan.

  • In one dataset kiva_mpi_region_locations.csv each region has a single corresponding MPI value associated with it.
  • However, in dataset kiva_loans.csv where each loan is given a "Region", the data often has multiple values in the same cell separated by commas (,).

['kiva_loans.csv'/Loan Data Example] (Note, Different loans can come from the same region so in this case region is a foreign key but not a primary key):

Loan #: 653338  
region: Tanjay, Negros Oriental 

[kiva_mpi_region_locations.csv/ Regional MPI value example] (Note, every region only has one MPI as region in a primary key):

region: Badakhshan

MPI: 0.387

My code so far:

RegionMPI = dict(zip(dfLocations.region, dfLocations.MPI))

{'Badakhshan': 0.387,
 'Badghis': 0.466,
 'Baghlan': 0.3,
 'Balkh': 0.301,
 'Bamyan': 0.325,
 'Daykundi': 0.313,
  etc}

LoanRegion = dfLoanTheme['region'].str.split(',').values.tolist()

 [['Lahore']
 nan,
 ['Dar es Salaam'],
 ['Liloy-Dela Paz'],
 ['Tanjay', ' Negros Oriental'],
 ['Ica'],
 nan,
['Lahore']]

Any advice on how to cycle through my nested list and then use my dictionary keys to link the corresponding value from my dictionary to my list for all occurrences of that key in my list?

smci
  • 32,567
  • 20
  • 113
  • 146
  • Your question is very hard to understand without defining terms. 'MPI' = [Multidimensional Poverty Index](http://ophi.org.uk/policy/multidimensional-poverty-index/), not 'Message-Passing Interface' – smci May 07 '18 at 04:50
  • Really this question should be called 'How to merge two dataframes', but then it should be closed as duplicate in favor of [Combine two pandas Data Frames (join on a common column)](https://stackoverflow.com/questions/18792918/combine-two-pandas-data-frames-join-on-a-common-column). You really don't want to reinvent the wheel by writing your own join code in base Python, use `pandas` package already. – smci May 07 '18 at 05:16
  • Anyway there are already tons of Kaggle kernels on that exact dataset, e.g. [A Very Extensive Kiva Exploratory Analysis ✓✓](https://www.kaggle.com/codename007/a-very-extensive-kiva-exploratory-analysis/versions) or any of the [50+ other Python kernels](https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding/kernels?sortBy=hotness&group=everyone&pageSize=20&datasetId=12414&language=Python). Just download them and copy-and-paste them to see how they work – smci May 07 '18 at 05:21

1 Answers1

0

You want to do a merge on two dataframes on the region field. pandas library makes this really easy (also performant). The code looks like this (your CSV files are behind the Kaggle registration-wall):

import pandas as pd

loans = pd.read_csv('kiva_loans.csv')
mpi_regions = pd.read_csv('kiva_mpi_region_locations.csv')

df = loans.merge(mpi_regions, on='region')

You really don't want to reinvent the wheel by writing your own join code in base Python, use pandas package already.

(Note you're assuming region is unique across countries. It might be safer to merge both on=['country','region'])

smci
  • 32,567
  • 20
  • 113
  • 146