1

I have a large CSV with which has several columns and one of the column is address column. In another CSV the address with latitude and longitude are given. I want to match the address and pull the latitude and longitude and write in my first large csv.

I did try using pandas join and merge but they merge with index column and I need to search the address in the address csv and then pull the lat-lon to large csv and create two new columns.

I found some similar question but that is Ruby :Here: I want to work with Python and Pandas

Large.csv 

col1, col2, col3, col4, col5, col6, addr
234, 453,34535,342,634,636, Ken street
562, 345,6753,835,864,967,St Pauls

addr.csv 

addr,lat,lon
baker street, lat1, lon1
ken street, lat2,lon2
paul street, lat3, lon3
St Pauls, lat4, lon4

Ouput.csv 
col1, col2, col3, col4, col5, col6, addr, lat, lon
234, 453,34535,342,634,636, Ken street,lat2,lon2
562, 345,6753,835,864,967,St Pauls,lat4,lon4

The input file is too large to load in an Excel sheet and make a Vlookup What could be the simplest way to achieve the output.

I am using python 3.x version and Updated version of Pandas Every Help is appreciated Thanks in advance

Community
  • 1
  • 1
Phani.lav
  • 153
  • 1
  • 1
  • 10
  • Possible duplicate: http://stackoverflow.com/questions/20414562/python-joining-csv-files-where-key-is-first-column-value – moritzg May 08 '17 at 09:28

2 Answers2

2

This is how you generally do it, with addr data fixed:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

large = StringIO('''col1, col2, col3, col4, col5, col6, addr
234, 453,34535,342,634,636, Ken street
562, 345,6753,835,864,967,St Pauls''')

addr = StringIO('''addr,lat,lon
baker street, lat1, lon1
ken street, lat2,lon2
paul street, lat3, lon3
St Pauls, lat4, lon4''')

df_large = pd.read_csv(large, sep=',')
df_large.columns = df_large.columns.str.strip()
df_large.addr = df_large.addr.apply(lambda x: x.lower().strip())

df_addr = pd.read_csv(addr, sep=',')

df_addr.addr = df_addr.addr.apply(lambda x: x.lower().strip())

df_large = pd.merge(df_large, df_addr, how='left', on='addr')
zipa
  • 27,316
  • 6
  • 40
  • 58
1

You can do inner merge of 2 data frames. And then df.to_csv('Ouput.csv') to write the output data frame to a csv file.

>>> df1
   col1  col2   col3  col4  col5  col6        addr
0   234   453  34535   342   634   636  Ken street
1   562   345   6753   835   864   967    St Pauls
>>> df2
           addr   lat   lon
0  baker street  lat1  lon1
1    Ken street  lat2  lon2
2   paul street  lat3  lon3
3      St Pauls  lat4  lon4
>>> df = pd.merge(df1, df2, how='inner', on='addr')
>>> df
   col1  col2   col3  col4  col5  col6        addr   lat   lon
0   234   453  34535   342   634   636  Ken street  lat2  lon2
1   562   345   6753   835   864   967    St Pauls  lat4  lon4
Nilanjan
  • 176
  • 8