3

I have a folder with numerous csv files which look like this:

csv1

        2006    Percent       Land_Use
    0     13   5.379564      Developed
    1      8  25.781580  Grass/Pasture
    2      4  54.265050           Crop
    3     15   0.363983          Water
    4     16   6.244104       Wetlands
    5      6   4.691764         Forest
    6      1   3.031494        Alfalfa
    7     11   0.137424      Shrubland
    8      5   0.003671          Vetch
    9      3   0.055412         Barren
    10     7   0.009531          Grass
    11    12   0.036423           Tree

csv2

   2007    Percent       Land_Use
0     13   2.742430      Developed
1      4  56.007242           Crop
2      8  24.227963  Grass/Pasture
3     16   8.839979       Wetlands
4      6   6.181062         Forest
5      1   1.446668        Alfalfa
6     15   0.366116          Water
7      3   0.127760         Barren
8     11   0.034426      Shrubland
9      7   0.000827          Grass
10    12   0.025528           Tree

csv3

    2008    Percent       Land_Use
0    13   1.863809      Developed
1     8  31.455578  Grass/Pasture
2     4  57.896856           Crop
3    16   2.693929       Wetlands
4     6   4.417966         Forest
5     1   1.239176        Alfalfa
6     7   0.130849          Grass
7    15   0.266536          Water
8    11   0.004571      Shrubland
9     3   0.030731         Barren

and I want to merge them all together into one DataFrame on Land_Use

I am reading in the files like this:

pth = (r'G:\') 
for f in os.listdir(pth):
df=pd.read_csv(os.path.join(pth,f)

but I can't figure out how to merge all the individual dataframes after that. I figured out how to concat them but that isn't what I want. The type of merge I want is outer.

If I were to use a pathway to each csv file I would merge them like this, but I do NOT want to set a pathway to each file as there are many of them:

    one=pd.read_csv(r'G:\one.csv')
    two=pd.read_csv(r'G:\two.csv')
    three=pd.read_csv(r'G:\three.csv')
    merge=pd.merge(one,two, on=['Land_Use'], how='outer')
    mergetwo=pd.merge(merge,three,on=['Land_Use'], how='outer')
Stefano Potter
  • 3,467
  • 10
  • 45
  • 82

2 Answers2

5

I think you can use in python 3:

import functools

dfs = [df1,df2,df3]

df = functools.reduce(lambda left,right: pd.merge(left,right,on='Land_Use',how='outer'),dfs)
print (df)
    2006  Percent_x       Land_Use  2007  Percent_y  2008    Percent
0     13   5.379564      Developed  13.0   2.742430  13.0   1.863809
1      8  25.781580  Grass/Pasture   8.0  24.227963   8.0  31.455578
2      4  54.265050           Crop   4.0  56.007242   4.0  57.896856
3     15   0.363983          Water  15.0   0.366116  15.0   0.266536
4     16   6.244104       Wetlands  16.0   8.839979  16.0   2.693929
5      6   4.691764         Forest   6.0   6.181062   6.0   4.417966
6      1   3.031494        Alfalfa   1.0   1.446668   1.0   1.239176
7     11   0.137424      Shrubland  11.0   0.034426  11.0   0.004571
8      5   0.003671          Vetch   NaN        NaN   NaN        NaN
9      3   0.055412         Barren   3.0   0.127760   3.0   0.030731
10     7   0.009531          Grass   7.0   0.000827   7.0   0.130849
11    12   0.036423           Tree  12.0   0.025528   NaN        NaN

In python 2:

df = reduce(lambda left,right: pd.merge(left,right,on='Land_Use',how='outer'),dfs)

Working solution with glob:

import pandas as pd
import functools
import glob

pth = 'a/*.csv'
files = glob.glob(pth)
dfs = [pd.read_csv(f, sep=';') for f in files]

df = functools.reduce(lambda left,right: pd.merge(left,right,on='Land_Use', how='outer'),dfs)
print (df)
    2006  Percent_x       Land_Use  2008  Percent_y  2007    Percent
0     13   5.379564      Developed  13.0   1.863809  13.0   2.742430
1      8  25.781580  Grass/Pasture   8.0  31.455578   8.0  24.227963
2      4  54.265050           Crop   4.0  57.896856   4.0  56.007242
3     15   0.363983          Water  15.0   0.266536  15.0   0.366116
4     16   6.244104       Wetlands  16.0   2.693929  16.0   8.839979
5      6   4.691764         Forest   6.0   4.417966   6.0   6.181062
6      1   3.031494        Alfalfa   1.0   1.239176   1.0   1.446668
7     11   0.137424      Shrubland  11.0   0.004571  11.0   0.034426
8      5   0.003671          Vetch   NaN        NaN   NaN        NaN
9      3   0.055412         Barren   3.0   0.030731   3.0   0.127760
10     7   0.009531          Grass   7.0   0.130849   7.0   0.000827
11    12   0.036423           Tree   NaN        NaN  12.0   0.025528
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • but within my loop each dataframe is called `df`. How would I assign them to `df1`, `df2`, `df3` etc.? – Stefano Potter Jun 01 '16 at 18:31
  • I know this isn't part of the question, but I actually have files that have `Land_Use` not in my example such as `Vegetation`. This method excludes that `Land_Use` because it is not in the `left` or `right` file. Is there away to include all `Land_Uses`? – Stefano Potter Jun 01 '16 at 21:39
1

I am not allowed to comment, so I am unsure of what you exactly want. You can try using one.merge(two, on=['Land_Use'], how='outer').merge(three,on=['Land_Use'], how='outer'). Let me know if you wanted something else.

If you have many dataframes, you can try using the reduce function. First create a list containing all the dataframes dataframes = [one, two, three, four, ... , twenty] You can add them into the list by using list comprehensions or by appending them into the list in your loop. Then if you want to combine them based on Land_Use, you can use df_final = reduce(lambda left,right: pd.merge(left,right,on=['Land_Use'], how='outer'), dataframes)

Note: The reduce function is in the functools package in python 3+

Nitin Kashyap
  • 184
  • 1
  • 1
  • 13