1

I'm trying to merge many (a few thousand one column tsv files) data frames into a single csv file using pandas. I'm new to pandas (and python for that matter) and could use some input or direction.

My data frames are observational data on a list scraped from the web and do not contain headers. For example:

data frame 1:

bluebird    34
chickadee   168
eagle       10
hawk        67
sparrow     2

data frame 2:

albatross    56
bluebird     78
hawk         3
pelican      19
sparrow      178

I'm looking to do is simply create a master file with all of the individual observations:

albatross     0     56
bluebird      34    78
chickadee     168   0
eagle         10    0
hawk          67    3
pelican       0     19
sparrow       2     178

I've tried to merge the data frames one at a time using pandas:

import pandas as pd

df1 = pd.read_table("~/home/birds1.tsv", sep='\t')
df2 = pd.read_table("~/home/birds2.tsv", sep='\t')
merged = df1.merge(df1, df2, how="left").fillna("0")
merged.to_csv("merged.csv", index=False)

but I am only getting one column. I don't have a master list of "birds", but I can concatenate all the data and sort on unique names for a dictionary list if this is needed.

What should my strategy be for merging a few thousand files?

Josh Herr
  • 125
  • 1
  • 7
  • What are the names of the indexes and columns? – Simeon Visser Nov 05 '14 at 21:06
  • The data frames as provided to me do not have indexes or column names. I suppose I could add them to the existing data but that would add a step -- fairly easily at the command line. Thanks Simeon! – Josh Herr Nov 14 '14 at 18:33

2 Answers2

2

Look at the docs for merge, when called from a frame, the first parameter is the 'other' frame, and the second is which variables you want to merge on (not actually sure what happens when you pass a DataFrame).

But, assuming your bird column is called 'bird', what you probably want is:

In [412]: df1.merge(df2, on='bird', how='outer').fillna(0)
Out[412]: 
        bird  value_x  value_y
0   bluebird       34       78
1  chickadee      168        0
2      eagle       10        0
3       hawk       67        3
4    sparrow        2      178
5  albatross        0       56
6    pelican        0       19
chrisb
  • 49,833
  • 8
  • 70
  • 70
1

I would think the fastest way is to set the column you want to merge on to the index, create a list of the dataframes and then pd.concat them. Something like this:

import os
import pandas as pd
directory = os.path.expanduser('~/home')
files = os.path.listdir(directory)
dfs = []
for filename in files:
    if '.tsv' in file:
        df = pd.read_table(os.path.join(directory,filename),sep='\t').set_index('bird')
        dfs.append(df)
master_df = pd.concat(dfs,axis=1)
exp1orer
  • 11,481
  • 7
  • 38
  • 51