
I'm collecting time-indexed data from various files, but sometimes the files overlap:

import pandas as pd

df1 = pd.DataFrame([1, -1, -3], columns=['A'], index=pd.date_range('2000-01-01', periods=3))
df2 = pd.DataFrame([-3, 10, 1], columns=['A'], index=pd.date_range('2000-01-03', periods=3))
pd.concat([df1, df2])

            A
2000-01-01  1
2000-01-02 -1
2000-01-03 -3

             A
2000-01-03  -3
2000-01-04  10
2000-01-05   1

             A
2000-01-01   1
2000-01-02  -1
2000-01-03  -3
2000-01-03  -3
2000-01-04  10
2000-01-05   1

1) How can I clean the data and remove the duplicate rows? (here 2000-01-03)

2) More generally, is there a faster / cleverer way with pandas to read and merge multiple CSV files than doing it manually:

import glob
import pandas as pd

L = []
for f in glob.glob('*.csv'):
    L.append(pd.read_csv(f, ...))
fulldata = pd.concat(L)                   # this can be time consuming
fulldata.remove_duplicate_lines()         # pseudocode; this can be time consuming too
Basj
  • You can use `pd.concat(L, axis=1)` – EdChum Dec 10 '15 at 09:23
  • @EdChum I tried with `axis=1`, but I suddenly get a 3-column table (index + A + A again) and lots of NaN values. Any idea for 2)? – Basj Dec 10 '15 at 09:26
  • Sorry, try just `pd.concat(L)`; if it gives an issue, can you post the output from that? – EdChum Dec 10 '15 at 09:27
  • @EdChum I did: see my question, the third output is `pd.concat([df1, df2])` – Basj Dec 10 '15 at 09:29
  • You could do `pd.concat([df1, df2]).drop_duplicates()`. Are you looking for a solution with only one command? – Anton Protopopov Dec 10 '15 at 09:31
  • @AntonProtopopov No, `drop_duplicates` doesn't work: it would also delete 2000-01-05 in my (updated) question! It looks for duplicates in the *values*, whereas I'm speaking about duplicates in the *index*. – Basj Dec 10 '15 at 10:15

2 Answers


IIUC you could do `pd.concat` and then `drop_duplicates`:

In [104]: pd.concat([df1, df2]).drop_duplicates()
Out[104]:
             A
2000-01-01   1
2000-01-02  -1
2000-01-03  -3
2000-01-04  10

EDIT

You are right, that method isn't working properly because it drops duplicates by value, not by index. To drop duplicates by index you could use `Index.duplicated`:

df = pd.concat([df1, df2])
df[~df.index.duplicated()]

In [107]: df[~df.index.duplicated()]
Out[107]: 
             A
2000-01-01   1
2000-01-02  -1
2000-01-03  -3
2000-01-04  10
2000-01-05   1
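
By default this keeps the first occurrence of each index value. If rows from later files should win instead (an assumption about your use case), `Index.duplicated` takes a `keep` argument:

df[~df.index.duplicated(keep='last')]   # keep the last occurrence of each date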

Or you could use the first method with a modification: first `reset_index`, then `drop_duplicates` with the index values as the `subset` key, and finally `set_index` again:

 pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index').set_index('index')

In [118]: pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index').set_index('index')
Out[118]: 
             A
index         
2000-01-01   1
2000-01-02  -1
2000-01-03  -3
2000-01-04  10
2000-01-05   1
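
For your question #2, here is a minimal sketch combining the glob loop with the index-based dedup above; the `index_col=0, parse_dates=True` options are assumptions about your file layout, so adapt them to your CSVs:

import glob
import pandas as pd

# read every file, stack them, then drop rows whose date was already seen
frames = (pd.read_csv(f, index_col=0, parse_dates=True) for f in glob.glob('*.csv'))
fulldata = pd.concat(frames)
fulldata = fulldata[~fulldata.index.duplicated()]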
Anton Protopopov
  • Thanks! For question #2, would you do a loop over all .csv files with `glob.glob('*.csv')`, or is there a simple way with pandas to load data from multiple files? – Basj Dec 10 '15 at 09:35
  • @Basj I don't think you can do that without looping. See the answer to [that](http://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-python-pandas-and-concatenate-into-one-dataframe#21232849) question – Anton Protopopov Dec 10 '15 at 09:37
  • Warning! I just noticed `drop_duplicates` doesn't work: it would also delete 2000-01-05 in my (updated) question! It looks for duplicates in the *values*, whereas I'm speaking about duplicates in the *index*. – Basj Dec 10 '15 at 10:15

If you're feeling adventurous and decide to use something other than Pandas to combine CSVs, and you're on a machine with Awk, you can combine the files and remove duplicates with this single command (note it deduplicates whole lines, i.e. the date and the value together must match):

awk '!arr[$0]++' /path/to/your/files/* > combined_no_dups.csv

And then you could load it into pandas...

df = pd.read_csv("combined_no_dups.csv")
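
If your files keep the date in the first column (an assumption based on your example data), you can rebuild the DatetimeIndex while loading:

df = pd.read_csv("combined_no_dups.csv", index_col=0, parse_dates=True)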
ComputerFellow