
I am trying to update a temperature time series by combining two CSV files that may contain some of the same rows.

I have tried using drop_duplicates, but the duplicate rows are still present in the combined DataFrame.

Here is an example of what I'm trying to do:

import pandas as pd
import numpy as np

from pandas import DataFrame, Series


dfA = DataFrame({'date' : Series(['1/1/10','1/2/10','1/3/10','1/4/10'], index=[0,1,2,3]),
    'a' : Series([60,57,56,50], index=[0,1,2,3]),
    'b' : Series([80,73,76,56], index=[0,1,2,3])})

print("dfA")     
print(dfA)

dfB = DataFrame({'date' : Series(['1/3/10','1/4/10','1/5/10','1/6/10'], index=[0,1,2,3]),
    'a' : Series([56,50,59,75], index=[0,1,2,3]),
    'b' : Series([76,56,73,89], index=[0,1,2,3])})

print("dfB")
print(dfB)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
dfC = pd.concat([dfA, dfB])

print(dfC.duplicated())

dfC.drop_duplicates()
print("dfC")
print(dfC)

And this is the output:

dfA
    a   b    date
0  60  80  1/1/10
1  57  73  1/2/10
2  56  76  1/3/10
3  50  56  1/4/10
dfB
    a   b    date
0  56  76  1/3/10
1  50  56  1/4/10
2  59  73  1/5/10
3  75  89  1/6/10
0    False
1    False
2    False
3    False
0     True
1     True
2    False
3    False
dtype: bool
dfC
    a   b    date
0  60  80  1/1/10
1  57  73  1/2/10
2  56  76  1/3/10
3  50  56  1/4/10
0  56  76  1/3/10
1  50  56  1/4/10
2  59  73  1/5/10
3  75  89  1/6/10

How do I update a time series with overlapping data and not have duplicates?

Alex Riley
Bill G.
  • Hey Bill: check this out http://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries – Paul H Sep 18 '14 at 18:36
  • Instead of saying "it's not working for me", it would be helpful to describe *why* it isn't working. Do you get exceptions, bad results, or no response at all? – skrrgwasme Sep 18 '14 at 18:39

1 Answer


The line dfC.drop_duplicates() does not actually change the DataFrame that dfC is bound to (it just returns a copy of it with no duplicate rows).
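A minimal sketch of that behaviour, using a small made-up frame (the name `df` and its values are illustrative, not the asker's data): the original keeps all of its rows, and only the returned object is de-duplicated.

```python
import pandas as pd

# Hypothetical frame with one repeated row (first and last rows match).
df = pd.DataFrame({'date': ['1/3/10', '1/4/10', '1/3/10'],
                   'a': [56, 50, 56],
                   'b': [76, 56, 76]})

# drop_duplicates returns a new DataFrame; df itself is left untouched.
result = df.drop_duplicates()

print(len(df))      # row count of the original
print(len(result))  # row count of the returned, de-duplicated frame
```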

You can either have dfC modified in place by passing the inplace keyword argument,

dfC.drop_duplicates(inplace=True)

or rebind the name dfC to the new, de-duplicated DataFrame that drop_duplicates returns, like this

dfC = dfC.drop_duplicates()
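Putting it together with the data from the question, the whole update might look like the sketch below. Two things here are additions rather than part of the original answer: `reset_index(drop=True)` to renumber the rows, and the `dfD` variant, which assumes you would want the newer file to win when the same date appears with different readings.

```python
import pandas as pd

dfA = pd.DataFrame({'date': ['1/1/10', '1/2/10', '1/3/10', '1/4/10'],
                    'a': [60, 57, 56, 50],
                    'b': [80, 73, 76, 56]})
dfB = pd.DataFrame({'date': ['1/3/10', '1/4/10', '1/5/10', '1/6/10'],
                    'a': [56, 50, 59, 75],
                    'b': [76, 56, 73, 89]})

# Stack the two frames (DataFrame.append was removed in pandas 2.0).
dfC = pd.concat([dfA, dfB])

# Assign the de-duplicated result back to the name, then renumber rows 0..n-1.
dfC = dfC.drop_duplicates().reset_index(drop=True)
print(dfC)

# Assumed variant: treat 'date' alone as the key and keep the later row,
# so a corrected reading in dfB would replace the one from dfA.
dfD = (pd.concat([dfA, dfB])
         .drop_duplicates(subset='date', keep='last')
         .reset_index(drop=True))
print(dfD)
```

With this sample data both variants give the same six rows, because the overlapping dates carry identical readings in both files.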
Alex Riley
  • Of course. So simple. This now removes the duplicate rows from the combined CSV files. Thank you very much. Bill – Bill G. Sep 23 '14 at 21:21
  • @BillG. Glad it was helpful! By the way, if the answer solved the problem you can tell the community by [accepting the answer](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work/5235#5235). – Alex Riley Oct 04 '14 at 11:53