1

Background

There is 5 DataFrames, I will call them b1, b2, b3, b4, b5.

They have data structure which consists of the columns ['Date', 'Value']

I got the data from 2014 to 2015.

Problem

Every DataFrame has different date counting system. So, I want to get rid of the rows in the DataFrames which have non-matching dates.

How can I do that?

What I have tried

So, I will delete all the data if Date of the data is not included in all DataFrames: b1, b2, b3, b4, b5

This works for the first time

for i in range(len(b2.index)):
    k = 0
    for j in range(len(b1.index)):
        if b2['Date'][i] == b1['Date'][j]:
            k = k+1
        else:
            k = k
    if k == 1:
        pass
    if k == 0:
        b2 = b2.drop([i])

But, after that I excute this code one more time after I did that, there would be some error like this:

KeyError Traceback (most recent call last) in () 2k = 0 3for j in range(len(b2.index)): ----> 4 if b1['Date'][i] == b2['Date'][j]: 5 k = k+1 6 else:

C:\Users\cms\Anaconda\lib\site-packages\pandas\core\series.pyc in getitem(self, key) 519def getitem(self, key): 520 try: --> 521 result = self.index.get_value(self, key) 522 523 if not np.isscalar(result):

C:\Users\cms\Anaconda\lib\site-packages\pandas\core\index.pyc in get_value(self, series, key) 1593 1594 try: -> 1595 return self._engine.get_value(s, k) 1596 except KeyError as e1: 1597 if len(self) > 0 and self.inferred_type in ['integer','boolean']:

pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3113)()

pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:2844)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3704)()

pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7224)()

pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7162)()

KeyError: 28L

What I want to do is

mlist = (b1,b2,b3,b4,b5)
for q in mlist:
    for r in mlist:
        for i in range(len(q.index)):
            k = 0
            for j in range(len(r.index)):
                if q['Date'][i] == r['Date'][j]:
                    k = k+1
                else:
                    k = k
            if k == 1:
                pass
            if k == 0:
                q = q.drop([i])`enter code here`
firelynx
  • 30,616
  • 9
  • 91
  • 101
  • ```q.drop``` takes an index value as arg, which is not necessarily an integer index. Try q.drop(q.index[i]). Also, there is no need for k, simply do ```if q['Date'][i] == r['Date'][j]: continue``` and ```else: q = q.drop(q.index[i])``` – dermen Jul 31 '15 at 18:35

3 Answers3

0

How about this? Use set.intersecton() :

s = set(b1.index)
for b in [b2,b3,b4,b5]:
    s=s.intersection(set(b.index))
for b in [b1,b2,b3,b4,b5]:
    b=b.drop(b.index[-b.index.isin(s)],inplace=True)
Jihun
  • 1,415
  • 1
  • 12
  • 16
  • It kind of works, but, I want to make it to have the same index for every index. For example, b1, b2, b3, b4, b5 need to have 2014-07-30 in index #5, 2014-07-31 in #6. – Seong NohYoon Jul 29 '15 at 06:40
  • Could you provide example input and desired output? – Jihun Jul 29 '15 at 07:03
  • Can you please help me? if you give me your email or something, I will give you my data. Actually, the code below works well, but, It returns 'empty' DataFrame, not knowing why. – Seong NohYoon Jul 31 '15 at 07:45
  • @SeongNohYoon Why don't you just write what went wrong when you tried this solution, instead of asking three different people for email support? – firelynx Jul 31 '15 at 07:55
0

You want to keep the dates that are present in all datasets.

This can easily be done by performing an inner join on all of them using the pandas.merge() function.

b = b1.merge(on='Date', right=b2, how='inner', suffixes=['', '_b2'])
b = b.merge(on='Date', right=b3, how='inner', suffixes=['', '_b3'])
b = b.merge(on='Date', right=b4, how='inner', suffixes=['', '_b4'])
b = b.merge(on='Date', right=b5, how='inner', suffixes=['_b1', '_b5'])

I am making some assumptions about your data here since you did not post any example data yet.

If I got anything wrong about your data, please let me know and I will correct my example.

Example:

b1 = pd.DataFrame({'Date':pd.date_range('2015-05-05', '2015-05-10'), 'Value':range(1,7)})
        Date  Value
0 2015-05-05      1
1 2015-05-06      2
2 2015-05-07      3
3 2015-05-08      4
4 2015-05-09      5
5 2015-05-10      6

b2 = pd.DataFrame({'Date':pd.date_range('2015-05-07', '2015-05-12'), 'Value':range(4,10)})
        Date  Value
0 2015-05-05      1
1 2015-05-06      2
2 2015-05-07      3
3 2015-05-08      4
4 2015-05-09      5
5 2015-05-10      6

b = b1.merge(on='Date', right=b2, how='inner', suffixes=['_b1', '_b2'])
        Date  Value_b1  Value_b2
0 2015-05-07         3         4
1 2015-05-08         4         5
2 2015-05-09         5         6
3 2015-05-10         6         7
firelynx
  • 30,616
  • 9
  • 91
  • 101
  • Can you please help me? if you give me your email or something, I will give you my data. Actually, the code below works well, but, It returns 'empty' DataFrame, not knowing why. – Seong NohYoon Jul 31 '15 at 07:46
  • @SeongNohYoon Why don't you just write what went wrong when you tried this solution, instead of asking three different people for email support? – firelynx Jul 31 '15 at 07:55
  • First of all, is there 'right' argument in join function? – Seong NohYoon Jul 31 '15 at 08:16
  • @SeongNohYoon, sorry, syntax error from my side. It is supposed to be the merge function, not the join function. It should work now. – firelynx Jul 31 '15 at 08:37
0

From this posting you can see an answer

merged_mlist = reduce(lambda left,right: pandas.merge(left,right,on='Date', how='inner'), mlist)

I believe the reason your code is failing is because you are not resetting the index after dropping. I think if you want to do this manually, you can do something like store the indices first and then drop all at once

inds_to_drop = []
for i in range(len(b2)):
    for j in range(len(b1)):
        if b2['Date'][i] != b1['Date'][j]:
            inds_to_drop.append(i)
b2.drop( b2.index[inds_to_drop], inplace=True)    
b2.reset_index( drop=True,inplace=True) # this may or may not be necessary, havent thought all the way through

Better yet, if you still want to implement your for loop

inds_to_drop = [ i for i,dt in enumerate(b2.Date) if not pandas.np.where( b1.Date==dt)[0].size ] 

You can do that for each iteration.. Though it is just easier to run the reduce function paired with pandas.merge

Community
  • 1
  • 1
dermen
  • 5,252
  • 4
  • 23
  • 34
  • Can you please help me? if you give me your email or something, I will give you my data. Actually, the code below works well, but, It returns 'empty' DataFrame, not knowing why. – Seong NohYoon Jul 31 '15 at 07:46
  • @SeongNohYoon Why don't you just write what went wrong when you tried this solution, instead of asking three different people for email support? – firelynx Jul 31 '15 at 07:55
  • what is the bad_inds..? – Seong NohYoon Jul 31 '15 at 08:21
  • there might have been a typo in that portion of the code.. I edited it.. bad_inds was supposed to be ```inds_to_drop```.. but really I think you should try the first part: ```reduce(lambda left,right: pandas.merge(left,right,on='Date', how='inner'), mlist)``` where ```mlist=(b1,b2,b3,b4,b5)``` – dermen Jul 31 '15 at 18:32