Delete rows not present in all DataFrame in pandas

Question

Background

There is 5 DataFrames, I will call them b1, b2, b3, b4, b5.

They have data structure which consists of the columns ['Date', 'Value']

I got the data from 2014 to 2015.

Problem

Every DataFrame has different date counting system. So, I want to get rid of the rows in the DataFrames which have non-matching dates.

How can I do that?

What I have tried

So, I will delete all the data if Date of the data is not included in all DataFrames: b1, b2, b3, b4, b5

This works for the first time

for i in range(len(b2.index)):
    k = 0
    for j in range(len(b1.index)):
        if b2['Date'][i] == b1['Date'][j]:
            k = k+1
        else:
            k = k
    if k == 1:
        pass
    if k == 0:
        b2 = b2.drop([i])

But, after that I excute this code one more time after I did that, there would be some error like this:

KeyError Traceback (most recent call last) in () 2k = 0 3for j in range(len(b2.index)): ----> 4 if b1['Date'][i] == b2['Date'][j]: 5 k = k+1 6 else:

C:\Users\cms\Anaconda\lib\site-packages\pandas\core\series.pyc in getitem(self, key) 519def getitem(self, key): 520 try: --> 521 result = self.index.get_value(self, key) 522 523 if not np.isscalar(result):

C:\Users\cms\Anaconda\lib\site-packages\pandas\core\index.pyc in get_value(self, series, key) 1593 1594 try: -> 1595 return self._engine.get_value(s, k) 1596 except KeyError as e1: 1597 if len(self) > 0 and self.inferred_type in ['integer','boolean']:

pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3113)()

pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:2844)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3704)()

pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7224)()

pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7162)()

KeyError: 28L

What I want to do is

mlist = (b1,b2,b3,b4,b5)
for q in mlist:
    for r in mlist:
        for i in range(len(q.index)):
            k = 0
            for j in range(len(r.index)):
                if q['Date'][i] == r['Date'][j]:
                    k = k+1
                else:
                    k = k
            if k == 1:
                pass
            if k == 0:
                q = q.drop([i])`enter code here`

```q.drop``` takes an index value as arg, which is not necessarily an integer index. Try q.drop(q.index[i]). Also, there is no need for k, simply do ```if q['Date'][i] == r['Date'][j]: continue``` and ```else: q = q.drop(q.index[i])``` — dermen, Jul 31 '15 at 18:35

score 0 · Answer 1 · answered Jul 29 '15 at 05:19

0

How about this? Use set.intersecton() :

s = set(b1.index)
for b in [b2,b3,b4,b5]:
    s=s.intersection(set(b.index))
for b in [b1,b2,b3,b4,b5]:
    b=b.drop(b.index[-b.index.isin(s)],inplace=True)

answered Jul 29 '15 at 05:19

Jihun

1,415
1
12
16

It kind of works, but, I want to make it to have the same index for every index. For example, b1, b2, b3, b4, b5 need to have 2014-07-30 in index #5, 2014-07-31 in #6. – Seong NohYoon Jul 29 '15 at 06:40
Could you provide example input and desired output? – Jihun Jul 29 '15 at 07:03
Can you please help me? if you give me your email or something, I will give you my data. Actually, the code below works well, but, It returns 'empty' DataFrame, not knowing why. – Seong NohYoon Jul 31 '15 at 07:45
@SeongNohYoon Why don't you just write what went wrong when you tried this solution, instead of asking three different people for email support? – firelynx Jul 31 '15 at 07:55

firelynx · Answer 2 · 2015-07-31T08:44:33.897

You want to keep the dates that are present in all datasets.

This can easily be done by performing an inner join on all of them using the pandas.merge() function.

b = b1.merge(on='Date', right=b2, how='inner', suffixes=['', '_b2'])
b = b.merge(on='Date', right=b3, how='inner', suffixes=['', '_b3'])
b = b.merge(on='Date', right=b4, how='inner', suffixes=['', '_b4'])
b = b.merge(on='Date', right=b5, how='inner', suffixes=['_b1', '_b5'])

I am making some assumptions about your data here since you did not post any example data yet.

If I got anything wrong about your data, please let me know and I will correct my example.

Example:

b1 = pd.DataFrame({'Date':pd.date_range('2015-05-05', '2015-05-10'), 'Value':range(1,7)})
        Date  Value
0 2015-05-05      1
1 2015-05-06      2
2 2015-05-07      3
3 2015-05-08      4
4 2015-05-09      5
5 2015-05-10      6

b2 = pd.DataFrame({'Date':pd.date_range('2015-05-07', '2015-05-12'), 'Value':range(4,10)})
        Date  Value
0 2015-05-05      1
1 2015-05-06      2
2 2015-05-07      3
3 2015-05-08      4
4 2015-05-09      5
5 2015-05-10      6

b = b1.merge(on='Date', right=b2, how='inner', suffixes=['_b1', '_b2'])
        Date  Value_b1  Value_b2
0 2015-05-07         3         4
1 2015-05-08         4         5
2 2015-05-09         5         6
3 2015-05-10         6         7

Can you please help me? if you give me your email or something, I will give you my data. Actually, the code below works well, but, It returns 'empty' DataFrame, not knowing why. — Seong NohYoon, Jul 31 '15 at 07:46
@SeongNohYoon Why don't you just write what went wrong when you tried this solution, instead of asking three different people for email support? — firelynx, Jul 31 '15 at 07:55
@SeongNohYoon, sorry, syntax error from my side. It is supposed to be the merge function, not the join function. It should work now. — firelynx, Jul 31 '15 at 08:37

score 0 · Answer 3 · edited May 23 '17 at 11:43

0

From this posting you can see an answer

merged_mlist = reduce(lambda left,right: pandas.merge(left,right,on='Date', how='inner'), mlist)

I believe the reason your code is failing is because you are not resetting the index after dropping. I think if you want to do this manually, you can do something like store the indices first and then drop all at once

inds_to_drop = []
for i in range(len(b2)):
    for j in range(len(b1)):
        if b2['Date'][i] != b1['Date'][j]:
            inds_to_drop.append(i)
b2.drop( b2.index[inds_to_drop], inplace=True)    
b2.reset_index( drop=True,inplace=True) # this may or may not be necessary, havent thought all the way through

Better yet, if you still want to implement your for loop

inds_to_drop = [ i for i,dt in enumerate(b2.Date) if not pandas.np.where( b1.Date==dt)[0].size ]

You can do that for each iteration.. Though it is just easier to run the reduce function paired with pandas.merge

edited May 23 '17 at 11:43

Community

1
1

answered Jul 29 '15 at 09:12

dermen

5,252
4
23
34

Can you please help me? if you give me your email or something, I will give you my data. Actually, the code below works well, but, It returns 'empty' DataFrame, not knowing why. – Seong NohYoon Jul 31 '15 at 07:46
@SeongNohYoon Why don't you just write what went wrong when you tried this solution, instead of asking three different people for email support? – firelynx Jul 31 '15 at 07:55
what is the bad_inds..? – Seong NohYoon Jul 31 '15 at 08:21
there might have been a typo in that portion of the code.. I edited it.. bad_inds was supposed to be ```inds_to_drop```.. but really I think you should try the first part: ```reduce(lambda left,right: pandas.merge(left,right,on='Date', how='inner'), mlist)``` where ```mlist=(b1,b2,b3,b4,b5)``` – dermen Jul 31 '15 at 18:32

Delete rows not present in all DataFrame in pandas

Background

Problem

What I have tried

3 Answers3