DataFrame created with read_csv() gives unexpected query() results

Question

I'm using DataFrame.query() to find rows, and I've come across a problem that I can only replicate if the data is loaded from CSV. If I create what I think is the same DataFrame in pure Python, the query() works as expected.

This is the CSV of data:

,ASK_PRICE,ASK_QTY,BID_PRICE,BID_QTY
2016-06-17 16:38:00.043,104.258,50.0,104.253,100.0
2016-06-17 16:38:00.043,104.259,100.0,104.253,100.0
2016-06-17 16:38:02.978,104.259,100.0,104.254,50.0
2016-06-17 16:38:03.999,104.259,100.0,104.253,50.0
2016-06-17 16:38:03.999,104.259,100.0,104.251,150.0
2016-06-17 16:38:04.001,104.259,100.0,104.251,100.0

And this is an example script showing the problem:

#!/usr/bin/env python
import pandas as pd
import numpy as np
from datetime import datetime

timestamp = [
        datetime.strptime('2016-06-17 16:38:00.043', '%Y-%m-%d %H:%M:%S.%f'),
        datetime.strptime('2016-06-17 16:38:00.043', '%Y-%m-%d %H:%M:%S.%f'),
        datetime.strptime('2016-06-17 16:38:02.978', '%Y-%m-%d %H:%M:%S.%f'),
        datetime.strptime('2016-06-17 16:38:03.999', '%Y-%m-%d %H:%M:%S.%f'),
        datetime.strptime('2016-06-17 16:38:03.999', '%Y-%m-%d %H:%M:%S.%f'),
        datetime.strptime('2016-06-17 16:38:04.001', '%Y-%m-%d %H:%M:%S.%f')
        ]
bid_price = [ 104.253, 104.253, 104.254, 104.253, 104.251, 104.251 ]
bid_qty = [ 100.0, 100.0, 50.0, 50.0, 150.0, 100.0 ]
ask_price = [ 104.258, 104.259, 104.259, 104.259, 104.259, 104.259 ]
ask_qty = [ 50.0, 100.0, 100.0, 100.0, 100.0, 100.0 ]

df1 = pd.DataFrame(index=timestamp, data={'BID_PRICE': bid_price,
    'BID_QTY': bid_qty, 'ASK_PRICE': ask_price, 'ASK_QTY': ask_qty})

df2 = pd.read_csv('in.csv', index_col=0, skip_blank_lines=True)
df2.index = pd.to_datetime(df2.index)

print df1
print df2
print
print df1.index
print df2.index
print
print df1.columns
print df2.columns
print
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)

print df1
print df2
print

df1m = df1.query('(BID_PRICE == 104.254) and (BID_QTY >= 50)').tail(1)
df2m = df2.query('(BID_PRICE == 104.254) and (BID_QTY >= 50)').tail(1)
print df1m
print df2m

The query on the CSV created DataFrame fails. As far as I can see it's the same data, index and column types, what is the difference between these two DataFrames?

What does the dataframe look like in debug? Printing the dataframe may not show it since the object might have a __str __ that formats the data in a way that obscures the problem. — Marcel Wilson, Jun 24 '16 at 14:49

score 2 · Accepted Answer · edited May 23 '17 at 11:53

It's a well known problem of comparing float values

Try this:

In [70]: df2.query('(abs(BID_PRICE - 104.254) < 0.000001) and (BID_QTY >= 50)')
Out[70]:
                         ASK_PRICE  ASK_QTY  BID_PRICE  BID_QTY
2016-06-17 16:38:02.978    104.259    100.0    104.254     50.0

instead of:

In [72]: df2.query('(BID_PRICE == 104.254) and (BID_QTY >= 50)')
Out[72]:
Empty DataFrame
Columns: [ASK_PRICE, ASK_QTY, BID_PRICE, BID_QTY]
Index: []

Simple demonstration:

In [73]: 2.2 * 3.0 == 6.6
Out[73]: False

In [74]: 3.3 * 2.0 == 6.6
Out[74]: True

score 0 · Answer 2 · answered Jun 24 '16 at 15:15

I don't know the answer but it seems related to the index column. I ran a simplified version of your code and it works as expected.

#!/usr/bin/env python

import pandas as pd

timestamp = [1, 2, 3, 4, 5, 6]
bid_price = [104, 105, 106, 107, 107, 107]
bid_qty = [100.0, 100.0, 50.0, 50.0, 150.0, 100.0]

df1 = pd.DataFrame(index=timestamp,
                   data={'BID_PRICE': bid_price, 'BID_QTY': bid_qty})

df2 = pd.read_csv('in.csv', index_col=0, skip_blank_lines=True)

print(df1)
print(df2)

df1m = df1.query('(BID_PRICE == 107) and (BID_QTY >= 50)').tail(1)
df2m = df2.query('(BID_PRICE == 107) and (BID_QTY >= 50)').tail(1)

print("Result 1: {}".format(df1m))
print("Result 2: {}".format(df2m))

---------------- in.csv file contents -----------

Index,BID_PRICE,BID_QTY
1, 104, 100.0
2, 105, 100.0
3, 106, 50.0
4, 107, 50.0
5, 107, 150.0
6, 107, 100.0

DataFrame created with read_csv() gives unexpected query() results

2 Answers2