1

I'm using DataFrame.query() to find rows, and I've come across a problem that I can only replicate if the data is loaded from CSV. If I create what I think is the same DataFrame in pure Python, the query() works as expected.

This is the CSV of data:

,ASK_PRICE,ASK_QTY,BID_PRICE,BID_QTY
2016-06-17 16:38:00.043,104.258,50.0,104.253,100.0
2016-06-17 16:38:00.043,104.259,100.0,104.253,100.0
2016-06-17 16:38:02.978,104.259,100.0,104.254,50.0
2016-06-17 16:38:03.999,104.259,100.0,104.253,50.0
2016-06-17 16:38:03.999,104.259,100.0,104.251,150.0
2016-06-17 16:38:04.001,104.259,100.0,104.251,100.0

And this is an example script showing the problem:

#!/usr/bin/env python
import pandas as pd
import numpy as np
from datetime import datetime

timestamp = [
        datetime.strptime('2016-06-17 16:38:00.043', '%Y-%m-%d %H:%M:%S.%f'),
        datetime.strptime('2016-06-17 16:38:00.043', '%Y-%m-%d %H:%M:%S.%f'),
        datetime.strptime('2016-06-17 16:38:02.978', '%Y-%m-%d %H:%M:%S.%f'),
        datetime.strptime('2016-06-17 16:38:03.999', '%Y-%m-%d %H:%M:%S.%f'),
        datetime.strptime('2016-06-17 16:38:03.999', '%Y-%m-%d %H:%M:%S.%f'),
        datetime.strptime('2016-06-17 16:38:04.001', '%Y-%m-%d %H:%M:%S.%f')
        ]
bid_price = [ 104.253, 104.253, 104.254, 104.253, 104.251, 104.251 ]
bid_qty = [ 100.0, 100.0, 50.0, 50.0, 150.0, 100.0 ]
ask_price = [ 104.258, 104.259, 104.259, 104.259, 104.259, 104.259 ]
ask_qty = [ 50.0, 100.0, 100.0, 100.0, 100.0, 100.0 ]

df1 = pd.DataFrame(index=timestamp, data={'BID_PRICE': bid_price,
    'BID_QTY': bid_qty, 'ASK_PRICE': ask_price, 'ASK_QTY': ask_qty})

df2 = pd.read_csv('in.csv', index_col=0, skip_blank_lines=True)
df2.index = pd.to_datetime(df2.index)

print df1
print df2
print
print df1.index
print df2.index
print
print df1.columns
print df2.columns
print
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)

print df1
print df2
print

df1m = df1.query('(BID_PRICE == 104.254) and (BID_QTY >= 50)').tail(1)
df2m = df2.query('(BID_PRICE == 104.254) and (BID_QTY >= 50)').tail(1)
print df1m
print df2m

The query on the CSV created DataFrame fails. As far as I can see it's the same data, index and column types, what is the difference between these two DataFrames?

Luke Bigum
  • 110
  • 1
  • 7
  • What does the dataframe look like in debug? Printing the dataframe may not show it since the object might have a __str __ that formats the data in a way that obscures the problem. – Marcel Wilson Jun 24 '16 at 14:49

2 Answers2

2

It's a well known problem of comparing float values

Try this:

In [70]: df2.query('(abs(BID_PRICE - 104.254) < 0.000001) and (BID_QTY >= 50)')
Out[70]:
                         ASK_PRICE  ASK_QTY  BID_PRICE  BID_QTY
2016-06-17 16:38:02.978    104.259    100.0    104.254     50.0

instead of:

In [72]: df2.query('(BID_PRICE == 104.254) and (BID_QTY >= 50)')
Out[72]:
Empty DataFrame
Columns: [ASK_PRICE, ASK_QTY, BID_PRICE, BID_QTY]
Index: []

Simple demonstration:

In [73]: 2.2 * 3.0 == 6.6
Out[73]: False

In [74]: 3.3 * 2.0 == 6.6
Out[74]: True
Community
  • 1
  • 1
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
0

I don't know the answer but it seems related to the index column. I ran a simplified version of your code and it works as expected.

#!/usr/bin/env python

import pandas as pd

timestamp = [1, 2, 3, 4, 5, 6]
bid_price = [104, 105, 106, 107, 107, 107]
bid_qty = [100.0, 100.0, 50.0, 50.0, 150.0, 100.0]

df1 = pd.DataFrame(index=timestamp,
                   data={'BID_PRICE': bid_price, 'BID_QTY': bid_qty})

df2 = pd.read_csv('in.csv', index_col=0, skip_blank_lines=True)

print(df1)
print(df2)

df1m = df1.query('(BID_PRICE == 107) and (BID_QTY >= 50)').tail(1)
df2m = df2.query('(BID_PRICE == 107) and (BID_QTY >= 50)').tail(1)

print("Result 1: {}".format(df1m))
print("Result 2: {}".format(df2m))

---------------- in.csv file contents -----------

Index,BID_PRICE,BID_QTY
1, 104, 100.0
2, 105, 100.0
3, 106, 50.0
4, 107, 50.0
5, 107, 150.0
6, 107, 100.0
Andrei Sura
  • 2,465
  • 1
  • 20
  • 15