Question

What is the correct or best way to query a pandas DataFrame?

Does it depend on the use case, or can you say "always use .query()" or "never use .query()"?

My primary concern is robustness or error-proof-ness of the code, but of course performance is also relevant.

In this post, the query method is said to be robust and preferable to the other methods. Do you agree? Should I always use .query()?

DataFrame.query() function in pandas is one of the robust methods to filter the rows of a pandas DataFrame object.

And it is preferable to use the DataFrame.query() function to select or filter the rows of the pandas DataFrame object instead of the traditional and the commonly used indexing method.

Background

I recently came across the .query() method and started to use it quite frequently for convenience and because I thought this was the way to do it properly.

Then I read these two posts (the content is not essential for this question, I just want to show what made me think about it):

apply, the Convenience Function you Never Needed

and

How to deal with SettingWithCopyWarning in Pandas?

In the post about SettingWithCopyWarning, different methods like .loc and .at are mentioned, but not .query(). This made me wonder whether .query() is really used. (I thought I would start a new question rather than post this in the comments.) It might simply not have been relevant for that specific problem, but it made me wonder nonetheless.

The post about "apply - the convenience function..." made me wonder whether .query() is also a convenience function you never need.

The documentation mentions the following use case:

query() Use Cases

A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you’re interested in querying
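To illustrate the documented use case, here is a minimal sketch. The frame names, column names, and values are made up for the example; the only thing that matters is that both frames share the columns the query string refers to.

```python
import pandas as pd

# Two hypothetical frames that share only the "region" and "revenue" columns.
sales_2022 = pd.DataFrame(
    {"region": ["N", "S", "E"], "revenue": [100, 250, 90], "units": [10, 20, 9]}
)
sales_2023 = pd.DataFrame(
    {"region": ["N", "S", "W"], "revenue": [300, 80, 210], "returns": [1, 0, 2]}
)

# One query string, written without naming any particular frame.
condition = "revenue > 95 and region != 'W'"

frames = {"2022": sales_2022, "2023": sales_2023}
filtered = {name: frame.query(condition) for name, frame in frames.items()}

for name, result in filtered.items():
    print(name, len(result), "matching rows")
```

With boolean indexing, the condition would have to be rebuilt for each frame because it mentions the frame by name.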

Edit: fixed the link to .apply() question.

TheRibosome

1 Answer

I don't think there is a hard answer to this question. As for whether you should always use query, the simple answer is no.

The query method uses eval behind the scenes, so the expression string has to be parsed and evaluated before any filtering happens, and that overhead can make it slower than plain boolean indexing. So when should you use query? When the condition you're filtering on is very specific and involves multiple columns. While the most pandas-esque way of filtering is using loc, there are times when chaining loc after loc gets out of hand.
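As a rough illustration of the readability point, here is a small sketch with a made-up frame (the column names a, b, c and the variable threshold are just placeholders). Both forms select the same rows:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame(
    {
        "a": [1, 5, 3, 8, 2],
        "b": [10, 20, 30, 40, 50],
        "c": ["x", "y", "x", "y", "x"],
    }
)

# Boolean indexing with .loc: every condition is parenthesised
# and repeats the frame name.
via_loc = df.loc[(df["a"] > 2) & (df["b"] < 45) & (df["c"] == "y")]

# The same filter written as a query string.
via_query = df.query("a > 2 and b < 45 and c == 'y'")

# Local Python variables can be referenced with the @ prefix.
threshold = 2
via_query_var = df.query("a > @threshold and b < 45 and c == 'y'")

print(via_loc.equals(via_query), via_query.equals(via_query_var))
```

Which version reads better is a matter of taste. The query string gets more attractive as the number of conditions grows, while .loc keeps everything as ordinary Python that linters and IDEs can check.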

The decision to use the query method should be based on readability and performance. If you're using loc over and over again, you may wish to rewrite it as a simple query string. However, if the switch makes your code less performant and you are working with mission-critical data, you should sacrifice a little readability for performance.
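If performance matters, it is probably best to measure on your own data rather than rely on a rule of thumb. A minimal timing sketch, assuming a synthetic frame of random numbers (the size, column names, and thresholds are arbitrary):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; replace with a frame that resembles your real workload.
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({"a": rng.random(n_rows), "b": rng.random(n_rows)})


def with_loc():
    # Plain boolean indexing.
    return df.loc[(df["a"] > 0.5) & (df["b"] < 0.5)]


def with_query():
    # Same filter as a query string.
    return df.query("a > 0.5 and b < 0.5")


# Sanity check: both approaches must select the same rows.
assert with_loc().equals(with_query())

loc_seconds = timeit.timeit(with_loc, number=20)
query_seconds = timeit.timeit(with_query, number=20)
print(f"loc:   {loc_seconds:.4f} s for 20 runs")
print(f"query: {query_seconds:.4f} s for 20 runs")
```

The numbers will vary with frame size, dtypes, and whether numexpr is installed, so treat the result as specific to your setup.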

J Lee