66

I am transitioning from R to Python. I just began using Pandas. I have some R code that subsets nicely:

k1 <- subset(data, Product == p.id & Month < mn & Year == yr, select = c(Time, Product))

Now, I want to do similar stuff in Python. This is what I have got so far:

import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")


# first, index the dataset by Product, and get all rows that match a given 'p.id' and time
data.set_index('Product')
k = data.ix[[p.id, 'Time']]

# then, index this subset by Time and do more subsetting...

I am beginning to feel that I am doing this the wrong way; perhaps there is an elegant solution. Can anyone help? I need to extract the month and year from the timestamp I have and do the subsetting. Perhaps there is a one-liner that will accomplish all of this:

k1 <- subset(data, Product == p.id & Time >= start_time & Time < end_time, select = c(Time, Product))

Thanks.

cs95
user1717931
  • `df.query` and `pd.eval` seem like good fits for this use case. For information on the `pd.eval()` family of functions, their features and use cases, please visit [Dynamic Expression Evaluation in pandas using pd.eval()](https://stackoverflow.com/questions/53779986/dynamic-expression-evaluation-in-pandas-using-pd-eval). – cs95 Dec 16 '18 at 04:54

5 Answers

98

I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that other variables are scalar values:

For now, you'll have to reference the DataFrame instance:

k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]

The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators: & is actually an overloaded bitwise operator, and bitwise operators bind more tightly than comparison operators, so without the parentheses the comparisons would not be evaluated first.
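A minimal runnable sketch of why the parentheses matter (the data here is invented):

```python
import pandas as pd

# Toy frame standing in for the real data
df = pd.DataFrame({'Product': [1, 1, 2], 'Time': [5, 15, 15]})

# Without parentheses, `df.Product == 1 & df.Time > 10` would be parsed as
# `df.Product == (1 & df.Time) > 10` and raise an error.
# With parentheses, each comparison is evaluated first, then combined:
mask = (df.Product == 1) & (df.Time > 10)
result = df.loc[mask, ['Time', 'Product']]
print(result)
```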

In pandas 0.13 a new experimental DataFrame.query() method will be available. It's extremely similar to subset modulo the select argument:

With query() you'd do it like this (local Python variables are referenced with an @ prefix inside the expression):

df[['Time', 'Product']].query('Product == @p_id and Month < @mn and Year == @yr')

Here's a simple example:

In [8]: import numpy as np; from pandas import DataFrame; from numpy.random import poisson

In [9]: df = DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': poisson(100, size=10)})

In [10]: df
Out[10]:
  gender  price
0      m     89
1      f    123
2      f    100
3      m    104
4      m     98
5      m    103
6      f    100
7      f    109
8      f     95
9      m     87

In [11]: df.query('gender == "m" and price < 100')
Out[11]:
  gender  price
0      m     89
4      m     98
9      m     87

The final query that you're interested in will even be able to take advantage of chained comparisons, like this:

k1 = df[['Time', 'Product']].query('Product == @p_id and @start_time <= Time < @end_time')
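Putting that together, here's a self-contained sketch (data and variable values invented; note the @-prefix for local Python variables inside query()):

```python
import pandas as pd

# Toy stand-in for the real frame
df = pd.DataFrame({
    'Product': [1, 1, 1, 2],
    'Time': [5, 10, 20, 10],
})

p_id, start_time, end_time = 1, 8, 15

# Chained comparison plus @-prefixed local variables inside query()
k1 = df[['Time', 'Product']].query('Product == @p_id and @start_time <= Time < @end_time')
print(k1)
```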
Franck Dernoncourt
Phillip Cloud
    Thanks Philip. It works well. This is what I was looking for - a simple, quick solution. Many thanks again. For those searching for such solution, the time I used is like this: (data.ts >= '2012-10-01') & (data.ts < '2013-05-01') . – user1717931 Oct 08 '13 at 03:02
  • @Philip, I tried your suggestion from IPython with concrete values in my conditions. They worked fine. But when I embed the same in a program and call it with parameters, I get an error; the final lines are: `File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 225, in wrapper if len(self) != len(other): TypeError: len() of unsized object` – user1717931 Oct 08 '13 at 15:55
  • More about the above (ill-formatted) error message: I checked my data set and made sure there are no NaNs or NAs, and nothing is empty. Not sure why a call to my function errors out. I have my function like this: `def get_data(data, p_id, start_time, end_time): test_data = data.loc[(data.product == p_id) & (data.ts >= start_time) & (data.ts < end_time), ['product', 'ts']]` – user1717931 Oct 08 '13 at 16:01
  • please ignore that error. It was error coming out of converting Python array to numpy array - and finding the mean of a column resulted in some 'dtype' error. – user1717931 Oct 08 '13 at 17:47
20

Just for anyone looking for a solution more similar to R:

df[(df.Product == p_id) & (df.Time > start_time) & (df.Time < end_time)][['Time', 'Product']]

No need for data.loc or query, but I do think it is a bit long.
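A self-contained sketch of this pattern, with invented data and column names:

```python
import pandas as pd

# Toy data standing in for the real frame
df = pd.DataFrame({
    'Product': [1, 1, 2],
    'Time': [5, 12, 12],
})

p_id, start_time, end_time = 1, 10, 20

# R-like: row filter first, column selection second
subset = df[(df.Product == p_id) & (df.Time > start_time) & (df.Time < end_time)][['Time', 'Product']]
print(subset)
```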

ali_m
sernle
15

I've found that you can use any subset condition for a given column by wrapping it in []. For instance, suppose you have a df with columns ['Product', 'Time', 'Year', 'Color'].

And let's say you want to include products made before 2014. You could write,

df[df['Year'] < 2014]

To return all the rows where this is the case. You can add different conditions.

df[df['Year'] < 2014][df['Color'] == 'Red']

Then just choose the columns you want, as directed above. For instance, the product key and color for the df above:

df[df['Year'] < 2014][df['Color'] == 'Red'][['Product','Color']]
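A runnable sketch of this (data invented). Note that pandas will reindex the second boolean mask to the already-filtered frame and emit a UserWarning; combining the masks with & in a single step avoids that:

```python
import pandas as pd

# Invented toy data
df = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Year': [2012, 2013, 2015],
    'Color': ['Red', 'Red', 'Blue'],
})

# Single-step equivalent of the chained filters, then column selection
combined = df[(df['Year'] < 2014) & (df['Color'] == 'Red')][['Product', 'Color']]
print(combined)
```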
gpicard
0

Regarding some points mentioned in previous answers, and to improve readability:

No need for data.loc or query, but I do think it is a bit long.

The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators.

I like to write such expressions as follows - fewer brackets, faster to type, easier to read. Closer to R, too.

# c is just a convenience to mimic R's c(): it splits 'Time,Product' into ['Time', 'Product']
c = lambda v: v.split(',')

q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time

df.loc[q_product & q_start & q_end, c('Time,Product')]
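A runnable version of this named-masks pattern, with invented data:

```python
import pandas as pd

# Toy data and scalar values, all invented
df = pd.DataFrame({'Product': [1, 1, 2], 'Time': [5, 12, 12]})
p_id, start_time, end_time = 1, 10, 20

# Convenience helper mimicking R's c()
c = lambda v: v.split(',')

# Name each condition once, then combine
q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time

result = df.loc[q_product & q_start & q_end, c('Time,Product')]
print(result)
```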
miraculixx
-1

Creating an empty DataFrame with known column names:

Names = ['Col1','ActivityID','TransactionID']
df = pd.DataFrame(columns = Names)

Creating a dataframe from a csv file:

df = pd.read_csv('...../file_name.csv')

Creating a dynamic filter to subset a dataframe:

i = 12
df[df['ActivityID'] <= i]

Creating a dynamic filter to subset required columns of the dataframe:

df[df['ActivityID'] == i][['TransactionID','ActivityID']]
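Put together as a runnable sketch (the row values are invented):

```python
import pandas as pd

# Known column names, plus a few toy rows so the filters have something to match
names = ['Col1', 'ActivityID', 'TransactionID']
df = pd.DataFrame([[1, 10, 100], [2, 12, 200], [3, 15, 300]], columns=names)

i = 12

# Dynamic row filter
filtered = df[df['ActivityID'] <= i]

# Same idea, keeping only the required columns
cols = df[df['ActivityID'] == i][['TransactionID', 'ActivityID']]
print(cols)
```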
fewlinesofcode