1

I'm a python beginner that's working with a large csv file of online order data.

I'm trying to see what skus people most frequently purchase with a specific sku, we'll call it grey-shirt711.

I'm struggling to express how to say "show all orders that contain grey-shirt771 and at least one other sku". I keep merely retrieving all orders that have grey-shirt711 in it, which 90% of the time is only that sku.

Assuming I'm only dealing with these two columns ('sku' and 'orderID'), what's the simplest way I could express this statement?

Thank you!

Joe H
  • 11
  • 1

1 Answers1

0

We'd like to help but you need to be a little more specific. Can you provide an example of what you've tried. Can you show us how you're reading in the data? Like Boris suggests, you'll likely want to do this using Pandas. Here's a snippet that will filter a dataframe on a column of your choosing:

import pandas as pd
import numpy as np

d = {'col1': [1, 2], 'col2': [3, 4]} # Should be your data import line...
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100, 2)),
                   columns=['sku','orderID'])

#%% Alternatively, load your data using Pandas by uncommenting the lines below
# df = pd.read_excel('path_to_your_file') #If using excel

# Method 1
filter1 = 6 #replace 6 with grey-shirt771
filter2 = 3 # replace this with another sku of interest
df_items_of_interest1 = df[(df['sku'] == filter1) | (df['sku'] == filter2)]

# Method 2
filter1 = 'sku == 6'
filter2 = 'sku == 3'
df_items_of_interest2 = df.query(filter1 + '|' + filter2)

# Method 3
df_items_of_interest3 = df[df['sku'].isin([6,3])]

Refer to this SO Post and the Pandas documentation for clarity.

I hope that helps. On behalf of the Stack Overflow community, I say welcome. To maximize the value you'll get from using this site (and to help us help you) try out some of these tips

zelusp
  • 3,500
  • 3
  • 31
  • 65
  • Thank you! I know it's tough to give much advice without me being more specific..I've just been trying random stuff (i.e. guessing), so I don't know if sharing would help? I'm trying to above methods and what I keep getting as output is just one sku or the other. For example: filter1 = 'blue-shirt' filter2 = 'white-socks' df_items_of_interest1 = df[(df['sku'] == filter1) | (df['sku'] == filter2)] is returning orders that have either blue-shirt or white-socks in them, but what I want is to see orders that have blue-shirt AND white-socks in them. Probably something I'm missing. – Joe H Jan 26 '18 at 23:20
  • If your `sku` items are on different rows like your picture shows then you're going to have to forward fill your `orderID` column (it appears it is missing data). You can do that with `df['sku'] = df['sku'].ffill()`. See [here](https://stackoverflow.com/questions/27012151/forward-fill-specific-columns-in-pandas-dataframe) for details. Then you can [groupby](https://pandas.pydata.org/pandas-docs/stable/groupby.html) `orderID`. Finally filter your dual conditional with something [like this](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.filter.html) – zelusp Jan 28 '18 at 06:52
  • If my post has helped you then please show your appreciation by upvote :) – zelusp Jan 28 '18 at 06:53