
I have a pandas DataFrame in Python with 3 columns and 80,000,000 rows.

The columns are {event_id, device_id, category}. Here are the first 5 rows of my df:

Each device has many events, and each event can have more than one category.

I want to run the Apriori algorithm to find out which categories appear together.

My idea is to create a list of lists that holds the categories occurring in the same event for each device, like [('a'), ('a','b'), ('d'), ('s','a','b')], and then pass that list to the algorithm as the transactions. I need help creating the list of lists.

If you have a better idea, please tell me. I am new to Python and this was the only way I could find.

Ayn

1 Answer


A little bit of a late response here, but to me it seems like apriori might not be the right choice for your data. Traditional apriori looks at binary data (either "in the cart" or "not in the cart" for the classic market basket example), for a list of transactions that are all of the same type. What you seem to have is a multilevel/hierarchical association question that might be better suited to a more scalable algorithm.

That said, to answer your formatting question: your first step would be to pivot your data so that each row is one transaction and each column is an item that can appear in a transaction. This can be achieved with DataFrame.pivot, and would look something like this (code from the docs, posted here for convenience):

>>> import pandas as pd
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6]})
>>> df
   foo bar  baz
0  one   A    1
1  one   B    2
2  one   C    3
3  two   A    4
4  two   B    5
5  two   C    6

>>> df.pivot(index='foo', columns='bar', values='baz')
bar  A  B  C
foo
one  1  2  3
two  4  5  6

From there you can create a list of lists from the dataframe using:

df.pivot(index='foo', columns='bar', values='baz').values.tolist()

That question was previously answered here.
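For the columns in the question, another way to get the list of lists directly is to group by device and event and collect the categories of each group. This is only a minimal sketch: the toy DataFrame and its values are made up for illustration, and I am assuming that one transaction means one (device_id, event_id) pair.

```python
import pandas as pd

# Toy stand-in for the real 80,000,000-row frame; the values are invented.
df = pd.DataFrame({
    "event_id":  [1, 1, 2, 3, 3, 3],
    "device_id": [10, 10, 10, 20, 20, 20],
    "category":  ["a", "b", "a", "s", "a", "b"],
})

# Assumption: one transaction = one (device_id, event_id) pair.
# For each group, collect the distinct categories as a sorted list.
transactions = (
    df.groupby(["device_id", "event_id"])["category"]
      .apply(lambda s: sorted(set(s)))
      .tolist()
)

print(transactions)
```

On the full data, a Python-level lambda per group may be slow, so it is probably worth trying this on a sample of devices first.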

If you end up using apriori, there's already a package that implements it, called apyori, which could save you some time.
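To make the idea of "categories that appear together" concrete, here is a rough standard-library sketch that counts how often each pair of categories shares a transaction. It is not apyori's API and not full Apriori (there is no candidate pruning and it only looks at pairs); pair_support and the 0.5 threshold are names and values I picked for illustration.

```python
from collections import Counter
from itertools import combinations

# Transactions in the shape described in the question.
transactions = [["a"], ["a", "b"], ["d"], ["s", "a", "b"]]

def pair_support(transactions, min_support=0.5):
    """Return {pair: support} for category pairs meeting min_support.

    Support is the fraction of transactions that contain both items.
    Brute-force counting of pairs only, not the full Apriori algorithm.
    """
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

frequent_pairs = pair_support(transactions, min_support=0.5)
print(frequent_pairs)
```

A real Apriori implementation extends this to larger itemsets and skips candidates whose subsets are already infrequent, which matters a lot at 80,000,000 rows.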

ZaxR