
Trying to determine the best way to count transactions in a range by identifier for a project. Below is a sample CSV dataset.

Customer A,$1.55 ,12:01
Customer B,$12.95 ,12:02
Customer A,$28.77 ,12:03
Customer C,$35.65 ,12:04
Customer A,$11.95 ,12:05
Customer C,$7.65 ,12:06
Customer B,$55.96 ,12:07
Customer C,$44.25 ,12:08

If I have thousands of transactions occurring per second, and I want to know the number (count) of transactions for each customer in a range, say $0-$10, $11-$20, $21-$30, $31-$40, and so on, what is the most efficient way to do this?

If I make sublists in Python like

customer_a = [1, 5, 12, 25, 18, 11]

then I can do it pretty easily with a count function of some sort like:

def count_range_in_list(li, lo, hi):
    ctr = 0
    for x in li:
        if lo <= x <= hi:
            ctr += 1
    return ctr

print(count_range_in_list(customer_a, 0, 10))

[source: https://www.w3resource.com/python-exercises/list/python-data-type-list-exercise-31.php ]
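For a high-volume stream, a single pass that bins each transaction as it arrives avoids building per-customer sublists entirely. A minimal sketch of that idea, assuming $10-wide buckets and the sample data from the question (bisect and Counter are both in the standard library):

```python
from bisect import bisect_right
from collections import Counter

# Bucket boundaries in dollars; an assumption matching $10-wide ranges.
edges = [0, 10, 20, 30, 40, 50, 60]

def bucket_of(amount):
    """Return the index of the bucket containing amount (0 = $0-$10)."""
    return bisect_right(edges, amount) - 1

# Count transactions per (customer, bucket) in one pass over the stream.
counts = Counter()
transactions = [
    ("Customer A", 1.55), ("Customer B", 12.95), ("Customer A", 28.77),
    ("Customer C", 35.65), ("Customer A", 11.95), ("Customer C", 7.65),
    ("Customer B", 55.96), ("Customer C", 44.25),
]
for customer, amount in transactions:
    counts[(customer, bucket_of(amount))] += 1

print(counts[("Customer A", 0)])  # Customer A's transactions in $0-$10 → 1
```

Each lookup is O(log k) in the number of bucket edges, so the whole pass is a single scan of the data rather than one scan per range.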

I'm just struggling to find an efficient way to do this from a large dataset. Since I know my customers, I could compare the transaction data against that customer list, parse it that way, and then do the count by range?

Any help getting started would be nice. Thanks.

PA_Commons
  • Have you considered using a dictionary to store the customers' values, as opposed to assigning each to its own variable? – JaonHax Sep 25 '20 at 20:12
  • I could certainly do that, but I still think my problem persists in that I would have to find an efficient way to count the number of transactions in the desired range. – PA_Commons Sep 25 '20 at 20:17
  • Using a multi-indexed dataframe would be good here: the customers as the first index and price ranges as the columns – Vishesh Mangla Sep 25 '20 at 20:18
  • @VisheshMangla I am reviewing this now, it looks promising. Thank you for the suggestion – PA_Commons Sep 25 '20 at 20:22
  • I'm not a regular pandas user, so I can't really help without a bit of practice, but it might be a good idea to look for related answers on SO, since I don't think the docs are friendly on the multi-indexing part – Vishesh Mangla Sep 25 '20 at 20:51

1 Answer


It sounds like you're trying to do an aggregation on a table, which has a couple of options. If you can get away with using dataframes then a pandas groupby operation would be the easiest, but definitely not the most efficient.
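As a sketch of that groupby route, pd.cut can do the binning and groupby the per-customer counts. The column names and $10-wide bins here are assumptions, using the sample CSV from the question:

```python
import io
import pandas as pd

# Sample data from the question; in practice you'd read from a file.
csv_data = """Customer A,$1.55,12:01
Customer B,$12.95,12:02
Customer A,$28.77,12:03
Customer C,$35.65,12:04
Customer A,$11.95,12:05
Customer C,$7.65,12:06
Customer B,$55.96,12:07
Customer C,$44.25,12:08"""

df = pd.read_csv(io.StringIO(csv_data), names=["customer", "amount", "time"])
# Strip the dollar sign so amounts can be compared numerically.
df["amount"] = df["amount"].str.lstrip("$").astype(float)

# Bin amounts into $10-wide buckets, then count per (customer, bucket).
bins = [0, 10, 20, 30, 40, 50, 60]
df["bucket"] = pd.cut(df["amount"], bins=bins)
counts = df.groupby(["customer", "bucket"], observed=True).size()
print(counts)
```

The result is a Series indexed by (customer, interval), which you can unstack into a customer-by-range table if that shape is easier to consume.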

You can also use binning methods; see this relevant post.

If SQL is an option, it would make the operation much easier and more efficient.

For example:

with customer_a as (
  select *
    , if(amount < 10, 1, 0) as cat_1
    , if(amount >= 10 and amount < 20, 1, 0) as cat_2
  from customer_transactions
  where customer = 'Customer A'
)

select customer
  , sum(cat_1) as cat_1_count
  , sum(cat_2) as cat_2_count
from customer_a
group by 1

Note the lower bound of each bucket after the first needs >= (not >), otherwise an amount of exactly $10 or $20 falls into no bucket.
willwrighteng