1

Good evening all,

I've been wrestling with this one all day.

I'm attempting to update a column in my customer table that will show the number of times they have been a subscriber. This table is created from a big data dump that has individual rows for every customer, and every month they were a subscriber (allocated an iterating number for each month). The names are unique. It looks something like this and shows that (for example) Jane Doe was a subscriber in period 1000, but not in period 1002.

Row_ID Customer_Name  Date_Code 
1      Jane Doe       1000      
2      Jane Doe       1001      
3      Jane Doe       1004      
4      Jane Doe       1005      
5      Ted Jones      1000      
6      Ted Jones      1001      
7      Ted Jones      1002      
etc...

In this case Jane Doe was a subscriber from 1000-1001, left our subscription, and then came back from 1004-1005. I have a master table that includes all the date logic (start date, end date, date code etc...). It looks roughly like this:

Start_Date   End_Date    Date_Code
1990-01-01   1990-03-31  1000
1990-04-01   1990-06-30  1001
1990-07-01   1990-09-30  1002
1990-10-01   1990-12-31  1003
etc...

I'm trying to find a way to make the output something like:

Customer_Name  Subscription_Count
Jane Doe       2
Ted Jones      1

Has anyone run into something like this before? It's obvious to me (as a human) that the numbers are (or not) consecutive and are (or not) a representation of the entire sample, but I'm not sure how to make MYSQL understand it. I appreciate any ideas.

*EDIT - I tried both the Join and Where Not Exists alternatives, and both timed out after 10 minutes. I believe it's due to the size of the main table (~100,000 lines). Do you have any suggestions? Thanks again for the all the comments.

**EDIT #2 - After adding indices and tweaking my tables a bit, both solutions work great. Thanks again for the support in figuring this out.

Skyline29
  • 45
  • 6
  • So is it correct to assume that consecutive numbers always form *one* subscription? And do we care about the magazine type? How to treat two different magazines for the same period of time? Aggregate all periods for all magazines per customer? – Erwin Brandstetter Mar 11 '12 at 02:20
  • Yes. Jane can only subscribe to one product at a time. As such, the date_code '1000' can only show up once for Jane. – Skyline29 Mar 11 '12 at 02:22
  • How to handle if Jane subscribes to magazine A in 1000, then B in 1001, then C in 1002? One subscription? Or three subscriptions? – Erwin Brandstetter Mar 11 '12 at 02:24
  • The way I have the data broken down, each 'subscription type' has it's own table, so we can ignore that possibility altogether (sorry, should have deleted that column from the info. presented above). In this case, assume that she either is a subscriber or not, and that product does not matter. – Skyline29 Mar 11 '12 at 02:27
  • Also, is `Customer_name` unique? That does not normally work, there should be some kind of `customer_id`. Please edit your question with the updated information. That's the recommended way. – Erwin Brandstetter Mar 11 '12 at 02:31
  • 1
    I edited the question per your recommendation. Customer name is unique. – Skyline29 Mar 11 '12 at 02:42

2 Answers2

1

Query could look something like this:

SELECT customer_name, count(*) AS subscriptions
FROM   tbl AS t
WHERE NOT EXISTS (
    SELECT *
    FROM tbl AS t1
    WHERE t1.customer_name = t.customer_name
    AND t1.date_code = t.date_code + 1
    )
GROUP BY customer_name;

The trick here is to exclude all rows but one per series of date_codes of a customer and then count: only the last row per block does not have a successor (date_code + 1).

I am assuming that successive date_codes form one subscription (as per my first comment to the question). Consequently the additional information on Start_Date and End_Date is not needed.


Performance

LEFT JOIN / IS NULL should in fact be a bit faster than NOT EXISTS in MySQL (as @nnichols supplied).
The much more important thing for performance are indexes. For this to be fast you need indexes on customer_name and on date_code. Like this:

CREATE INDEX tbl_customer_name ON tbl(customer_name);
CREATE INDEX tbl_date_code ON tbl(date_code);
Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • You are correct, consecutive date_codes constitute a single subscription. I'm running the query, but having some difficulty. To confirm, do both 't' and 't1' refer to the same table? – Skyline29 Mar 11 '12 at 03:03
  • @Skyline29: `t1` is an alias for the same table, yes. I added the (optional) keyword `AS` now to make it clearer. Two different aliases are needed for the self-semi-join `EXISTS`. – Erwin Brandstetter Mar 11 '12 at 03:13
  • @Skyline29: I added a bit about indexes to my answer. FYI: answerers do not get notified automatically about changes to the question. – Erwin Brandstetter Mar 11 '12 at 04:45
1

I cannot be 100% sure that this is still the case but LEFT JOIN / IS NULL is generally faster than NOT EXISTS in MySQL -

SELECT t1.customer_name, COUNT(*) AS subscriptions
FROM   tbl t1
LEFT JOIN tbl t2
    ON t1.customer_name = t2.customer_name
    AND t1.date_code + 1 = t2.date_code
WHERE t2.customer_name IS NULL
GROUP BY t1.customer_name

UPDATE The addition of a composite index across these two fields instead of the two single column indices gives a significant performance boost -

CREATE UNIQUE INDEX `UQ_customer_date_code` ON tbl (customer_name, date_code);

I have done some tests using a test table with 1.6 million records (100k customers across 21 date_codes). With this index added the query time reduces by about 80%. Using the LEFT JOIN instead of NOT EXISTS only reduces query time by about 15%.

user1191247
  • 10,808
  • 2
  • 22
  • 32
  • By moving the customer_name to a `customers` table with an integer surrogate key gave a further reduction in query time of nearly 30% and a space saving of 20% – user1191247 Mar 12 '12 at 15:41