
I am building functionality to estimate inventory for my ad-serving platform. The fields on which I am trying to estimate, with their cardinalities, are as below:

FIELD: CARDINALITY

location: 10000 (bengaluru, chennai, etc.)

n/w speed: 6 (w, 4G, 3G, 2G, G, NA)

priceRange: 10 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

users: the number of users falling under any combination of the above.

Ex. {'location': 'bengaluru', 'n/w': '4G', 'priceRange': 8, 'users': 1000}

means 1000 users are from bengaluru, have 4G, and have priceRange = 8.

So the total number of combinations can be 10000 * 6 * 10 = 600,000, and in the future more fields may be added, up to around 29 (currently there are 3: location, n/w, priceRange), so the total number of combinations can reach the order of 10mn. Now I want to estimate how many users fall under a given combination.

The queries I will need are as follows: 1) Find all users who are from location: bengaluru, n/w: 3G, priceRange: 6

2) Find all users from bengaluru

3) Find all users falling under n/w: 3G and priceRange: 8
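For concreteness, if the buckets above were rows of a single table, these lookups would just sum the users column (a sketch; the inventory table name and its columns here are only illustrative):

Select sum(users) from inventory where location = 'bengaluru' and nwSpeed = '3G' and priceRange = 6;

Select sum(users) from inventory where location = 'bengaluru';

Select sum(users) from inventory where nwSpeed = '3G' and priceRange = 8;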

What is the best possible way to approach this?

Which database is best suited for this requirement? What indexes do I need to build? Will a compound index help? If yes, then how? Any help is appreciated.

  • While I agree the combinations are 600,000 at this time, I believe your record count will be 1000*fields, or 29,000 if 29 fields, as each user has only one value for each of those attributes, correct? So you could have a userFields table which simply defines the user and the value for the field in question. – xQbert Apr 19 '17 at 19:46
  • 1000 is the number of users falling in that bucket. I believe the question is not clear as of now. Considering your approach, how will you find the number of users from bengaluru having priceRange as 7? – arpit bhawar Apr 19 '17 at 20:02
  • Maybe I didn't understand your solution clearly. Considering your approach, how will you find the number of users from bengaluru having priceRange as 7? Let us take an example to explain the problem a bit more. A few documents are there: {'location':'bengaluru', 'n/w':'4G', priceRange:8, users: 1000}, {'location':'bengaluru', 'n/w':'4G', priceRange:7, users: 10}, {'location':'chennai', 'n/w':'4G', priceRange:8, users: 100}. So for a query like "find all users from bengaluru" the result will be 1010, users having 4G will be 1110, and users having priceRange as 8 will be 1100. – arpit bhawar Apr 19 '17 at 20:09
  • `Select Count(1) from TableUservalues where (field,value) in (('location','bengaluru'),('priceRange',7)) having count(distinct concat(field,value))=2` Here's an example in a prior question: http://stackoverflow.com/questions/927724/what-is-a-sql-statement-to-select-an-item-that-has-several-attributes-in-an-item The only difference here is you have a paired value mapping, whereas they had only a key and a value. – xQbert Apr 19 '17 at 20:10
  • Other possible examples: http://stackoverflow.com/questions/24179584/what-will-be-the-sql-query-for-checking-same-pairs-of-column-values-in-a-table ... http://stackoverflow.com/questions/16803592/mysql-query-to-find-all-rows-that-have-the-same-values-as-another-row and more on the pros/cons of this approach: http://stackoverflow.com/questions/126271/key-value-pairs-in-relational-database – xQbert Apr 19 '17 at 20:23
  • What index will it use? – arpit bhawar Apr 19 '17 at 20:23
  • The table is structured UserID, Field, Value, so I'd have a unique composite index on all 3. Assume we can treat all values as character data; but we don't have to... we could have a means of knowing the paired value's data type and cast it in code if needed. Personally I think it would be wise to get a few others to look at this question and get their feedback! – xQbert Apr 19 '17 at 20:24
  • So if the fields are 29, then a composite index with 29 fields will be created, right? Now what if I want to make a query which extracts the aggregated sum of documents matching only 4 fields. Will the composite index be used in this? – arpit bhawar Apr 19 '17 at 20:27
  • I'd pay close attention to individuals responding with more than 20k rep! – xQbert Apr 19 '17 at 20:27
  • No. The table structure is just UserID, Field, Value. So the index is on 3 fields: {UserID, Field, Value}. {1,Location,bengaluru},{1,PriceRange,7},{2,Field29,xyz},{2,Location,chennai}: in this example users 1 and 2 both have location attributes, but neither are the same. This approach allows you to have a dynamic number of new attributes without having to modify the underlying table structure. Now, to support relationships, each join has to key off of the field/table and its value. So the join to Location would key off only records with a field of "LOCATION", and values must exist in the loc table. – xQbert Apr 19 '17 at 20:29
  • Sure. We should wait for others as well to get their feedback. I believe we have headed in another direction on this. Maybe the framing of the question is misleading. I will reframe the question if others also misinterpret it. Thanks xQbert. – arpit bhawar Apr 19 '17 at 20:34
  • USERID is not the unique identifier for the user. I have mentioned users and not USERID. users in my question are the number of users falling under that bucket. – arpit bhawar Apr 19 '17 at 20:37
  • I think I understand what you're saying; I'm not sure you understand me yet... To get us on the same page, here's an example on [SQL fiddle](http://sqlfiddle.com/#!9/4d230c/1/0) of what I'm thinking... – xQbert Apr 19 '17 at 20:48
  • Went through the example, thanks for your effort. This will work, though it was an entirely different approach than I was thinking. Just one thing more to close this: we have a userbase of about 62mn, and let's say most users have most of the fields. Not sure if this solution will scale? Wouldn't the query be too slow? – arpit bhawar Apr 19 '17 at 21:08
  • I need to revise my example to better support normal form and integrity; however, I would think this scales better, as no structure changes will be required when you add a new attribute. In addition, with indexes, and table partitioning if needed, performance should be within tolerance for the expected volume of data. I'll revise the example in the next few hrs. – xQbert Apr 20 '17 at 11:51
  • http://sqlfiddle.com/#!9/b8621/3/0 I like this approach better as you can add any attribute you want as data. No code changes as you add more attributes; and if the user interface (UI) is designed correctly, it can dynamically handle N attributes. So no code changes on the UI or database as attributes change/scale. Plus referential integrity can be enforced at the database layer, so no bad data. Indexes are straightforward for performance. If performance does start to become an issue, partitioning the tables in the database by attribute could speed things up, allowing the engine to parallel process. – xQbert Apr 20 '17 at 12:51
  • Also, with this approach, if an attribute name changes it changes in one place. There were a few things rubbing me wrong on my initial approach, but we were so far apart I didn't want to really confuse you by eliminating all your attributes in tables. – xQbert Apr 20 '17 at 12:53
  • [Occam's Razor](https://en.wikipedia.org/wiki/Occam%27s_razor), right? All things being equal, the simpler of two answers is the better, right? (Simpler theories are preferable to more complex ones because they are more testable.) And this update just seemed to simplify things. – xQbert Apr 20 '17 at 13:15
  • @xQbert Thanks. This works for me. – arpit bhawar Apr 23 '17 at 13:19

1 Answer


Here's my final answer:

-- Attribute: lookup table of attribute names (Location, nwSpeed, PriceRange, ...)
Create table Attribute(
  ID int,
  Name varchar(50));

-- AttributeValue: lookup table of the allowed values for each attribute
Create table AttributeValue(
  ID int,
  AttributeID int,
  Value varchar(50));

-- userAttributeValue: one row per user per attribute;
-- AttributeID and AttributeValueID reference Attribute.ID and AttributeValue.ID
Create table userAttributeValue(
  userID int,
  AttributeID int,
  AttributeValueID int);

Create table User(
  ID int);

Insert into User (ID) values (1),(2),(3),(4),(5);

Insert into Attribute (ID,Name) Values (1,'Location'),(2,'nwSpeed'),(3,'PriceRange');
Insert into AttributeValue values 
  (1,1,'bengaluru'),(2,1,'chennai'),
  (3,2, 'w'), (4, 2,'4G'), (5,2,'3G'), (6,2,'2G'), (7,2,'G'), (8,2,'NA'),
  (9,3,'1'), (10,3,'2'), (11,3,'3'), (12,3,'4'), (13,3,'5'), (14,3,'6'), (15,3,'7'), (16,3,'8'), (17,3,'9'), (18,3,'10');

-- (userID, AttributeID, AttributeValueID): e.g. (1,1,1) = user 1, Location, bengaluru
Insert into userAttributeValue (userID, AttributeID, AttributeValueID) values
(1,1,1),
(1,2,5),
(1,3,9),
(2,1,1),
(2,2,4),
(2,3,13),
(3,2,6),
(4,1,1),
(4,2,4),
(4,3,13),
(5,1,1),
(5,2,5),
(5,3,13);

-- Users matching Location = bengaluru (1,1) AND nwSpeed = 4G (2,4);
-- the ':' separator avoids collisions such as concat(1,12) = concat(11,2).
Select userID
from userAttributeValue
where (AttributeID, AttributeValueID) in ((1,1),(2,4))
group by userID
having count(distinct concat(AttributeID, ':', AttributeValueID)) = 2;

Now if you need a count rather than the list of users, wrap the query above in an outer select and count the rows it returns: each user has one record per matched attribute, and the HAVING clause keeps only users matching all attributes passed in, so the grouped query yields exactly one row per qualifying user.
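For example, a minimal sketch of that count form, using the same tables as above:

Select count(*) as matchingUsers
from (
  Select userID
  from userAttributeValue
  where (AttributeID, AttributeValueID) in ((1,1),(2,4))
  group by userID
  having count(distinct concat(AttributeID, ':', AttributeValueID)) = 2
) matched;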

  1. This allows for N growth of Attributes and the AttributeValues per user without changes to the UI or database, if the UI is designed correctly.
  2. By treating each datapoint as an attribute and storing them in one place, we can enforce database integrity.
  3. Attribute and AttributeValue become lookup tables for userAttributeValue, so you can translate the IDs back to the attribute name and the value.
  4. This also means we only have 4 tables: User, Attribute, AttributeValue, and userAttributeValue.
  5. Technically you don't have to store AttributeID on userAttributeValue, but for performance reasons on later joins/reporting I think you'll find it beneficial.
  6. You need to add proper primary keys, foreign keys, and indexes to the tables; see the sketch after this list. They should be fairly self-explanatory. On userAttributeValue I would have a few composite indexes, each with a different order of the unique key. It just depends on the type of reporting/analysis you'll be doing, but adding keys as performance tuning is needed is commonplace.
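A minimal sketch of those keys and one such composite index (the index name and the column order chosen here are my own assumptions; MySQL syntax as in the fiddle):

Alter table Attribute add primary key (ID);
Alter table AttributeValue add primary key (ID),
  add foreign key (AttributeID) references Attribute(ID);
Alter table User add primary key (ID);
Alter table userAttributeValue add primary key (userID, AttributeID),
  add foreign key (userID) references User(ID),
  add foreign key (AttributeID) references Attribute(ID),
  add foreign key (AttributeValueID) references AttributeValue(ID);

-- Attribute-first ordering suits filters like the inventory queries above:
Create index IX_uav_attrFirst on userAttributeValue (AttributeID, AttributeValueID, userID);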

Assumptions:

  1. You're OK with all data values being varchar data in all cases.
  2. If needed, you could add a datatype, precision, and scale on the Attribute table and allow the UI to cast the attribute value as needed. But since the values are all stored in the same column in the database, they all have to be the same datatype and of the same precision/scale.
  3. Pivot queries to display the data across columns will likely be needed, and you know how to handle those (and your engine supports them!); see the sketch after this list.
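A hedged sketch of such a pivot via conditional aggregation (MySQL has no PIVOT operator, so one MAX(CASE ...) per attribute stands in for it):

Select uav.userID,
       max(case when a.Name = 'Location'   then av.Value end) as Location,
       max(case when a.Name = 'nwSpeed'    then av.Value end) as nwSpeed,
       max(case when a.Name = 'PriceRange' then av.Value end) as PriceRange
from userAttributeValue uav
join Attribute a on a.ID = uav.AttributeID
join AttributeValue av on av.ID = uav.AttributeValueID
group by uav.userID;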

Gotta say I loved the mental exercise; but I would still appreciate feedback from others on SO. I've used this approach in one system I've developed, and it's been in two I've supported. There are some challenges, but it does follow 3rd normal form DB design (except for the replicated AttributeID in userAttributeValue, but that's there for performance gain in reporting/filtering).

xQbert