
I created a database that has the following table for user activities:

user_id        |  INTEGER  |  user being notified
actor_id       |  INTEGER  |  user performing the action
activity_type  |  STRING   |  classname/type of the object being notified
activity_id    |  INTEGER  |  id of the object being notified
context_type   |  STRING   |  classname/type of the object's parent
context_id     |  INTEGER  |  id of the object's parent
read/view_at   |  DATETIME |  timestamp of when user saw it
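
For concreteness, a minimal T-SQL sketch of this table (the table name, the surrogate `id` column, and the `viewed_at` spelling are my own assumptions; `read/view_at` would otherwise need bracket-quoting):

```sql
-- Hypothetical DDL matching the column list above; STRING is rendered as NVARCHAR.
CREATE TABLE dbo.activities (
    id             INT IDENTITY(1,1) PRIMARY KEY,
    user_id        INT           NOT NULL,  -- user being notified
    actor_id       INT           NOT NULL,  -- user performing the action
    activity_type  NVARCHAR(100) NOT NULL,  -- classname/type of the object
    activity_id    INT           NOT NULL,  -- id of the object
    context_type   NVARCHAR(100) NULL,      -- classname/type of the object's parent
    context_id     INT           NULL,      -- id of the object's parent
    viewed_at      DATETIME2     NULL       -- when the user saw it
);
```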

I don't know if this design is good in terms of performance, so I want to test it somehow. I can populate the database with a lot of data and open a page that reads from this table, but I don't know if that is a relevant test.

Is there a good way to see how this design will perform when there are 10,000 users and 50,000,000 records in the feed table?

I don't know if this helps, but I am using MS SQL Server.

1110
  • Build it; measure it; **then** tune it. – Pieter Geerkens Jan 26 '14 at 17:32
  • That means putting this design into production. But what if I get 50,000,000 activities and find out that users can't open the page because of a slow database? I need to be sure that this design for activities will work. – 1110 Jan 26 '14 at 17:35
  • 2
    What is your hardware? Are you using SSD's? What are your indices? Have/Can you design schema-bound views to cache aggregates? What are your FK's? I have only begun to ask the relevant questions needed to address your question. **Just build a 3NF design** and address performance concerns when you actually have a problem. – Pieter Geerkens Jan 26 '14 at 17:38
  • What @PieterGeerkens is saying is that it would be easy to create a DB with those numbers and test it under the QPS that you need. It depends on your queries and more. – Mzf Jan 26 '14 at 17:39
  • Do you really expect 50M hits when you start your application up? Bit odd of an estimate to me. – Rachcha Jan 26 '14 at 17:40
  • This is for a start-up. I will use shared hosting, so nothing special about hardware. There are no FKs with this design; this will be a standalone table. There are other tables like Comments, Posts, Groups, etc. I don't expect fast growth to 50,000,000 :) but I need to design this database the right way. I have never built anything close to this (a mini social network) and must be sure that this design will last for a long time. – 1110 Jan 26 '14 at 17:47
  • Please note that using things like `activity_type` to "choose" the table to reference (instead of having proper FOREIGN KEYs) is considered an anti-pattern. For reasons why, you may want to look at: [Sql - Indirect Foreign Key](http://stackoverflow.com/a/20873843/533120). For some alternatives, you may want to look [here](http://stackoverflow.com/a/13317463/533120). – Branko Dimitrijevic Jan 27 '14 at 00:18

3 Answers


If performance is critical, it's something you need to build into your development pipeline.

What I've done on performance-critical projects:

  • Create a performance test data set; in your case, I'd load a table with 100M test records, making the best possible guess at the distribution of the data (see the T-SQL sketch after this list).
  • Write the queries the application will be using to query that data.
  • Write a test script which can execute those queries; I like Gatling.
  • Make a guess at how my test hardware relates to the production kit. Ideally, it's identical - but in practice, that may be too expensive. You have to improve that guess based on what you see when you can observe production behaviour.
  • Make a guess on traffic patterns - 10K users isn't that much if they come in a nice, evenly distributed pattern (10K / 24 hours / 60 minutes / 60 seconds = less than 1 request per second); if they all turn up at exactly the same time, you have a problem. I typically have 3 load patterns:
    • gradual ramp up (this is "best case"),
    • aggressive ramp-up (this reflects what you'd expect on busy time periods, e.g. if your users tend to visit within a 1 hour period),
    • and worst-case (reflecting what happens if there's a viral video spike or similar)
  • Schedule my Gatling tests to run at least once every day, ideally as part of each check-in.
  • Set targets for performance, e.g. "90% of queries in 500ms or less"; when the tests exceed these targets, dedicate time and attention to improving
  • Once you have access to live infrastructure, run tests
  • Once you have visitors, monitor query performance and adjust your load tests and targets.
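
For the first bullet, a rough T-SQL sketch of generating synthetic rows, assuming the question's table exists as `dbo.activities` with a `viewed_at` column; the modulo constants and type names are placeholder guesses at the data distribution:

```sql
-- Hypothetical test-data generator: a tally built by cross-joining system
-- views, capped at the target row count. In practice you would run this in
-- smaller batches to keep the transaction log manageable.
SET NOCOUNT ON;
DECLARE @rows BIGINT = 100000000;  -- the 100M suggested above

;WITH n AS (
    SELECT TOP (@rows)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
    CROSS JOIN sys.all_objects c
)
INSERT INTO dbo.activities
       (user_id, actor_id, activity_type, activity_id,
        context_type, context_id, viewed_at)
SELECT i % 10000,                                    -- spread over 10,000 users
       (i * 7) % 10000,                              -- arbitrary actor
       CASE i % 3 WHEN 0 THEN N'Comment'
                  WHEN 1 THEN N'Post'
                  ELSE N'Group' END,
       i % 1000000,
       N'Post',
       i % 500000,
       CASE WHEN i % 4 = 0 THEN NULL                 -- ~25% unread
            ELSE DATEADD(SECOND, -CAST(i % 31536000 AS INT), SYSDATETIME())
       END
FROM n;
```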
Neville Kuyt

This has worked for me so far:

  1. Identify the set of queries that are performance-critical.¹
  2. Design the database so the DBMS can use fast execution plans for the queries identified above.²
  3. Fill the database with representative amounts of test data and make sure the query execution plans anticipated above are really being used (and the performance of each individual query is satisfactory); a brief sketch follows this list.
  4. Benchmark the whole "vertical stack" of your software. By now, there shouldn't be too many bad surprises.
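
As an illustration of step 3, using the feed query that comes up in the comments below (the `dbo.activities` table, the `viewed_at` column, and the index name are my own stand-ins):

```sql
-- Hypothetical plan check: run the typical feed query with statistics on
-- and inspect the actual execution plan.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT TOP (10) *
FROM dbo.activities
WHERE user_id = 123
ORDER BY viewed_at DESC;

-- If the plan shows a full scan followed by a sort, an index like this
-- should turn it into a seek that returns rows already in order
-- (TOP 10 then costs only ten key lookups):
CREATE INDEX IX_activities_user_viewed
    ON dbo.activities (user_id, viewed_at DESC);
```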

As you can see, database design is not just about fulfilling business requirements. Indeed, understanding how clients intend to access the data is an integral part of the design process.

In other words, what is traditionally thought of as "logical design" is not enough - there needs to be a "two-way street" between logical and physical design.

> Is there a good way to see how this design will perform when there are 10,000 users and 50,000,000 records in the feed table?

That sentence doesn't contain enough information to be answered.

First tell us how you intend to access the data; then perhaps we can help you (re)design the database to support that particular access pattern efficiently.


¹ For example, if "loading a page" is critical for the user experience (which it usually is), make a list of queries that will typically be executed during that process.

² Which includes "physical" techniques such as indexing, clustering, partitioning, etc., but can also include some aspects that are traditionally thought of as "logical" design, such as the design of keys and relationships. If you understand the underlying data structures that the typical DBMS uses, you'll be able to predict the performance consequences of most of your design decisions. Use The Index, Luke! is an excellent introduction to the topic.

Branko Dimitrijevic
  • Based on the schema I need to get all records that interest a particular user, so this would be a simple `select top 10 * from activities where user_id = 123 order by [read/view_at] desc`. I thought that, since I have all the IDs here, I could join to different tables based on type, but I think that would cause problems, so I decided to keep as much data as I can in this table. I know this denormalizes the database, but I can't find a better way to do this. – 1110 Jan 27 '14 at 14:34

Regarding performance, no assumptions are safe. You should follow these steps:

  • Define performance goals regarding load and response time. You should identify the acceptable response times for specific data loads.

  • Define resources. Not every machine configuration will behave the same. You should build a Virtual Machine with the target configuration.

  • Execute load testing. You could do it by using tools like SQL Server load generator; a minimal T-SQL timing loop is sketched below.
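
As a starting point, a minimal single-connection timing loop in T-SQL (again assuming a `dbo.activities` table with a `viewed_at` column; a real load tool adds concurrency, which this does not measure):

```sql
-- Hypothetical timing harness: run the feed query 1,000 times for random
-- users and report the average latency. Assigning into a variable avoids
-- streaming 1,000 result sets back to the client.
DECLARE @i INT = 0, @sink INT, @userId INT,
        @start DATETIME2 = SYSDATETIME();

WHILE @i < 1000
BEGIN
    SET @userId = ABS(CHECKSUM(NEWID())) % 10000;   -- random user id

    SELECT TOP (10) @sink = activity_id
    FROM dbo.activities
    WHERE user_id = @userId
    ORDER BY viewed_at DESC;

    SET @i += 1;
END;

SELECT DATEDIFF(MILLISECOND, @start, SYSDATETIME()) / 1000.0
       AS avg_ms_per_query;
```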

Hope this helps!

Pantelis Natsiavas