redundancy in the database vs efficiency in the code

Question

I would like to explain my question with an example. Let's say there are many courses that students can register. Each course can have many discussion walls. Each discussion wall contains questions. Each question may have replies. And badges can be assigned to replies (or comments).

In my case, I need to know that which reply belongs to which course (when listing replies), and the same for the badges. I am able to do this with Entity Framework but the queries are becoming very complex and causing performance problems.

In this scenario, is it better to have a CourseId column in Replies (or BadgeAssignments) table? Or not? This would make my life a lot easier in some aspects, but not sure in long term. What do you think? Is it sometimes better to have some redundancy? I do no think I will need to update CourseId field later.

It is my belief that there is a best possible equilibrium between the two that can be achieved, but without more concrete information on size, usage or structure it would be difficult to assess, or is this hypothetical? In which case shouldn't it be on programmers? — J-Boss, Sep 05 '15 at 03:59
It is an app I am working on. Can you elaborate what you mean by the size, structure, et.? @J-Boss — renakre, Sep 05 '15 at 04:02
Well for example since you are doing this in the Entity Framework, How many student records are you expecting or do have, how many courses are you like to have in total, etc. For usage how many accesses, by what means, how many of these highly complex joins are you likely to get? can better be solved by renormalization of the object framework? by structure I mean How many entities, with how many types? For Example: — J-Boss, Sep 05 '15 at 04:26
@J-Boss I may have first thousands (and later it may event get to millions of students), and so many courses. I have around 50 entities. Having an additional CourseId field would prevent some complex joins. What do you think? What is the best practice? — renakre, Sep 05 '15 at 12:29

score 2 · Answer 1 · answered Sep 08 '15 at 16:13

My pet peeve is sacrificing data integrity for performance. Obtaining a less-than-reliable answer faster is not a good solution. However, changes that improve performance that do not sacrifice data integrity are fine.

Redundancy may well sacrifice data integrity. It is certainly a critical point where anomalous data can start. The problem is that both "sets" of data must be rigidly synchronized which, depending on the design, may be easy or difficult to do. Either way, it takes system resources to maintain the synchronization so you are adding another hit on performance.

Fortunately, that performance hit will be added to the DML operations as that is where the synchronization will be performed. In general, shifting performance time from queries to DML (which are usually less sensitive to response time) can be a good solution.

The devil is in the details, however, and you provide no details. Can the performance be improved sufficiently without redundancy? What is the level of difficulty in maintaining synch between the redundant data? Another way of asking that last question would be: how likely is it for anomalous (unsynched) data to creep into the system? How much of a problem will unsynched data be and how difficult will it be to fix it?

There is not nearly enough information provided to answer these questions. But keep them in mind as you investigate solutions.

score 2 · Answer 2 · answered Sep 09 '15 at 18:23

Each component of your system should be utilized as it was designed to make it the "best". Things work better when they work according to their design. This, strictly, is my answer to your question.

The Relational Database

The purpose of a relational database is first to govern the integrity of your information and second to provide a storage and retrieval system. The RDMS governs your truth which then determines the way it should be stored and retrieved.

Since it is difficult, but not impossible, for us to imagine the uniqueness of digital discussion walls and of questions and replies we will typical use surrogate keys (i.e. auto generated numbers) for those entities' primary keys. This means the decision to add the Course ID to Questions, Replies, or BadgeAssignments will violate the principals relational design. You may say "no biggie" in this instance, but it is a violation nonetheless and will have consequences as long as it persists (pun intended).

If we used natural keys for Courses, Walls, Questions, Replies, and BadgeAssignments then our Primary Keys for each of those tables would be composites from those tables. We would then, for example, have the Primary Key of Courses within the Composite Primary Key of Replies without violating any principal of redundancy or normalization and your life would be "easier".

That said, what is so hard about this query?

SELECT
    D.CourseId, D.CourseName
    ,A.ReplyId, A.ReplyName
FROM
    Replies A
    JOIN Questions B On A.QuestionId = B.QuestionId
    JOIN Walls C ON B.WallId = C.WallId
    JOIN Courses D ON C.CourseId = D.CourseId

Entity Framework

Entity Framework (EF) can be configured to match your design whether we put CourseId in Replies or whether we rely on the joins. But, we can usually do better than EF when it comes to SQL performance.

One option would be a to craft a SQL query (starting with the one above) that has the highest amount of optimization according to your need, and turn it into a View. Then, map a C# class to the View (instead of the tables) and the interactions are simplified. We would be letting EF exceed in providing low hassle data access and SQL succeed at retrieving data.

Here is the difference in C# Linq...

var replies = context.Replies
    .Where(x => x.Questions.Walls.CourseId == 1)
    .Select(x => new ReplyView
    {
        CourseId = x.Questions.Walls.Courses.CourseId,
        CourseName = x.Questions.Walls.Courses.CourseName,
        ReplyId = x.ReplyId,
        ReplyName = x.ReplyName
    }).ToList();

versus

var replies = context.RepliesView.Where(x => x.CourseId == 1).ToList();

thanks for your answer! I have a question. If I want to retrieve the list of Questions and the CourseId for each question then how can I use `ReplyView` ? Do I need to use this first `context.RepliesView.Where` then map the results to new Course objects? — renakre, Sep 14 '15 at 15:19
sorry for the delay... you would probably need a new class, call it `QuestionView`, that would use the JOINS back to the Courses entity. When you want to retrieve the `QuestionsView` object(s), the you would use something like `context.QuestionView.Where` to query that `DbSet`. — travis.js, Sep 17 '15 at 01:42

score 1 · Answer 3 · edited May 23 '17 at 12:29

1

Since you have tagged your question with entity-framework, I'll assume you are using SQL Server, in which case you may consider using indexed views to "cache" the JOINs without worrying this cache will ever go out of sync - the DBMS will maintain it for you at all times.

For example, you can cache the JOIN between courses, students, discussion walls, questions, replies and badges. So when you want to know which badge belongs to which course, you just retrieve a single row from the indexed view, instead of performing the physical JOIN.

Alternatively, consider redesigning your keys and using identifying relationships to migrate key fields down the foreign key hierarchy, so when querying a child table you can get the key of a non-direct parent without JOINing the tables "in between".

And last but not least, I warmly recommend reading Use the Index, Luke! for fundamental knowledge every developer should have about database performance...

edited May 23 '17 at 12:29

Community

1
1

answered Sep 12 '15 at 20:10

Branko Dimitrijevic

50,809
10
93
167

thanks for the answer, do you imply Views when you mentioned caching? – renakre Sep 13 '15 at 02:52
@erkaner I imply indexed views. "Views" and "indexed views" are somewhat related but not the same thing. – Branko Dimitrijevic Sep 13 '15 at 03:00
I will be checking these, thanks for this valuable information. Now, I do not know who should get the bounty :( they are all great answers.. – renakre Sep 13 '15 at 03:25
1

@BrankoDimitrijevic Indexed views are only Enterprise SQL feature – Akash Kava Sep 13 '15 at 18:35
@AkashKava Nope, they are available in all editions of SQL Server, even LocalDB. The difference is that Enterprise Edition's query optimizer can automatically use indexed view even if you haven't mentioned it explicitly in your query. In other editions, you need to query on indexed view directly. – Branko Dimitrijevic Sep 14 '15 at 08:13

score 0 · Answer 4 · answered Sep 05 '15 at 04:28

I'll post an example here :

public class SchoolEntities : DbContext 
{ 
    public DbSet<Department> Departments { get; set; } 
} 

public class Department 
{ 
    // Primary key 
    public int DepartmentID { get; set; } 
    public string Name { get; set; } 

    // Navigation property 
    public virtual ICollection<Course> Courses { get; set; } 
} 

public class Course 
{ 
    // Primary key 
    public int CourseID { get; set; } 

    public string Title { get; set; } 
    public int Credits { get; set; } 

    // Foreign key 
    public int DepartmentID { get; set; } 

    // Navigation properties 
    public virtual Department Department { get; set; } 
} 

public partial class OnlineCourse : Course 
{ 
    public string URL { get; set; } 
} 

public partial class OnsiteCourse : Course 
{ 
    public string Location { get; set; } 
    public string Days { get; set; } 
    public System.DateTime Time { get; set; } 
}

and that's a small example ... do have any of this information?

thanks for the example, yes I have around 50 entities in total, do not know which to share. I am just trying to learn the best practice in this. — renakre, Sep 05 '15 at 12:30

redundancy in the database vs efficiency in the code

4 Answers4