7

I have some data that I need to put into a PostgreSQL database. The data relate to schools, so there are a LOT of attributes describing each school, mostly small integers, floats or short text values. All of the data changes yearly, so I'm creating an entity called YearlyData and putting the attributes there. The thing is, the number of attributes is around 50-60. These cannot be normalized because they are plain attributes of the school itself, so I can't in good conscience split them into separate tables. But I'm not sure whether this would hurt my performance.

I could try to categorize the data, put each category in its own table and point to those tables from the YearlyData table. But then searching for schools by 20-30+ parameters would require an insane number of joins, I'm guessing. I'm also not sure whether that would hurt my performance.

Any expert advice?

Bibhas Debnath
  • 14,559
  • 17
  • 68
  • 96
  • I don't think 50-60 columns is going to cause problems unless they're really wide columns, like lots of text and blob data. What sort of data are we talking about? If it's mostly integers, dates, etc. then it's probably fine. – David Jul 27 '13 at 19:50
  • yeah. mostly integers, floats or small texts. – Bibhas Debnath Jul 27 '13 at 19:51
  • Should be fine, most likely. You can get actual metrics by testing it, of course. Create the table, fill it with tons of random junk data, run through some scenarios in the application and measure the performance. Additionally, how many records do you expect this table to have? Indexing and optimization is going to come into play at some point, and if there's a lot of data then it might make sense to break up some of it into tables optimized for searches. – David Jul 27 '13 at 19:54
  • There'd be about 100,000 records for now. I don't really have experience with that much data, so I have my doubts. I guess the best I can do is what you said: test it out. – Bibhas Debnath Jul 27 '13 at 19:59
  • Testing it is always going to be the best source of information. 100,000 records isn't really a *lot* of data, so again it's probably going to be fine. If you end up putting a lot of indexes on a lot of those columns then you might start getting diminishing returns in performance, though. At that point you might need to break it into a couple of tables, where searches are run against search-optimized tables with specific indexes and the results joined with the other tables for the remaining data. It's something to think about, but not necessary until there are real metrics to prove it. – David Jul 27 '13 at 20:06
  • @JustBob Actually, that's not quite right for PostgreSQL. PostgreSQL *cannot* store rows across multiple pages *at all*. However, you can still store rows much larger than one page because many columns [can be stored out-of-line and optionally compressed in `TOAST` tables](http://www.postgresql.org/docs/current/static/storage-toast.html). The storage required in the main table row is tiny even for huge values. So: the main row can't split over an 8k page boundary, limiting the *number* of columns, but out-of-line `TOAST` storage means you can store very big values in columns. – Craig Ringer Jul 28 '13 at 10:19
  • @Craig Ringer Yeah, I had tried to delete the comment and reword it to say that rows within a page have to fit within the 8kb unless you go to TOAST for the datatypes that support it. – Kuberchaun Jul 28 '13 at 14:33
  • I need to get a grip on the concept of TOAST.
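
Picking up David's suggestion to test with junk data: a minimal sketch of how that could be done, assuming a throwaway table (yearly_data_test and its columns are placeholders, not names from the question) and the ~100,000 rows mentioned above.

-- Create a cut-down copy of the wide table and fill it with ~100,000 random rows
CREATE TABLE yearly_data_test (
    yearly_data_id serial primary key,
    school_id      integer,
    number_of_boys integer,
    avg_score      double precision,
    notes          text
    -- ... plus the remaining attribute columns
);

INSERT INTO yearly_data_test (school_id, number_of_boys, avg_score, notes)
SELECT n % 5000,                  -- pretend school id
       (random() * 500)::integer, -- small integer attribute
       random() * 100,            -- float attribute
       md5(n::text)               -- short text attribute
FROM   generate_series(1, 100000) AS n;

-- Then time realistic searches against it
EXPLAIN ANALYZE
SELECT *
FROM   yearly_data_test
WHERE  number_of_boys > 200 AND avg_score BETWEEN 40 AND 60;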

2 Answers

4

PostgreSQL stores rows on what are called data pages, which are 8kb in size. You can think of these as buckets with a finite size limit. The disadvantage of wide rows is that the database can fit fewer rows on each data page. It's faster for the database engine to bring back 1,000 rows from one page than to bring back 1,000 rows spread across many pages: one page read versus many, with disk IO being your enemy. This is something to be aware of, not to avoid. Wide tables are often needed and you can live with the overhead. In your case a row will be roughly 240 bytes (4 bytes per integer * 60 columns).
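
To put rough numbers on this for the actual table, one possible check (a sketch only; yearly_data is a placeholder for the real table name) is to ask PostgreSQL how wide the rows really are and which columns could be pushed out to TOAST:

-- Average on-disk width of a row, plus the table's total size
SELECT avg(pg_column_size(t.*)) AS avg_row_bytes,
       pg_size_pretty(pg_total_relation_size('yearly_data')) AS total_size
FROM   yearly_data AS t;

-- Storage strategy per column: 'p' = always stored inline (plain);
-- 'm', 'x' and 'e' mean the value may be compressed and/or moved out-of-line to TOAST
SELECT attname, attstorage
FROM   pg_attribute
WHERE  attrelid = 'yearly_data'::regclass
AND    attnum > 0
AND    NOT attisdropped;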

Kuberchaun
  • 29,160
  • 7
  • 51
  • 59
  • Do you mean 240 bytes? The key-row table is a good idea for storing the data, but I need to perform heavy queries on it. How did you handle that? Queries like "where key is this and value is that, do this"? – Bibhas Debnath Jul 28 '13 at 06:55
  • 4
    This technique is called ["Entity Attribute Value"](http://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model) and is generally considered a DB anti-pattern or even a performance timebomb. The latter because during testing (usually with small datasets) everything is fine, and production _starts_ fine (small table, all tuples of one parent are in the same page...) but degrades heavily over time. And by then the table is too big to do a regular reorg job (due to time constraints). – A.H. Jul 28 '13 at 08:17
  • 1
    It gets ugly fast when you're trying to do conditional queries on several attributes. You get horribly inefficient chains of many, many joins. It's considered an anti-pattern for a reason. Not to say you shouldn't use it sometimes, the cost/benefit just has to be weighed carefully. – Craig Ringer Jul 28 '13 at 09:57
  • @A.H. You forgot to add "There are certain cases where an EAV schema is an optimal approach to data modelling for a problem domain." It all depends on many factors, and on testing it out for your use case in the end. – Kuberchaun Jul 28 '13 at 14:45
  • @JustBob: Those cases are very few and - nowadays - better solved by NoSQL databases or by non-standard extensions such as hstore for PostgreSQL. In every live case I have seen, EAV was implemented not because the domain required it but because some intermediate framework had (again) found a "great" idea for being ultra-flexible. Hence I'm a bit pointy about that. – A.H. Jul 28 '13 at 17:41
  • @A.H."some intermediate framework has found (again) a "great" idea how to be ultra-flexible" 100% agree. If it's the next great discoverty and out of the box it tends to let out a stink in the long run. – Kuberchaun Jul 28 '13 at 17:48
  • @A.H. Ah, the wonders of the inner-platform effect in RDBMSes. "We invented this way to make a totally flexible user-defined DB on top of this icky rigid inflexible system of tables..." – Craig Ringer Jul 28 '13 at 23:16
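
For illustration of the joins these comments are warning about: a sketch of what a multi-attribute search tends to look like once the attributes live in a key/value (EAV) table. The school_attribute table and its columns are hypothetical, not something proposed in this answer.

-- One extra join per search parameter; with 20-30 parameters the plan gets very ugly
SELECT d.school_id
FROM   yearly_data d
JOIN   school_attribute a1
  ON   a1.yearly_data_id = d.yearly_data_id
 AND   a1.attr_key = 'has_computers'  AND a1.attr_value = 'true'
JOIN   school_attribute a2
  ON   a2.yearly_data_id = d.yearly_data_id
 AND   a2.attr_key = 'number_of_boys' AND a2.attr_value::integer > 100
-- ... and so on, one more join for every additional attribute in the search
;
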
4

There are a few things to consider here:

  • Does the list of attributes change significantly over time?
  • Does the list of attributes require custom, user-defined attributes?
  • Are there different attributes for different schools (i.e. many attributes only apply to one or a few schools)?

If any of these are true, you might think about a properties store approach like EAV, hstore, json fields, xml fields, etc.
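
As a concrete example of one of those options (a sketch only; the extra_attrs column name is made up), hstore keeps the genuinely dynamic attributes in a single indexed key/value column alongside the fixed ones:

CREATE EXTENSION IF NOT EXISTS hstore;

ALTER TABLE yearly_data ADD COLUMN extra_attrs hstore;

-- GIN index so containment searches on the dynamic attributes can use an index
CREATE INDEX yearly_data_extra_attrs_idx
    ON yearly_data USING gin (extra_attrs);

-- Find rows whose dynamic attributes contain a given key/value pair
SELECT *
FROM   yearly_data
WHERE  extra_attrs @> 'has_computers=>true';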

If not - if you have a fairly static list of properties, most of which make sense for most of the rows - then there's not really a problem with having them as 60 individual columns. It'll be easier to add indexes for commonly searched-for sets of attributes, including partial and composite indexes, etc., and searches - particularly those on many different attributes - will be much faster.
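
For example (the column names here are illustrative, and has_computers is assumed to be boolean), composite and partial indexes over the one-big-table layout might look like:

-- Composite index for a pair of attributes that are usually searched together
CREATE INDEX yearly_data_boys_computers_idx
    ON yearly_data (number_of_boys, number_of_computers);

-- Partial index covering only the rows a frequent query actually touches
CREATE INDEX yearly_data_with_computers_idx
    ON yearly_data (number_of_computers)
    WHERE has_computers;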

See also: Database design - should I use 30 columns or 1 column with all data in form of JSON/XML?

There's also a compromise option available to you: A main table for the most important details you look up a lot, plus side-tables for logical groupings of attributes. Say:

yearly_summary (
    yearly_summary_id serial primary key,
    school_id integer,
    total_students integer,
    ...
) 

plus

yearly_student_stats(
    yearly_summary_id integer primary key references yearly_summary(yearly_summary_id) on delete cascade,
    ...
)

etc. The integer primary key that's also a foreign key means you have an enforced 1:1 (optional) relationship to the other table. This approach can be useful if you have a few logical groupings of attributes that you can cluster into side-tables.
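
Looking a school up then stays a single, predictable 1:1 join rather than the 20-30 joins feared in the question. A sketch (columns other than those shown above are placeholders):

SELECT s.school_id,
       s.total_students,
       st.*                      -- the optional stats block; NULLs if absent
FROM   yearly_summary s
LEFT   JOIN yearly_student_stats st USING (yearly_summary_id)
WHERE  s.school_id = 42;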

I'd also be surprised if a little more thought didn't reveal things that do make sense to normalize. Do you have year7_blah, year8_blah, year9_blah etc. columns? If so: great candidates for normalization.
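
A sketch of that kind of normalization, assuming hypothetical year7_blah, year8_blah, ... columns: fold the per-year-level columns into one row per year level.

CREATE TABLE yearly_level_stats (
    yearly_summary_id integer NOT NULL
        REFERENCES yearly_summary(yearly_summary_id) ON DELETE CASCADE,
    year_level        integer NOT NULL,   -- 7, 8, 9, ...
    blah              integer,            -- whatever year7_blah held
    PRIMARY KEY (yearly_summary_id, year_level)
);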

Craig Ringer
  • 307,061
  • 76
  • 688
  • 778
  • I am currently testing with many columns; I did some logical grouping. If it works out, I think I'll go with that. And no, the attributes are like `has_computers`, `number_of_computer`, `number_of_boys`, and so on. I'm sure I can divide them into more logical groups, but that's not really normalizing, right? – Bibhas Debnath Jul 28 '13 at 15:14
  • @Bibhas It's not really normalizing, but DB work isn't always about purity, sometimes it's about compromise for the sake of performance, client requirements and real-world implementation limitations. In this case the "pure" way sounds like one big table and if you're not going to be adding attributes all the time or having user defined attributes that's probably how I'd do it. If I found performance issues with that down the track I'd split it up into sub-tables if/when I needed to. – Craig Ringer Jul 28 '13 at 23:18
  • Just asking, are there any advantages to using NoSQL DBMSes in such situations? – Bibhas Debnath Jul 29 '13 at 06:15
  • @Bibhas Sure, and disadvantages to match. Pick your poison. You throw out a lot of functionality to get a simplified feature set with a more dynamic schema, no matter which style you choose. There are jobs it'd be an appropriate choice for, but I struggle to imagine this being one of them. – Craig Ringer Jul 29 '13 at 06:23