
One of the core differences between SQL and the relational algebra is that SQL operates over bags (multisets), whilst the relational algebra operates over sets.

Are there any performance benefits to designing SQL this way? Could a pure relational algebra which operates strictly over sets ever compete with SQL on performance? For which relational operators is ensuring uniqueness of rows expensive?

Performance here means strictly the amount of time it takes to execute queries.
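
To make the question concrete, here is a rough sketch (plain Python, made-up row data, not tied to any particular DBMS) of the extra work that set semantics implies for one operator, union: a bag union is pure concatenation, while a set union has to eliminate duplicates, typically via hashing or sorting.

```python
# Hypothetical rows as tuples; the data and names are illustrative only.
employees_a = [("alice", "sales"), ("bob", "eng"), ("alice", "sales")]
employees_b = [("alice", "sales"), ("carol", "eng")]

def bag_union(r, s):
    # Bag semantics (SQL UNION ALL): simple concatenation, O(|r| + |s|),
    # no per-row comparisons and no extra data structure.
    return r + s

def set_union(r, s):
    # Set semantics (relational algebra): duplicates must be removed,
    # here with a hash set -- extra memory plus a probe per row.
    seen = set()
    out = []
    for row in r + s:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

print(bag_union(employees_a, employees_b))  # 5 rows, duplicate kept
print(set_union(employees_a, employees_b))  # 3 rows, duplicate removed
```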

Nick
  • Sets are better. Ultimately because sets are the right data structure & bags are not sets. [A relation holds the set of rows that meet some criterion.](https://stackoverflow.com/a/23842061/3404097). Specifying a bag of rows instead has no benefit & only impedes both our saying what we want & the DBMS from getting us what we want. (The relational model has nothing to say re implementation/performance per se, but the more the DBMS knows what we want the more it can optimize.) Bags are a mistake. – philipxy May 26 '19 at 17:22
  • Between your title & body you ask 3 yes/no questions. But when the first 2 are yes the third is not & vice versa. So we can't answer yes/no. Also: Likely you don't mean "benefits to implementing SQL this way" but "benefits to designing SQL this way". – philipxy May 26 '19 at 17:22
  • Possible duplicate of [Why does SQL standard allow duplicate rows?](https://stackoverflow.com/questions/30767562/why-does-sql-standard-allow-duplicate-rows) – philipxy May 26 '19 at 17:25
  • @philipxy my question is different? That one is asking why SQL doesn't enforce primary keys on all tables. That's different from operating only on sets, where not only are the tables sets, but all the results you produce with the relational operators are sets too (e.g. unioning the same table twice returns the same set). That's stronger than the rule cited in that question. – Nick May 29 '19 at 19:40
  • @philipxy removing duplicates is an expensive operation, no? On the one hand you can optimize the relational algebra better with sets; on the other you need to check for uniqueness more often. – Nick May 29 '19 at 19:43
  • "allowing duplicate rows" = "operates over bags". Reasons why SQL uses bags vs sets would include performance. PS Your comment is not clear. PS See my comment on the duplicate question re confusions in it. (But overall it is re duplicates not PKs.) – philipxy May 29 '19 at 19:47
  • idk, it should be obvious; just read them. One is asking about performance and the other is asking about the SQL standard, which could have a whole myriad of reasons for doing whatever it has done. I just want to know the performance properties of bags vs sets from the perspective of a database implementer. – Nick May 29 '19 at 19:50
  • "is expensive" says nothing about what choice to make. All engineering is tradeoffs. "better performance" *is about* "a whole myriad of reasons". Moreover since bags & sets give two different system behaviours, "better performance" doesn't even make sense unless you are talking about "performance" for getting stuff done by using a DBMS, which includes the ergonomics of reasoning & coding in sets (the relational model) vs not (SQL). – philipxy May 29 '19 at 20:00
  • Well, if you want to go into these performance reasons/tradeoffs in an answer, that would be very helpful :) – Nick May 30 '19 at 20:16
  • Well... Your edit now makes the answer trivial--obviously, since every set is a bag, for input & queries limited to sets, sets can be implemented at least as quickly as bags. Really, it's not clear what question you want to ask. But if you were clear I still think it's a special case of the duplicate, especially since some confused, vague notion of "performance" wrongly motivated the designers of SQL to not use sets. Indeed the 2 correct answers (nvogel & ErwinSmout) unhelpfully basically say, 'SQL is bags "because" that's what they picked--wrongly'. But do keep trying to pin down your question. – philipxy May 30 '19 at 22:29
  • PS Since if in an alternate implementation of some *fixed* functionality some queries could be faster & some slower, "the amount of time it takes to execute queries" still is not a clear criterion & so you still haven't defined "better performance". (I told you--tradeoffs.) PS I might write an answer to the duplicate, but it's a big essay that comes down to "because of the benefits of the RM (relational model)--and relations are sets". Slightly longer is my first comment, slightly longer is its linked answer, then there is a tree of links from it. But just see any textbook RM introduction. – philipxy May 30 '19 at 22:53
  • The answer isn't trivial. It's not obvious whether a union that preserves uniqueness of entries is going to be as quick as one that doesn't care about it. As I've said before, I don't care what the SQL designers think (please read my question again) and I make no reference to them. Imagine you're implementing a database with RA as the query language. Option A: the RA operators remove duplicates; Option B: they don't. What performance characteristics (time taken, memory use, disk seeks, etc.) are going to differ for each option? – Nick Jun 01 '19 at 15:32
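
Picking up the Option A / Option B framing from the last comment, here is a hedged sketch (plain Python, hypothetical rows) of why projection is the classic operator where enforcing uniqueness costs extra: dropping columns can create duplicates that a set-based engine must then remove, whereas selection can never introduce duplicates into a set input, so it pays nothing extra.

```python
# Illustrative only: projection under bag vs set semantics.
orders = [(1, "alice", "book"), (2, "alice", "book"), (3, "bob", "pen")]

def project_bag(rows, cols):
    # Option B: keep duplicates -- a single pass, no extra state.
    return [tuple(row[c] for c in cols) for row in rows]

def project_set(rows, cols):
    # Option A: enforce uniqueness -- needs a hash table (or a sort)
    # over the projected rows, costing extra memory and comparisons.
    return list(dict.fromkeys(tuple(row[c] for c in cols) for row in rows))

print(project_bag(orders, (1, 2)))  # [('alice', 'book'), ('alice', 'book'), ('bob', 'pen')]
print(project_set(orders, (1, 2)))  # [('alice', 'book'), ('bob', 'pen')]

def select(rows, pred):
    # Selection only filters existing rows, so if the input is a set
    # the output is already a set -- no deduplication step required.
    return [row for row in rows if pred(row)]
```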

0 Answers