How do I "link" two tables when the primary key of table_a could exist in none, one or many columns in table_b?

Question

How do I link two tables, People and Messages, when there is no single column in the Messages table that corresponds to the primary key of the People table?

I have two tables: one containing email messages called Messages and one of people called People.

People

| person_id |    email_address   |
| --------  |  ----------------  |
| 001       | `billy@gmail.com`  |
| 002       | `bobby@gmail.com`  |
| 003       | `gregg@gmail.com`  |
| 004       | `janet@gmail.com`  |
| 005       | `boss1@gmail.com`  |

Messages

| msg_id |    from_addr      |   to_addr                            |     cc_addr       |   bcc_addr        | 
| ------ | ----------------- | ------------------------------------ | ----------------- |  -----------      |
| 001    | `billy@gmail.com` | `gregg@gmail.com`, `david@gmail.com` | `boss1@gmail.com` | NULL              | 
| 002    | `bobby@gmail.com` | `gregg@gmail.com`                    | NULL              | NULL              |
| 003    | `janet@gmail.com` | `boss1@gmail.com`                    | `bobby@gmail.com` | `gregg@gmail.com` |

I will be programmatically passed a person_id (say 005 for boss1@gmail.com) and want to be able to query the Messages dataset to get all "involved" emails--where a boss1@gmail.com was sent, received, was cc'd or bcc'd in the email message. So with the dataset above, the query ought to return the rows with msg_id 001 and 003.

I don't know to map the relation between the People and Messages tables.

Say the primary key of of the People table is person_id. There is no single column that I can link that to in the Messages table to act as a foreign key.

I cannot determine how to represent this data correctly.

Am I doing this wrong? What is a better way to represent these two tables with more tables structured differently?

After normalising to FNF the data can be represented as follows.

Messages_FNF

| msg_id |    from_addr      |      to_addr      |     cc_addr       |   bcc_addr        | 
| ------ | ----------------- | ----------------- | ----------------- |  ---------------- |
| 001    | `billy@gmail.com` | `gregg@gmail.com` | `boss1@gmail.com` |  NULL             | 
| 001    | `billy@gmail.com` | `david@gmail.com` | `boss1@gmail.com` |  NULL             | 
| 002    | `bobby@gmail.com` | `gregg@gmail.com` |  NULL             |  NULL             |
| 003    | `janet@gmail.com` | `boss1@gmail.com` | `bobby@gmail.com` | `gregg@gmail.com` |

However, the primary key (that uniquely identifies each row in the table) for Messages_FNF would have to be a composite key combining msg_id, from_addr, to_addr, cc_addr, bcc_addr.

So how do I link the People and Messages tables together using Primary and Foreign keys?

Or must I further normalise or partition these tables such that the People table and Messages_FNF have a meaningful matching field - allowing for easy joins? If so, how?

I've read that the cross join lets you join tables that do not have a common field but I worry that it will result in a VERY large table with unnecessary rows and be computationally expensive.

Is linking the tables with a PK->FK the right way of structuring the data in this problem?

What is your 1 specific researched non-duplicate question? [ask] [Help] [mre] — philipxy, Mar 02 '23 at 05:05
"Am I doing this wrong?" Yes: specifically `Messages` has repeating fields `to_addr`, `cc_addr` -- so it isn't a 'table'. You need first to restructure `Messages` to (at least) 'First Normal Form' -- which'll need 'vertical partitioning'. — AntC, Mar 02 '23 at 07:04
Again: What is your 1 specific researched non-duplicate question? PS Please avoid social & meta commentary in posts. Please don't insert "EDIT"s/"UPDATE"s, just make your post the best presentation as of edit time. Please don't **Yell**. PS Yes or no questions are almost always poor questions & also are almost always not asking what is actually intended. PS It is unhelpful & unclear to ask 'A or B' when A & B are not mutually exclusive and/or not the only options. (And you may think they are when they are not.) — philipxy, Mar 02 '23 at 11:11
Thanks or the advice @philipxy. Here is some relevant research for the question: Microsoft:ERD of an Email App https://social.msdn.microsoft.com/Forums/sqlserver/en-US/8806881d-1091-4e49-905f-fe25a9268e63/erd-of-an-email-app?forum=transactsql ReserchGate - ERD Diagram from Ayat Bu-Suhail et al. 2019 https://www.researchgate.net/figure/Entity-Relationship-Diagram_fig3_335541387 SO: "Best practice to store array-like data in MySQL or similar database?" ^Not helpful as arrays appear to be causing problems when searching. See Zegarek's question-thread for details. — hkh, Mar 02 '23 at 11:31
Please clarify via edits, not comments. PS Summarize & quote with credit what is relevant to asking your question & relate it to the rest of your presentation, don't expect readers to read entire other pages or figure out what is relevant ther & how it is relevant. Please act on all the feedback. PS Especially: 1 question. And not a bunch of wonderings, worryings, etc. This post is way too unfocused & unclear. — philipxy, Mar 02 '23 at 11:41
Does this answer your question? [Is storing a delimited list in a database column really that bad?](https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad) — philipxy, Aug 25 '23 at 13:22

score 0 · Answer 1 · answered Mar 01 '23 at 19:09

0

The Email_Address on People Table is same as From_Address on Messages Table, correct? If it is correct, you can simply join on these two fields. To join the tables, you don't necessarily join on only primary key. You can join on any related fields.

answered Mar 01 '23 at 19:09

Cody

21
3

1

No, not correct: OP wants to link to _any_ of `from_addr` or `to_addr` (repeating field) or `cc_addr` (repeating field). Because `Messages` has repeating fields, it is not a relation, so incorrect to describe it as a 'table'. – AntC Mar 02 '23 at 06:59

Zegarek · Answer 2 · 2023-03-02T09:17:57.390

-1

You can compare the e-mail from your target people row to ANY e-mail in an array combining the e-mail addresses in the messages table: online demo

select  m.* 
from    messages as m 
   join people   as p 
   on p.email_address = any(m.from_addr||
                            m.to_addr  ||
                            m.cc_addr  ||
                            m.bcc_addr ) 
where p.person_id='005';

The first || is anycompatible||anycompatiblearray operator that adds an element to array, and the rest of || are anycompatiblearray||anycompatiblearray that merge arrays. You end up with one array, and text = any( text[] ) lets you check if your e-mail address is in there.

An intermediate table between messages and people could hold an attribute saying how they're related (from/to/cc/bcc), plus a single address. This would be easier to work with and index: demo

It might be worth looking into other ways to store an e-mail address instead of a plain text type.

edited Mar 02 '23 at 09:17

answered Mar 01 '23 at 19:37

Zegarek

6,424
1
13
24

`to_addr`, `cc_addr`, etc have repeating fields, so a bare `=` test won't work. – AntC Mar 02 '23 at 07:02
Suppose `People.email_address` contains `"bob@gmail.com"` and `Messages.to_addr` includes `"billybob@gmail.com"`. Because you're doing a (sub-)string search, not a proper whole-field comparison, `any` will return a false positive. – AntC Mar 02 '23 at 10:55
... Or suppose one field ending `"someone@gmail.com"` concatenated to a field beginning `"abel@gmail.com"`. Now you're going to get a false positive match for `"mabel@gmail.com"`. – AntC Mar 02 '23 at 11:24
@AntC Here's a demo showing that an array with `billybob@gmail.com` in it is *not* matched with `bob@gmail.com`: [demo](https://dbfiddle.uk/nwj8FAHS). Again, you seem to be assuming it's text-to-text concatenation, then some sort of text-to-concatenated-text matching. Meanwhile it's text **array** concatenation, merging arrays of text values without affecting the values themselves, then text-to-text one-to-many comparison using the `ANY` operator. – Zegarek Mar 02 '23 at 11:43
Then I apologise: I'm trying to guide the OP towards using conventional techniques. I'd call `array`s an advanced feature not supported by all vendors. If arrays are allowed within scope of the question (it doesn't say so), your approach will work. It won't scale to more complex 'repeating groups'. Normalisation is the discipline I'd expect this kind of question to be looking for. – AntC Mar 02 '23 at 12:12

AntC · Accepted Answer · 2023-03-03T06:18:03.540

To put OP out of their misunderstanding on the main point:

So how do I link the People and Messages tables together using Primary and Foreign keys? ...

I've read that the cross join lets you join tables that do not have a common field ...

To "link tables together" does not need going via keys -- not necessarily any key in any table. The most common linkages in a 1:many join are from Foreign key in the many side to Primary key in the 1 side. That's why those relationships carry those names.

That Primary, Foreign key business is a result of 'normalising' the data model, aimed at exactly avoiding 'update anomalies' and making joining straightforward.

Joins going via non-key fields are sometimes call 'cross joins' to distinguish them -- but they're really all just joins, and you can use sql keyword join ... on ....

OP has now revised the q to show Messages_FNF. To repeat: no need to go via keys, so you could join People.email_address to each/any of the three fields holding email addresses.

Whilst this design is in 1NF (under a very generous interpretation), it is hardly a solution with all that content repeated, including repeating a load of useless NULLs. You'll suffer update anomalies (see the wikipage); you'll also indeed get a blowup in numbers of rows.

You need to vertically split off the columns that have repeating (or nullable) fields. Retain from_addr in the Messages table (because it must have exactly one from address). Create three extra tables each with a compound key of {message_id, to_addr} etc, in which message_id is a Foreign key to (REFERENCES) Messages.

Now your sql can't use join ... on ... because you're looking for a match in any of the four tables, not all. You'll need a disjunctive where ... = ... or ... = ... or ... query. Exercise left for the reader: that's a straightforward piece of sql.

Addit: in response to comment

Each of these three table will have rows upon rows of email addresses that already exist in the People table.

Do they already exist in People table? You didn't say that in the original q. I get messages from all sorts of people I've never emailed before; I certainly don't know who else might get bcc'd on messages I receive.

So if you have a business rule that anybody's email address appearing anywhere on any message is to automatically get an entry in the People table, yes your auxiliary tables could carry msg_id, person_id. You'll still have "rows upon rows" of person_id.

You'll still have to query all those Messages auxiliary datasets to search for person_id; and it still won't be a whole Primary key of any table. (You seem to be worried about efficiency/effort: I'd guess not much saving, on modern DBMSs.)

Unless i've misunderstood your solution... If I vertically split off the columns that have repeating fields, won't that give me three copies of essentially the same table? E.g. the `Message_To`table will contain the fields `msg_id` and `to_addr`. The `Message_Cc` will contain the fields `msg_id` and `cc_addr`, similarly for the third `Message_Bcc` table. Each of these three table will have rows upon rows of email addresses that already exist in the `People` table. Is that not another form of redundancy or is this intentional to avoid looping back to the `People` table? — hkh, Mar 02 '23 at 17:24

How do I "link" two tables when the primary key of table_a could exist in none, one or many columns in table_b?

3 Answers3