1

I'm trying to build a searchable database of the languages spoken by my users.

For example I might have

$john = array("english", "french", "spanish");
$jack = array("french", "spanish");
$jill = array("english", "spanish");

And I want to save them to a MySQL database so that I can later run something along the lines of (pseudo code)

SELECT * FROM users WHERE spoken_languages = "french" and "spanish"

I know if I had speaks_english, speaks_french, and speaks_spanish columns then I could search for

SELECT * FROM users WHERE speaks_french = "true" and speaks_spanish = "true"

But the scalability of adding a new column every time I encounter a new language is not very good. I've considered a table such as

john | english
john | french
john | spanish
jack | french
jack | spanish
jill | english
jill | spanish

Because at least then to get back languages spoken by a user I could just run

SELECT * FROM spoken_languages WHERE user = "jack"

But in order to search this for people that speak both french and spanish I would need to query for all of the users that speak french, all of the users that speak spanish and then calculate the intersections. That seems horribly inefficient.

So I ask you, how can I save this array of spoken languages so that I can search the database later without destroying the server?

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
Nick Chapman
  • 4,402
  • 1
  • 27
  • 41
  • 2
    your suggested table layout is the recommended one. you could tweak it to use user_id and language_id to make it smaller –  Jun 11 '14 at 21:23
  • 4
    A user table, a language table and a user / language table walk into a bar.... and a many to many relationship ensues. – Jay Blanchard Jun 11 '14 at 21:24
  • 1
    Use the structure you considered, databases are good at doing the intersection thing as long as you know enough SQL – RiggsFolly Jun 11 '14 at 21:25
  • @RiggsFolly can you provide me with a sample? I'm not that great at SQL – Nick Chapman Jun 11 '14 at 21:25
  • Read up on many-to-many relationships. You'll end up with a `users` table, `languages` table, and a `users_languages` relationship table. – jszobody Jun 11 '14 at 21:30
  • Here's an example: http://stackoverflow.com/questions/12461194/filter-using-a-many-to-many-table-mysql – jeroen Jun 11 '14 at 21:31
  • my demo http://sqlfiddle.com/#!2/23824f/1 –  Jun 11 '14 at 21:39
  • @NickChapman There's no "search" done, you are assuming an implementation. The DBMS finds the rows that match the criteria you give, that's its business. – philipxy Jun 12 '14 at 07:14
  • 1
    @JayBlanchard No, no, wait-- A user, a language and a DBMS walk into a bar... ...blah blah blah blah... ...and the bartender says, "...from DIFFERENT FIELDS but the SAME TABLE would GET INTO A ROW!" – philipxy Jun 12 '14 at 07:23

3 Answers3

5

You have the right solution in your question, with the person_language table that looks like this

john | english
john | french
jack | spanish

You can query it like so.

SELECT person
  FROM person_language
 WHERE language IN ( 'english', 'spanish')
 GROUP BY person
HAVING COUNT(*) = 2

Put an index on (language, person) and this will scale up fine.

If you want everybody who speaks Spanish and at least one other language you can do this.

 SELECT a.person
   FROM person_language AS a
   JOIN ( SELECT person
            FROM person_language
           GROUP BY person
          HAVING COUNT(*) >= 2
        ) AS b ON a.person = b.person
  WHERE a.language = 'spanish'

This uses a JOIN directive to take the intersection of the people who speak spanish with the people who speak two or more languages.

O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • Unrelated question, but would there be any significant performance differences if this query was written with the WHERE clause as `WHERE language = 'english' || language = 'spanish'`? – echochamber Jun 11 '14 at 21:42
  • would not return the results required that way –  Jun 11 '14 at 21:43
  • @echochamber I believe that's the way MySQL runs the query. Either way, use explain on both queries to see any differences. – Joel Mellon Jun 11 '14 at 21:43
  • ...and I'm most likely wrong: http://stackoverflow.com/questions/782915/mysql-or-vs-in-performance – Joel Mellon Jun 11 '14 at 21:49
3

You can do this query in an efficient way with a self-join:

SELECT * FROM users u1
JOIN users u2 USING (user)
WHERE (u1.lang, u2.lang) = ('french', 'spanish')

See examples of relational division solutions in my presentation, SQL Query Patterns, Optimized.

If you have the right indexes, my tests show this self-join solution is about 20 times faster than the GROUP BY solution.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
0

I'd go with the three table setup like so

CREATE TABLE languages
(
  `id` int not null auto_increment primary key, 
  `language` varchar(32) unique
);
CREATE TABLE users
(
  `id` int not null auto_increment primary key,
  `name` varchar(32)
);
CREATE TABLE user_language
(
  `user_id` int, 
  `language_id` int,
  primary key (user_id, language_id)
);

IMHO if you're not having millions of users and all possible languages and looking for flexibility and not fighting for miliseconds, especially if you check for more than 2 languages at a time, you can achieve it by using MAX() or SUM() aggregates in HAVING clause.

Here some sample queries for you:

-- Speaks both French AND Spanish
SELECT u.name
  FROM user_language ul JOIN languages l
    ON ul.language_id = l.id JOIN users u
    ON ul.user_id = u.id
 GROUP BY u.id
HAVING MAX(l.language = 'french') = 1
   AND MAX(l.language = 'spanish') = 1;

Output:

| NAME |
|------|
| John |
| Jack |
-- Speaks both French OR Spanish
SELECT u.name
  FROM user_language ul JOIN languages l
    ON ul.language_id = l.id JOIN users u
    ON ul.user_id = u.id
 GROUP BY u.id
HAVING MAX(l.language = 'french') +
       MAX(l.language = 'spanish') > 0;

Output:

| NAME |
|------|
| John |
| Jack |
| Jill |
-- Speaks any language French OR Spanish BUT NOT English
SELECT u.name
  FROM user_language ul JOIN languages l
    ON ul.language_id = l.id JOIN users u
    ON ul.user_id = u.id
 GROUP BY u.id
HAVING MAX(l.language = 'french') +
       MAX(l.language = 'spanish') > 0
   AND MAX(l.language = 'english') = 0;

Output:

| NAME |
|------|
| Jack |
-- Speaks any language but English
SELECT u.name
  FROM user_language ul JOIN languages l
    ON ul.language_id = l.id JOIN users u
    ON ul.user_id = u.id
 GROUP BY u.id
HAVING MAX(l.language = 'english') = 0;

Output:

| NAME |
|------|
| Jack |
-- What languages does Jack speak
SELECT l.language
  FROM user_language ul JOIN languages l
    ON ul.language_id = l.id JOIN users u
    ON ul.user_id = u.id
 WHERE u.name = 'Jack';

Output:

| LANGUAGE |
|----------|
|   french |
|  spanish |
-- How many languages do users speak
SELECT u.name, COUNT(*) no_of_languages
  FROM users u LEFT JOIN user_language ul
    ON u.id = ul.user_id
 GROUP BY u.id;

Output:

| NAME | NO_OF_LANGUAGES |
|------|-----------------|
| John |               3 |
| Jack |               2 |
| Jill |               2 |
-- How many users do speak a particular language
SELECT l.language, COUNT(*) no_of_users
  FROM languages l LEFT JOIN user_language ul
    ON l.id = ul.language_id
 GROUP BY l.id;

Output:

| LANGUAGE | NO_OF_USERS |
|----------|-------------|
|  english |           2 |
|   french |           2 |
|  spanish |           3 |

Now in a real application you most likely won't be using language or user names but rather deal with ids that are coming from your UI (drop down boxes or whatever). Therefore you'll be able to eliminate one join from the equation and your query would look rather like this

-- Speaks both French AND Spanish with Ids
SELECT u.name
  FROM user_language ul JOIN users u
    ON ul.user_id = u.id
 GROUP BY u.id
HAVING MAX(ul.language_id = 2) = 1
   AND MAX(ul.language_id = 3) = 1;

Here is SQLFiddle demo

peterm
  • 91,357
  • 15
  • 148
  • 157