2

First of all I'm sorry for the overly vague title, however I'm unfamiliar with the proper terminology for a problem like this.

I'm attempting to retrieve a list of page titles from Wiktionary (Wikimedia wiki-based dictionary) where the page must be categorized under English_lemmas, but must not be categorized under English_phrases, English_slang, and English_%_forms (where % is the wildcard).

The two necessary tables are page which contains page information (page_id, page_title), and categorylinks which contains the categories that each page is categorized under.

The relevant structure of page is like so:

+---------+------------------+
| page_id | page_title       |
+---------+------------------+
| 0       | race             |
| 1       | racing           |
| 2       | that's all folks |
| 3       | fire             |
| 4       | psychédélique    |
+---------+------------------+

and the relevant structure of categorylinks is like so: (I've added line breaks for easier reading)

+---------+-------------------------+
| cl_from | cl_to                   |
+---------+-------------------------+
| 0       | English_lemmas          |
| 0       | English_verbs           |

| 1       | English_lemmas          |
| 1       | English_verbs           |
| 1       | English_non-lemma_forms |

| 2       | English_lemmas          |
| 2       | English_phrases         |

| 3       | English_lemmas          |
| 3       | English_nouns           |

| 4       | French_lemmas           |
| 4       | French_adjectives       |
+---------+-------------------------+

where categorylinks.cl_from is a direct reference to page.page_id and categorylinks.cl_to is the title of the category.

I need to select race and fire as they are both categorized under English_lemmas, but not racing or that's all folks as in addition to both being categorized under English_lemmas they are also categorized under English_%_forms and English_phrases respectively, and not psychédélique as it is not categorized under English_lemmas.

Therefore the ideal result would be

+---------+------------------+
| page_id | page_title       |
+---------+------------------+
| 0       | race             |
| 3       | fire             |
+---------+------------------+

Is this something that is feasible to achieve efficiently? And if so how can I achieve this?

I have a fairly basic understanding of SQL (basic SELECT, UPDATE, etc statements), so something like this is far beyond my comprehension.

Prime
  • 2,410
  • 1
  • 20
  • 35
  • For some perspective: `categorylinks` contains 21,868,252 rows, 494,550 of which are `WHERE cl_to='English_lemmas'` – Prime Dec 11 '17 at 17:59

1 Answers1

2

If I understood your requirements correctly:

select *
  from page
  where page_id not in (select cl_from from categorylinks
                         where cl_to like 'English_%_forms'
                            or cl_to like 'English_phrases')
    and page_id in (select cl_from from categorylinks
                      where cl_to like 'English_lemmas')
tonypdmtr
  • 3,037
  • 2
  • 17
  • 29
  • Is it possible to specify that there needs to be a row where `cl_to = 'English_lemmas'` for the page? as otherwise it will select pages from different languages that are categorized under, say, `French_lemmas`. I should have expanded on that in the example. – Prime Dec 11 '17 at 19:32
  • I think my corrected and updated answer should do what you want. – tonypdmtr Dec 11 '17 at 20:05
  • Not only is it perfect but it seems to work well on the large table sizes. Thank you very much! – Prime Dec 11 '17 at 20:09
  • 1
    You only really need two selects: `SELECT page.* FROM page JOIN categorylinks ON page_id = cl_from WHERE cl_to = 'English_lemmas' AND page_id NOT IN (SELECT cl_from FROM categorylinks WHERE cl_to like 'English_%_forms' OR cl_to LIKE 'English_phrases');`. Maybe the query optimizer can figure that out but I wouldn't bet on it. – Tgr Dec 14 '17 at 00:16
  • You're right. I think it would run a bit faster too. – tonypdmtr Dec 14 '17 at 12:36