0

Here is my current database structure:

CREATE TABLE `books` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(100) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  `year` year(4) NOT NULL DEFAULT '0000',
  `author` varchar(100) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `title` (`title`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1;

CREATE TABLE `chapters` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `book_id` int(10) unsigned NOT NULL DEFAULT '0',
  `number` int(10) unsigned NOT NULL DEFAULT '0',
  `title` varchar(100) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `book_id` (`book_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1;

ALTER TABLE `chapters`
  ADD CONSTRAINT `chapters_ibfk_1` FOREIGN KEY (`book_id`) REFERENCES `books` (`id`) ON DELETE CASCADE ON UPDATE CASCADE;

CREATE TABLE `pages` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `book_id` int(10) unsigned NOT NULL DEFAULT '0',
  `chapter_id` int(10) unsigned NOT NULL DEFAULT '0',
  `number` int(10) unsigned NOT NULL DEFAULT '0',
  `text` text COLLATE utf8_unicode_ci NOT NULL,
  `words` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `book_id` (`book_id`),
  KEY `chapter_id` (`chapter_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1;

ALTER TABLE `pages`
  ADD CONSTRAINT `pages_ibfk_1` FOREIGN KEY (`book_id`) REFERENCES `books` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
  ADD CONSTRAINT `pages_ibfk_2` FOREIGN KEY (`chapter_id`) REFERENCES `chapters` (`id`) ON DELETE CASCADE ON UPDATE CASCADE;

The structure is quite simple... basically I'm extracting book text page by page and storing everything into my database, which is organized into a book>chapter>page system. I tried to make it as flexible as possible so I can easily aggregate data on the point of view of the whole book or by chapter... but if you think I could have made something better I'm open to any suggestion!

Now, I would like to allow users to perform keyword searches inside the books... so that they could search for all the occurrences of a single word, or even a phrase, inside the book they choose from a dropdown.

My web server is not located on the same machine that stores the MySQL database (technical issue that I cannot get rid of in the short run)... so in order to avoid huge data traffic between the two machines I would prefer to run the text searches through SQL queries. Retrieving all the pages and analyzing them with PHP would translate into 5-10 Mb of data every time.

Now my questions are:

  • Is it possible to perform this kind of process using only query commands (LIKE, MATCH, REPLACE, etc...)?
  • I would to obtain results formatted by page in the following way: [page 1 | 0 occurrences], [page 2 | 1 occurrence], [page 3 | 1 occurrence], [page 4 | 2 occurrences]... is that possible?
  • Do you think it would be a good idea to strip spacing characters (line breaks, tabs and such) and punctuation characters from the pages text before storing it into the field pages.text?

Thanks for your help!

Tommaso Belluzzo
  • 23,232
  • 8
  • 74
  • 98

2 Answers2

1

Try this:

SELECT p.number, ROUND (   
        (
            LENGTH(p.text)
            - LENGTH( REPLACE ( p.text, "{your string here}", "") ) 
        ) / LENGTH("{your string here}")        
    ) AS count  
FROM pages p
JOIN
chapters c ON c.id = p.chapter_id
JOIN
books b on b.id = c.book_id
WHERE b.title = {your book title}
GROUP BY p.number

-- solution inspired by:
-- http://stackoverflow.com/questions/12344795/count-the-number-of-occurences-of-a-string-in-a-varchar-field

You could strip whitespace but you may run into a couple of problems:

  • You would need to strip your search queries as well (not really a problem, just extra work)

  • More importantly, if you ever want to show the user the full text of the page, or even just a excerpt, you wouldn't know where the whitespace characters were in the first place.

Jay S.
  • 1,318
  • 11
  • 29
  • I'm never going to display the text of a page in any circumstance... anyway I was not talking about whitespaces... only carriage returns, tabs and such... stripping whitespace makes no sense in any case. Because if I preserve line breaks I'm pretty sure that a few searches are not gonna be found even if they should. – Tommaso Belluzzo Dec 31 '15 at 18:56
  • I see what you're saying. In that case, I think it would be ok if you stripped newlines. Did that query work? @Zarathos – Jay S. Dec 31 '15 at 18:58
  • Gonna test soon. Also... is there a way to check for word boundaries like if it was a regex? – Tommaso Belluzzo Dec 31 '15 at 19:01
  • Word boundaries where? @Zarathos – Jay S. Dec 31 '15 at 19:15
  • Well for example if someone searches "here I am" that function would also return matches like "there I am"... and I would like to add an option to go for exact matches instead of partial. – Tommaso Belluzzo Dec 31 '15 at 19:17
  • I see. Well you can use regex with mysql. See the docs `http://dev.mysql.com/doc/refman/5.7/en/regexp.html#operator_regexp` @Zarathos – Jay S. Dec 31 '15 at 19:31
1

Don't worry about MySQL and the app being on different machines. (Of course, you need to use TCP, not "localhost".) The "big players" split like that.

Have FULLTEXT(text) on pages, then say

FROM pages AS p
JOIN book AS b ON ...
WHERE MATCH (p.text) AGAINST (...)
  AND b.id = ...

Other notes:

With a suitable GROUP BY, you can come close to page + count, like you requested.

Do you really want to say "0 occurrences for some page? What if the 'book' is 500 pages long; do you really want 500 lines of output?

Note the limitations of FULLTEXT searches (words only, stemming, minimum word size, "stop" words, etc) and deal with them if the user fails to follow such.

Sometimes I do this: If the user enters a * or % in the query, I use REGEXP or LIKE instead of FULLTEXT, and warn the user that it will be slower.

The FULLTEXT index will be used first when MATCH is present, then it will filter on book. With LIKE or REGEXP, it will filter on book_id before doing the tedious search. Note: This implies "constructing" the query dynamically in PHP (or whatever language).

Doing all (or most) of the work leads to much less network traffic, which was one of your concerns.

No "stripping" needed. FULLTEXT needs the word boundaries.

Count per page:

SELECT p.number AS 'PageNumber',
       COUNT(*) AS 'Occurrences'
    ...
    GROUP BY p.id
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Again, for what concerns stripping I was only thinking about stripping \n, \t, quotes, dots, colons, semicolons, 3 dots, etc... so I only have words. Let's say I insert a Tolkien's book into the database... and people search "fly, you fools", but in my database there is a line break such that the phrase is splitted like so "fly,\nyou fools"... will FULLTEXT find it anyway? And again, let's say people search "boot" and only wants the exact march and not also "boots"... how should I change my query? – Tommaso Belluzzo Jan 01 '16 at 01:04
  • About 0 occurrences well no... I can skip that... I will just assume that the rows not returning a match have no occurrences so I'll initialize my array with a zero-fill and only insert values where matches are found. But it's still not clear how I can find return the number of matches for each page in your answer. – Tommaso Belluzzo Jan 01 '16 at 01:07