1

A while ago I posted this question asking if it's possible to convert text to HTML links if they match a list of terms from my database.

I have a fairly huge list of terms - around 6000.

The accepted answer on that question was superb, but having never used XPath, I was at a loss when problems started occurring. At one point, after fiddling with the code, I somehow managed to add over 40,000 random characters to our database - the majority of which required manual removal. Since then I've lost faith in that idea, and the simpler PHP solutions weren't efficient enough to deal with the amount of data and the number of terms.

My next attempt at a solution is to write a JS script which, once the page has loaded, retrieves the terms and matches them against the text on a page.

This answer has an idea which I'd like to attempt.

I would use AJAX to retrieve the terms from the database, to build an object such as this:

var words = [
    {
        word: 'Something',
        link: 'http://www.something.com'
    },
    {
        word: 'Something Else',
        link: 'http://www.something.com/else'
    }
];

When the object has been built, I'd use this kind of code:

//for each array element
$.each(words,
    function() {
        //store it ("this" is gonna become the dom element in the next function)
        var search = this;
        $('.message').each(
            function() {
                //if it's exactly the same
                if ($(this).text() === search.word) {
                    //do your magic tricks
                    $(this).html('<a href="' + search.link + '">' + search.link + '</a>');
                }
            }
        );
    }
);

Now, at first sight, there is a major issue here: with 6,000 terms, will this code be anywhere near efficient enough to do what I'm trying to do?

One option would be to move some of the work into the PHP script that the AJAX call talks to. For instance, I could send the ID of the post, and the PHP script could use SQL statements to retrieve the post's content and match it against all 6,000 terms. The response to the JavaScript would then contain only the matching terms (around 50 at most), which would significantly reduce the number of comparisons the above jQuery has to make.
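To illustrate the pre-filtering idea, here is a minimal sketch of the matching step. In the scenario above this would live in the PHP script; it is written in JavaScript here only to stay in the same language as the code above. The name filterMatchingTerms and the 'Unrelated' entry are made up for the example, and plain case-insensitive substring matching is an assumption, not a requirement.

```javascript
// Hypothetical helper: returns only the entries of `words` whose term
// occurs somewhere in `postText` (case-insensitive substring match).
function filterMatchingTerms(postText, words) {
    "use strict";
    var haystack = postText.toLowerCase();
    return words.filter(function (entry) {
        return haystack.indexOf(entry.word.toLowerCase()) !== -1;
    });
}

var words = [
    { word: 'Something', link: 'http://www.something.com' },
    { word: 'Something Else', link: 'http://www.something.com/else' },
    { word: 'Unrelated', link: 'http://www.example.com/unrelated' }
];

// Only the 'Something' entry occurs in this text, so only that
// entry would need to be sent to the browser.
var matches = filterMatchingTerms('I found something today.', words);
```

With a filter like this, the browser-side loop only ever sees the handful of terms that actually appear in the post rather than the full list.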

I have no problem with the script taking a few seconds to "load" on the user's browser, as long as it isn't impacting their CPU usage or anything like that.

So, two questions in one:

  • Can I make this work?
  • What steps can I take to make it as efficient as possible?

Thanks in advance,

turbonerd

5 Answers

2

You can cache the result on insert.

Basically, when someone inserts a new post, instead of just inserting it into the DB, you also run your replace process.

If your posts are stored like this in the DB

Table: Posts
id        post
102       "Google is a search engine"

You can create another table

Table: cached_Posts
id       post_id   date_generated   cached_post
1        102       2012-10-10       "<a href="http://google.com">Google</a> is a search engine"

When you retrieve the post, you check if it exists first in the cached_Posts table.

The reason to keep the original is that, down the road, you might add a new keyword to replace. All you will have to do then is rebuild your cache.

By doing it this way, no client-side JS is required, and you will only have to do it once per post, so your results should come up pretty quick.
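As a rough sketch of that lookup logic, assuming in-memory objects standing in for the two tables: the names getPost, clearCache and linkTerms are made up for the example, and a real setup would run database queries in PHP instead.

```javascript
// In-memory stand-ins for the Posts and cached_Posts tables
// (assumption: a real setup would query the database instead).
var posts = { 102: "Google is a search engine" };
var cachedPosts = {};
var renderCount = 0; // only here to show how often the replace step runs

// Hypothetical, deliberately naive replace step for a single term.
function linkTerms(text) {
    return text.replace("Google", '<a href="http://google.com">Google</a>');
}

// Return the cached version if there is one; otherwise build and store it.
function getPost(postId) {
    if (!Object.prototype.hasOwnProperty.call(cachedPosts, postId)) {
        renderCount += 1;
        cachedPosts[postId] = {
            date_generated: new Date(),
            cached_post: linkTerms(posts[postId])
        };
    }
    return cachedPosts[postId].cached_post;
}

// Call this when a keyword is added or changed; the originals are untouched.
function clearCache() {
    cachedPosts = {};
}
```

Calling getPost(102) twice runs the replace step only once; after clearCache() the next call regenerates the entry from the untouched original.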

Maktouch
  • Thanks Aktee. Sorry to be a pain but the "message" terminology is making this situation difficult for me to envisage. I have "terms" (i.e. species name of a fish, or a glossary term) and I have "posts" which I'd like to search content for (often up to 10,000 characters) and replace the "terms" with links. Could you update your post using this terminology to make sure I'm grasping it correctly? – turbonerd Oct 11 '12 at 14:37
  • Hmmm interesting. That might be quite complicated to do on my WP installation unfortunately. Thanks though Aktee, I appreciate your input. – turbonerd Oct 11 '12 at 14:52
  • ahhhh you're using WP. There's caching solutions available for WP! – Maktouch Oct 11 '12 at 15:00
  • Yes, there are, and I'm using one of the best available. However, it isn't quite as flexible as the solution you've suggested *and* even using caching I couldn't get the XPath PHP solution working quickly enough. Actually it wasn't even a speed thing, it was a memory issue. – turbonerd Oct 11 '12 at 15:05
1

As invertedSpear says, you shouldn't necessarily give up on PHP just because you haven't been able to make it work. A Javascript solution, whilst relieving the load on your server, may well end up seeming slower to the end-user. You can also cache the result of a server-side solution, which you can't really do client-side.

With that said, these are my thoughts on your Javascript. I've not attempted anything like this myself so I can't comment on whether you can make it work but there are a couple of things which I can see as potentially being problematic:

  1. jQuery's $.each() function, whilst very useful, is not very efficient. Try running this benchmark and you'll see what I mean: http://jsperf.com/jquery-each-vs-for-loops/9

  2. If you run $('.message') on each iteration of the loop, you will potentially be doing a lot of fairly expensive DOM traversal. If possible, cache the result of this operation in a variable before you start looping over your words.

  3. Are you relying on each instance of your 'search' text being encapsulated by whatever element has the class message and having no other text surrounding it? Because that's what your if ($(this).text() === search.word) { line implies. In your other question you seemed to suggest that you'd have more text surrounding the terms you want to replace, in which case you'll probably need to look at regexes to perform the replacement. You'll also need to make sure the text isn't contained within an <a> tag.
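To make the regex suggestion concrete, here is a minimal sketch that replaces terms inside a longer piece of plain text in a single pass. The function names are made up, and it makes several assumptions: it works on a plain string rather than on DOM nodes (so skipping text inside <a> tags is not handled), \b word boundaries only behave well for ASCII terms, and its performance with 6,000 terms is untested.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
    "use strict";
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wraps every whole-word occurrence of a term in a link.
function linkTermsInText(text, words) {
    "use strict";
    var byWord = {}, sorted, pattern;
    if (words.length === 0) {
        return text;
    }
    words.forEach(function (entry) {
        byWord[entry.word.toLowerCase()] = entry.link;
    });
    // Longest terms first, so "Something Else" wins over "Something".
    sorted = words.map(function (entry) {
        return entry.word;
    }).sort(function (a, b) {
        return b.length - a.length;
    });
    // One combined, case-insensitive pattern instead of one pass per term.
    pattern = new RegExp("\\b(" + sorted.map(escapeRegExp).join("|") + ")\\b", "gi");
    return text.replace(pattern, function (found) {
        return '<a href="' + byWord[found.toLowerCase()] + '">' + found + '</a>';
    });
}

var glossary = [
    { word: 'Something', link: 'http://www.something.com' },
    { word: 'Something Else', link: 'http://www.something.com/else' }
];
var linked = linkTermsInText('Try Something Else or something.', glossary);
```

In a page, a function like this would be applied to text nodes that are not inside an anchor, with the $('.message') selection done once outside the loop.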

josno
1

Here's something relatively simple I came up with. Sorry, I haven't tested it thoroughly or measured its performance. I'm sure it can be optimized further; I just didn't have the time to do it. I added some comments to make it easier to follow: http://pastebin.com/nkdTSvi6. It might be a tad too long for StackOverflow, but I'll post it here anyway; the pastebin is just for more comfortable viewing.

function buildTrie(hash) {
    "use strict";
    // A very simple function to build a Trie
    // we could compress this later, but simplicity
    // is better for this example. If we don't
    // perform well, we'll try to optimize this a bit
    // there is a room for optimization here.
    var p, result = {}, leaf, i;
    for (p in hash) {
        if (hash.hasOwnProperty(p)) {
            leaf = result;
            i = 0;
            do {
                if (p[i] in leaf) {
                    leaf = leaf[p[i]];
                } else {
                    leaf = leaf[p[i]] = {};
                }
                i += 1;
            } while (i < p.length);
            // since, obviously, no character
            // equals to empty character, we'll
            // use it to store the reference to the
            // original value
            leaf[""] = hash[p];
        }
    }
    return result;
}

function prefixReplaceHtml(html, trie) {
    "use strict";
    var i, len = html.length, result = [], lastMatch = 0,
        current, leaf, match, matched, replacement;
    for (i = 0; i < len; i += 1) {
        current = html[i];
        if (current === "<") {
            // don't check for out of bounds access
            // assume we never face a situation, when
            // "<" is the last character in an HTML
            if (match) {
                result.push(
                    html.substring(lastMatch, i - matched.length),
                    "<a href=\"", match, "\">", replacement, "</a>");
                lastMatch = i - matched.length + replacement.length;
                i = lastMatch - 1;
            } else {
                if (matched) {
                    // go back to the second character of the
                    // matched string and try again
                    i = i - matched.length;
                }
            }
            matched = match = replacement = leaf = "";
            if (html[i + 1] === "a") {
                // we want to skip replacing inside
                // anchor tags. We also assume they
                // are never nested, as valid HTML is
                // against that idea
                if (html[i + 2] in
                    { " " : 1, "\t" : 1, "\r" : 1, "\n" : 1 }) {
                    // this is certainly an anchor
                    i = html.indexOf("</a", i + 3) + 3;
                    continue;
                }
            }
            // if we got here, it's a regular tag, just look
            // for terminating ">"
            i = html.indexOf(">", i + 1);
            continue;
        }
        // if we got here, we need to start checking
        // for the match in the trie
        if (!leaf) {
            leaf = trie;
        }
        leaf = leaf[current];
        // we prefer longest possible match, just like POSIX
        // regular expressions do
        if (leaf && ("" in leaf)) {
            match = leaf[""];
            replacement = html.substring(
                i - (matched ? matched.length : 0), i + 1);
        }
        if (!leaf) {
            // newby-style inline (all hand work!) pay extra
            // attention, this code is duplicated few lines above
            if (match) {
                result.push(
                    html.substring(lastMatch, i - matched.length),
                    "<a href=\"", match, "\">", replacement, "</a>");
                lastMatch = i - matched.length + replacement.length;
                i = lastMatch - 1;
            } else {
                if (matched) {
                    // go back to the second character of the
                    // matched string and try again
                    i = i - matched.length;
                }
            }
            matched = match = replacement = "";
        } else if (matched) {
            // perhaps a bit premature, but we'll try to avoid
            // string concatenation, when we can.
            matched = html.substring(i - matched.length, i + 1);
        } else {
            matched = current;
        }
    }
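    // append whatever follows the last replacement,
    // otherwise the tail of the document is lost
    result.push(html.substring(lastMatch));
    return result.join("");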
}

function testPrefixReplace() {
    "use strict";
    var trie = buildTrie(
        { "x" : "www.xxx.com", "yyy" : "www.y.com",
          "xy" : "www.xy.com", "yy" : "www.why.com" });
    return prefixReplaceHtml(
        "<html><head>x</head><body><a >yyy</a><p>" +
            "xyyy yy x xy</p><abrval><yy>xxy</yy>", trie);
}
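For readers who want the core idea in isolation, here is a stripped-down sketch of the longest-match lookup that a trie makes possible. It is not a replacement for the code above: buildSimpleTrie and longestMatchAt are hypothetical names, and the sketch ignores HTML tags entirely.

```javascript
// Build a trie from a { term: link } object. As in the code above,
// the empty-string key marks the end of a complete term.
function buildSimpleTrie(terms) {
    "use strict";
    var root = {}, term, node, i, ch;
    for (term in terms) {
        if (Object.prototype.hasOwnProperty.call(terms, term)) {
            node = root;
            for (i = 0; i < term.length; i += 1) {
                ch = term[i];
                if (!node[ch]) {
                    node[ch] = {};
                }
                node = node[ch];
            }
            node[""] = terms[term];
        }
    }
    return root;
}

// Walk the trie from position `start` and remember the longest complete
// term seen so far. Returns null when no term starts at that position.
function longestMatchAt(text, start, trie) {
    "use strict";
    var node = trie, best = null, i;
    for (i = start; i < text.length; i += 1) {
        node = node[text[i]];
        if (!node) {
            break;
        }
        if ("" in node) {
            best = { term: text.substring(start, i + 1), link: node[""] };
        }
    }
    return best;
}

var sampleTrie = buildSimpleTrie(
    { "x": "www.xxx.com", "yyy": "www.y.com",
      "xy": "www.xy.com", "yy": "www.why.com" });
var first = longestMatchAt("xyyy", 0, sampleTrie);  // prefers "xy" over "x"
var second = longestMatchAt("xyyy", 1, sampleTrie); // prefers "yyy" over "yy"
```

The cost of each lookup depends on the length of the text being walked, not on how many terms are stored, which is what makes this approach attractive for a list of several thousand terms.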
  • Thanks wvxvw. It'll take me some time to have a go with your script, but I'll get back to you over the next couple of days. Thanks for your input. – turbonerd Oct 11 '12 at 15:30
0

You can make anything work, the question is: is it worth the time you put into it?

Step 1, ditch the AJAX requirement. Ajax is for interactivity with the server, submitting small bits of data to the server and getting responses. Not ideal for what you are wanting.

Step 2, ditch the JS requirement. JS is for interactivity with the user; all you really want is to deliver a block of text with some words replaced by links, and that should be handled server-side.

Step 3, focus on the PHP. If it's not efficient, attack that and find ways of making it more efficient. What did you try in PHP? Why was it not efficient?

invertedSpear
  • @wvxvw Do you have stats to back that assertion up? I do precisely this kind of operation (albeit with a much smaller data set) using DOMDocument and XPath server-side, caching the result, on a relatively high traffic site hosted on a single VM. – josno Oct 10 '12 at 21:51
  • @wvxvw - Javascript may be faster than php, but what about the bandwidth to send over that object with 6k+ records in it, what if your client is on an old xp machine with abysmal specs? The OP never mentioned the number of hits, so I don't see how you determined he would need 10 servers to handle this load. Please provide some back-up for your, what appear to be, wild accusations there. – invertedSpear Oct 10 '12 at 22:02
  • @wvxvw - You are correct that bandwidth isn't that much of an issue, but remember, it's not just the word, but the URL it would link to as well (unless all words are linking to the same URL). Still bandwidth wouldn't account for much. The problem with using the benchmarks you linked to as gospel is the assumption the JS is running on an equivalent machine as the PHP. Other than extreme edge cases this is almost never true. Those benchmarks are practically meaningless unless OP is using his webserver to browse to a locally hosted page. – invertedSpear Oct 10 '12 at 23:59
  • @wvxvw - Maybe I misspoke, or you misunderstood. I was agreeing with your point that bandwidth is not a factor, even when you consider the extra data. But your points on server vs client performance do not persuade me. If I'm on a crappy VM, why would I care about electricity? Josno brought up a great point that if you process on the server, you can save the output for all future requests of that text, which is not something you would want to do client side. Why would I want to make every client suffer even a little bit every time, when I can make my server suffer the load only once? – invertedSpear Oct 11 '12 at 00:37
  • Hi @invertedSpear, thanks for your input. This is the question I originally asked: http://stackoverflow.com/questions/9359003/how-to-replace-glossary-terms-in-html-text-with-links. Simple PHP solutions just aren't efficient enough and I have absolutely no idea how that XPath solution works. In the end, if I don't understand how it works, I run the risk of - once again - corrupting my data. I don't think any "on load" server-side solution would be efficient enough to cope with 6,000 terms on pages containing 10,000+ characters especially with ~60,000 page views. – turbonerd Oct 11 '12 at 09:33
  • @wvxvw I'm not thinking about 30 years ago, I'm thinking about the machines I have to support now. sub Ghz processors, 256MB ram, spyware laden Win XPOSes that run 10 other programs on top of the web browser. Yeah, a few have beasts that they keep running smoothly, but these are hardcore gamers, developers and other techies, not the norm. When the POS machine has a problem with my site I'm throwing away a customer when I tell them "not my problem" Sure your laptop might best some servers, but you are not an average user and it is folly to think you have anything in common with one. – invertedSpear Oct 11 '12 at 16:36
0

If you have database access to the messages and the word list, I really suggest you do everything in PHP. While this can be done in JS, it will be a lot better as a server-side script.

In JS, basically, you would have to

  • Load the message
  • Load the "dictionary"
  • Loop through each word of the dictionary
    • Find match in DOM (ouch)
      • Replace

The first two points are requests, which add a fair amount of overhead. The loop will be taxing on the client's CPU.

Why I recommend doing this as a server-side code:

  • Servers are better for these types of jobs
  • JS runs in the client's browser, and every client is different (for example, someone might be using IE, which performs worse, or a smartphone)

This is pretty easy to do in PHP:

<?php
    $dict[] = array('word' => 'dolor', 'link' => 'DOLORRRRRR');
    $dict[] = array('word' => 'nulla', 'link' => 'NULLAAAARRRR');

    //  Pretty sure there's a more efficient way to separate an array.. my PHP is rusty, sorry. 
    $terms = array();
    $replace = array();
    foreach ($dict as $v) {
        // If you want to make sure it's a complete word, add a space to the term. 
        $terms[] = ' ' . $v['word'] . ' ';
        $replace[] = ' '. $v['link'] . ' ';
    }

    $text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

    echo str_replace($terms, $replace, $text);


    /* Output: 
    Lorem ipsum DOLORRRRRR sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure DOLORRRRRR in reprehenderit in voluptate velit esse cillum dolore eu fugiat NULLAAAARRRR pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    */

?>

This script is pretty basic, though: the matching is case-sensitive, and because the terms are padded with spaces, a word followed by punctuation will not be matched.

What I would do:

If the PHP performance really hits you hard (I doubt it), you can do the replacement once and save the result. Then, when you add a new word, delete the caches and regenerate them (you can set up a cron job to do that).

Maktouch
  • I will have to disagree with you wvxvw. This is not a hard job for a server. I'm not sure where you read that JS is faster than PHP, I would really like to see some stats. Even if a certain JS engine is faster, it does not mean that the visitor will have the same JS engine (not everyone is using chrome). – Maktouch Oct 10 '12 at 21:54
  • Thanks for your input Aktee. If you try running your script on an array of 6,000 terms with HTML content of around 1,500 words (~10,000 characters) I think you'll find that few servers cope well. I have a high-spec dedicated server exclusively for this website, but we're also getting around 30,000 unique visits per week - maybe 60,000 page views per day. I'm fairly confident that my server couldn't cope with that kind of usage, and when I tried such "simple" PHP solutions, it didn't. Hence the complexity of the solution in my original server-side post. – turbonerd Oct 11 '12 at 09:28
  • @dunc, how about caching it? I'm pretty sure that would be the best solution, unless you have a hdd issue. – Maktouch Oct 11 '12 at 13:15
  • Other than APC etc. I've never really looked at caching. Could you give me a brief/pseudo run-through of what I'd cache and how I'd do it in PHP? – turbonerd Oct 11 '12 at 14:08
  • I'll put another answer because it might take more than 600 chars ;) – Maktouch Oct 11 '12 at 14:22