8

I have to classify a large list of URLs (a few million lines) as belonging to a particular category or not. I have another list of sub-strings; if any of these is present in a URL, the URL belongs to that category. Say, Category A.

The list of sub-strings to check has around 10k entries. What I did was simply go line by line through the sub-string file and look for a match; if one is found, the URL belongs to Category A. In tests I found this rather time consuming.

I'm not a computer science student, so I don't have much knowledge of optimizing algorithms. Is there a way to make this faster? Simple ideas are fine. The programming language is not a big issue, but Java or Perl would be preferable.

The list of sub-strings to match will not change much. I will, however, receive different lists of URLs, so I have to run this each time I get one. The bottleneck seems to be the URLs, as they can get very long.
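
Here is roughly what I am doing now (a simplified sketch in Java; the real lists come from files, and the names here are just for illustration):

import java.util.*;

// Simplified sketch of the current naive approach: for every URL, scan the
// whole sub-string list and stop at the first match.
List<String> urls = Arrays.asList(
        "http://example.com/sports/news.html",
        "http://example.com/about");
List<String> substrings = Arrays.asList("/sports/", "score=");   // ~10k in reality
List<String> categoryA = new ArrayList<>();

for (String url : urls) {
    for (String sub : substrings) {
        if (url.contains(sub)) {          // up to ~10k contains() calls per URL
            categoryA.add(url);
            break;                        // no need to test the remaining sub-strings
        }
    }
}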

sfactor
  • 12,592
  • 32
  • 102
  • 152
  • 1
    You can use an information retrieval system (e.g. Lucene, in Java) to index the URLs and then search for each string. The indexing will be time consuming, but it saves time for each "query", since you don't have to iterate over the whole list. – amit Apr 13 '11 at 07:41
  • 1
    10k times, say, 10 million is what, 100 billion? Yeah, that'll take some time regardless of the language. If something is in Category A, does that mean it can't be in any other category? If so, you can remove everything from the big list that gets assigned to a category. – jcomeau_ictx Apr 13 '11 at 07:44
  • 1
    The list of substrings is constant, so there is no reason for this to take a long time; see my answer. The length of the list only affects the memory taken by the automaton, and even that will probably be small. – Asaf Apr 13 '11 at 07:46
  • Would parallelization help you? In addition to other optimizations of course. – jmg Apr 13 '11 at 07:49
  • Although I answered this before checking, it is a duplicate of [http://stackoverflow.com/questions/1765579/fast-algorithm-for-searching-for-substrings-in-a-string](http://stackoverflow.com/questions/1765579/fast-algorithm-for-searching-for-substrings-in-a-string) – Asaf Apr 13 '11 at 08:19

9 Answers

8

Yes. I implemented the Aho-Corasick algorithm in Java for the problem you are describing, and it showed a consistent improvement of about 180x over the naive implementation (what you are doing). There are several implementations available online, although I would tweak them for better performance. Note that the solution's complexity is bounded by the length of the word (in your case, the URL) and not by the number of sub-strings. Furthermore, it requires only a single pass over the input for matching.
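
For illustration, here is a minimal, map-based sketch of the idea in Java (class and method names are illustrative, Java 8 assumed; this is not the tuned production code mentioned above, and it only answers "does any sub-string occur?" rather than reporting positions):

import java.util.*;

class AhoCorasick {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;
        boolean terminal;            // true if some sub-string ends at this node
    }

    private final Node root = new Node();

    void add(String pattern) {
        Node cur = root;
        for (char c : pattern.toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.terminal = true;
    }

    // Compute failure links with a breadth-first pass over the trie.
    void build() {
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                child.terminal |= child.fail.terminal;   // inherit matches that end earlier
                queue.add(child);
            }
        }
    }

    // One pass over the text; true as soon as any added pattern occurs in it.
    boolean matchesAny(String text) {
        Node cur = root;
        for (char c : text.toCharArray()) {
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            cur = cur.next.getOrDefault(c, root);
            if (cur.terminal) return true;
        }
        return false;
    }
}

Build it once from the 10k sub-strings (call add for each, then build()), and call matchesAny(url) for every URL; replacing the maps with plain arrays for the edges makes it faster still.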

P.S. We used to give this question to people in job interviews, so there are many ways to solve it. The one I offer is the one we use in production code, which (for now) beats all the other solutions.

Edit: wrote the wrong algorithm name previously, fixed...

Asaf
  • 6,384
  • 1
  • 23
  • 44
  • Hey thanks, the Aho-Corasick algorithm worked like a charm. I got about a 50x improvement over the naive implementation. Just one more curiosity: what was the performance of the KMP algorithm? Would that be even faster? :) – sfactor Apr 13 '11 at 11:44
  • 1
    Nah, forget about KMP. That was a mistake I made (I copied the name from the wrong email). The Aho-Corasick algorithm is linear in the length of the input and simple to understand/implement; I wouldn't bother with more optimization unless really necessary. What I did to speed it up is just use arrays everywhere for representing the edges between nodes (as opposed to maps). Also, the original algorithm returns the location of the match; you can speed it up even more if you waive that ability. – Asaf Apr 13 '11 at 11:57
4

Perl is very good at optimizing a long list of alternative strings in a regular expression (up to a certain overall compiled regex length, beyond which it reverts to a less efficient mechanism). You should be able to construct a regex to match a certain category like this:

$catAre = join( '|', map quotemeta, @catAstrings );
$catAre = qr/$catAre/;
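# usage sketch: a URL then belongs to Category A if it matches the combined pattern,
# e.g.  if ( $url =~ $catAre ) { ... }   # $url stands for each URL being tested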
ysth
  • 96,171
  • 6
  • 121
  • 214
3

Different approaches are of course possible to optimize this. Given your background, I'll sketch a simple one, which assumes that the list of sub-strings does not change very often (see the example after the steps below).

  1. Generate one huge regular expression from all the sub-strings.
  2. Compile that regexp (see, for example, the class Pattern in Java) and store a reference to the compiled regular expression.
  3. Use the same compiled regular expression to match every URL.
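
A rough sketch of those steps in Java (names are illustrative; Java 8 assumed for the stream calls):

import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Quote each sub-string, join them into one alternation, compile once,
// then reuse the compiled Pattern for every URL.
List<String> substrings = Arrays.asList("/sports/", "/games/", "score=");
List<String> urls = Arrays.asList(
        "http://example.com/sports/news.html",
        "http://example.com/about");

String alternation = substrings.stream()
        .map(Pattern::quote)                    // escape regex metacharacters
        .collect(Collectors.joining("|"));
Pattern categoryA = Pattern.compile(alternation);

for (String url : urls) {
    if (categoryA.matcher(url).find()) {
        // url belongs to Category A
    }
}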
jmg
  • 7,308
  • 1
  • 18
  • 22
  • 1
    I would wager this will perform very poorly. Have you tried doing something like that with 10k strings? Bullet (1) would be very difficult to pull off efficiently, and barring that, the rest will end up almost as inefficient as the naive implementation. – Asaf Apr 13 '11 at 07:52
  • @Asaf: It will perform poorly if you have a bad regexp engine or if you don't precompile the regexp. But otherwise it should create an automaton almost as good as the one built by the KMP algorithm. I wanted to give an approach that is applicable without deep computer science knowledge. Otherwise KMP is the obvious solution. – jmg Apr 13 '11 at 08:04
  • 1
    @jmg, I agree KMP is a bit complex; I meant Aho-Corasick and fixed my answer accordingly (I used KMP for something different). Aho-Corasick has many ready-made [implementations](https://hkn.eecs.berkeley.edu/~dyoo/java/index.html) and I feel it is relatively easy to understand. Furthermore, even assuming you could somehow build the "perfect" regex out of the 10k strings, which I think is a harder algorithmic problem than the original one, I don't see how the solution would not depend (complexity-wise) on the number of sub-strings. – Asaf Apr 13 '11 at 08:09
  • @Asaf: Building the optimal regexp, or better, the automaton, is a task for the regexp compiler, which will probably be done badly by backtracking-based regexp implementations. – jmg Apr 13 '11 at 08:33
  • @Asaf @jmg thanks guys, I'll test these methods and see which one works for me. – sfactor Apr 13 '11 at 09:02
  • 3
    @Asaf: perl regexes use Aho-Corasick – ysth Apr 13 '11 at 09:10
  • 2
    @Asaf: Why, yes, I *have* done something like this with ~13k strings. Works like a champ, although I did it in Perl and punted step 1 over to the Regexp::Assemble module rather than attempting to build/optimize the regex myself. – Dave Sherohman Apr 13 '11 at 09:59
  • @Dave, I understand that Regexp::Assemble (I wasn't familiar with that package) does in fact use Aho-Corasick, which is the same solution. I would still opt for the Java implementation, simply because it is simple code and I can further optimize it to my needs, but that is personal preference. – Asaf Apr 13 '11 at 12:47
  • 1
    @Asaf, @Dave Sherohman: perl has natively used Aho-Corasick for multiple fixed strings for several years; Regexp::Assemble is one of many similar modules that predates that and is now only useful for combining more complex regexes. – ysth Apr 13 '11 at 17:31
  • @ysth: By "for several years", am I correct to assume you mean "since perl 5.10"? I know that the smart match operator handles this sort of thing well enough to not need R::A, but I'm not aware of anything similar having been in the perl 5.8 core. – Dave Sherohman Apr 14 '11 at 07:20
  • Nothing to do with smart matching; back in 2006 this was in 5.9.4, and in 2007 in 5.10.0: http://perldoc.perl.org/perl5100delta.html#Regular-expressions-optimisations – ysth Apr 14 '11 at 07:36
2

I'd suggest using the venerable grep instead of a programming language for this task. It uses the fast Boyer-Moore string searching algorithm, which should be sufficient for a few million lines.
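
For example, with GNU grep, assuming the sub-strings live one per line in substrings.txt and the URLs in urls.txt (illustrative file names), something like this prints every URL that contains any of the sub-strings:

grep -F -f substrings.txt urls.txt > categoryA_urls.txt

(-F treats the patterns as fixed strings rather than regexes; -f reads them from a file.)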

darioo
  • 46,442
  • 10
  • 75
  • 103
  • I'm not sure grep would be efficient here. The algorithm jumps ahead when a match is not possible, which works great for a small number of words; if you have 10k words, grep will probably have a hard time skipping ahead (or back, depending on the optimization), since there will be many common prefixes (or suffixes). – Asaf Apr 13 '11 at 07:50
2

I've done this sort of thing before in Perl, comparing a list of ~13k keywords against an incoming stream of data from Twitter to find all tweets matching any of those keywords (and which keywords each one matches). In rough terms, the code looks like:

use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add(@keywords);
my $regex = $ra->re;

for my $tweet (@tweets) {
  my @matches = $tweet =~ /$regex/g;
  # do whatever with @matches...
}

Note that this uses Regexp::Assemble to build the regex, which is not part of the core Perl distribution, so you'll need to install it from CPAN if you want to adapt this code.

If you're using perl 5.10 or later, there's also the "smart match" operator (~~) which can do something similar without requiring any additional modules.

Dave Sherohman
  • 45,363
  • 14
  • 64
  • 102
1

You could compress the substrings into classes sharing the same prefix. This should cut down the time by a significant margin.

If you're looking for matches by shifting the string by 1 each iteration, you can improve your speed quite a bit using a better algorithm (as with regular expressions).

  • The prefix optimization is done automatically if you put all the sub-strings into one regular expression, at least by any reasonably optimized regular expression engine. – jmg Apr 13 '11 at 07:45
1

For Java libraries that implement common string search algorithms, see the answers to https://stackoverflow.com/questions/5564610/fast-alernative-for-stringindexofstring-str. Coupled with parallelization, you should be able to parse millions of URLs fairly quickly. It's easy enough to do; you should probably try it out and see whether the time is acceptable before looking much further into optimization.
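
Since each URL can be checked independently, a sketch of the parallelization idea (Java 8 assumed; names are illustrative, and the inner check could be swapped for any of the faster matchers from the other answers):

import java.util.*;
import java.util.stream.Collectors;

// Filter the URL list in parallel; the per-URL check here is still the simple
// contains() scan, spread over all available cores.
List<String> substrings = Arrays.asList("/sports/", "score=");
List<String> urls = Arrays.asList(
        "http://example.com/sports/news.html",
        "http://example.com/about");

List<String> categoryA = urls.parallelStream()
        .filter(url -> substrings.stream().anyMatch(url::contains))
        .collect(Collectors.toList());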

WhiteFang34
  • 70,765
  • 18
  • 106
  • 111
1

I wrote this first as a comment, but then realized it is more appropriate as an answer.

You can use an information retrieval system (such as Apache Lucene in Java) to index the URLs as documents. Then, after indexing, you can iterate over the queries and search for each of them; the results will be the matching URLs.

PROS:
  • Searching will not require iterating over all the URLs for each query.
  • If you later need the intersection or union of sub-strings/queries, the library gives you this functionality.

CONS:
  • Indexing will take a while...
  • You might need some extra space in RAM/on disk for the index.

I think it's a method worth exploring; maybe the time spent indexing is worth the gain in searching.

amit
  • 175,853
  • 27
  • 231
  • 333
0

I am currently working on this problem, and I came to this conclusion:

Aho-Corasick consumes more memory while building its tree. If memory is not an issue, it's fine. But have a look at the HAT-trie: it is a combination of a hash table and a trie (tree). It builds a tree down to some level, and the remaining characters of each key form a hash value that is marked in a hash table.

Sorry for the technical language, but I think a HAT-trie is a better option if you are searching for a specific URL within the list of URLs. (I have built a HAT-trie that consumes 12 MB to store 600,000 (6 lakh) URLs.)

Brijesh Valera
  • 1,085
  • 2
  • 9
  • 30