
I have a list of segments (15,000+ segments), and I want to find the occurrences of those segments in a given string. A segment can be a single word or multiple words, and I cannot assume space as a delimiter in the string.

e.g. String "How can I download codec from internet for facebook, Professional programmer support"

[the string above may not make any sense but I am using it for illustration purpose]

segment list

  • Microsoft word
  • Microsoft excel
  • Professional Programmer.
  • Google
  • Facebook
  • Download codec from internet.

Output:

  • Download codec from internet
  • facebook
  • Professional programmer

Basically I am trying to do query reduction.

I want to achieve this in less than O(list length + string length) time. Since my list has more than 15,000 segments, searching the string for each entry of the list individually would be too time-consuming. The segments are prepared manually and stored in a txt file.

Regards

~Paul


3 Answers


You basically want a string search algorithm like Aho-Corasick string matching. It constructs a state machine for processing bodies of text to detect matches, effectively searching for all patterns at the same time. Its runtime is on the order of the length of the text plus the total length of the patterns.
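For illustration, here is a minimal Aho-Corasick sketch in Python (the question doesn't name a language, so this is just an assumption); it lowercases both the segments and the text, and assumes trailing punctuation in the segment file is stripped beforehand:

```python
from collections import deque

def build_automaton(patterns):
    """Build an Aho-Corasick automaton: a trie plus failure links."""
    # Each node: child map, failure link, and patterns ending at this node.
    trie = [{"next": {}, "fail": 0, "out": []}]
    for pat in patterns:
        node = 0
        for ch in pat.lower():
            nxt = trie[node]["next"].get(ch)
            if nxt is None:
                trie.append({"next": {}, "fail": 0, "out": []})
                nxt = len(trie) - 1
                trie[node]["next"][ch] = nxt
            node = nxt
        trie[node]["out"].append(pat)
    # Breadth-first pass to compute failure links.
    queue = deque(trie[0]["next"].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node]["next"].items():
            queue.append(child)
            f = trie[node]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[child]["fail"] = trie[f]["next"].get(ch, 0)
            # Inherit matches reachable via the failure link.
            trie[child]["out"] += trie[trie[child]["fail"]]["out"]
    return trie

def find_segments(trie, text):
    """Scan the text once, reporting every pattern that occurs in it."""
    found, node = set(), 0
    for ch in text.lower():
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        found.update(trie[node]["out"])
    return found
```

Note this matches raw substrings, so a segment like "Google" would also fire inside "Googled"; if that matters, add a word-boundary check around each reported match.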

Chris Pitman
  • I created an Aho-Corasick matcher that has over 10 million strings (artists names and song titles from the MusicBrainz database). Given a string, I can return all the artists and titles that it matches. The thing is amazingly fast, letting me process thousands of strings per second. – Jim Mischel Mar 12 '11 at 07:18

In order to do efficient searches, you will need an auxiliary data structure in the form of some sort of index. Here, a great place to start would be to look at a KWIC index:

http://en.wikipedia.org/wiki/Key_Word_in_Context

http://www.cs.duke.edu/~ola/ipc/kwic.html
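As a sketch of the indexing idea (not a full KWIC concordance), you can key the segment list by first word so that at each word of the input only a handful of candidate segments need checking. This assumes words can be extracted with a simple regex after lowercasing; the function names are invented for illustration:

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")

def build_first_word_index(segments):
    """Index segments by their first word (a keyword-style index)."""
    index = defaultdict(list)
    for seg in segments:
        words = WORD.findall(seg.lower())
        if words:
            index[words[0]].append(words)
    return index

def match_segments(index, text):
    """At each word of the text, test only segments keyed on that word."""
    words = WORD.findall(text.lower())
    hits = set()
    for i, w in enumerate(words):
        for seg_words in index.get(w, []):
            if words[i:i + len(seg_words)] == seg_words:
                hits.add(" ".join(seg_words))
    return hits
```

Lookup cost then depends on how many segments share a first word, not on the full 15,000-entry list.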

anon

What you're basically asking is how to write a custom lexer/parser.

Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison).

Take a look at this question:

Poor man's lexer for C#

Now of course, a lot of people are going to say "just use regular expressions". Perhaps. The problem with using regex in this situation is that your execution time will grow linearly with the number of tokens you are matching against. So, if you end up needing to "segment" more phrases, your execution time will get longer and longer.

What you need to do is make a single pass, pushing words onto a stack and checking whether they form a valid token after adding each one. If they don't, then you need to continue (disregarding the words the way a compiler disregards comments).
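A rough Python sketch of that single pass, assuming the segment list is split into words beforehand and matching greedily for the longest phrase (the function name is invented for illustration):

```python
def greedy_match(phrases, words):
    """Single left-to-right pass: grow a word buffer while it is still a
    prefix of some phrase; emit a token when the buffer is a full phrase."""
    # Precompute every word-level prefix of every phrase.
    prefixes = set()
    for p in phrases:
        parts = p.split()
        for k in range(1, len(parts) + 1):
            prefixes.add(tuple(parts[:k]))
    tokens, i = [], 0
    while i < len(words):
        buf, best, j = [], None, i
        while j < len(words) and tuple(buf + [words[j]]) in prefixes:
            buf.append(words[j])
            j += 1
            if " ".join(buf) in phrases:
                best = j  # remember the longest complete phrase so far
        if best is not None:
            tokens.append(" ".join(words[i:best]))
            i = best
        else:
            i += 1  # disregard, like a compiler disregarding comments
    return tokens
```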

Hope this helps.

James