
This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.

Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full-text searching. Performance is very important; I will be parsing many gigabytes of text.

Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (as far as I know) and requires some heuristics. This has been done a thousand times before, and probably in a thousand different ways, but I can't even find one of them :)

Any good pointers?

Thanks!

– Rabbit

7 Answers


The C++ String Toolkit Library (StrTk) has the following solution to your problem:

#include <iostream>
#include <string>
#include <deque>
#include "strtk.hpp"

int main()
{
   std::deque<std::string> word_list;
   strtk::for_each_line("data.txt",
                        [&word_list](const std::string& line)
                        {
                           const std::string delimiters = "\t\r\n ,.;:'\""
                                                          "!@#$%^&*_-=+`~/\\"
                                                          "()[]{}<>";
                           strtk::parse(line,delimiters,word_list);
                        });

   std::cout << strtk::join(" ",word_list) << std::endl;

   return 0;
}

More examples can be found here.


If performance is a main concern, you should probably stick to good old strtok, which is sure to be fast:

/* strtok example */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] = "- This, a sample string.";
  char *pch;
  printf ("Splitting string \"%s\" into tokens:\n", str);
  pch = strtok (str, " ,.-");
  while (pch != NULL)
  {
    printf ("%s\n", pch);
    pch = strtok (NULL, " ,.-");
  }
  return 0;
}
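
One caveat if you go this route: strtok modifies its input and keeps hidden static state between calls, which becomes a problem as soon as you tokenize in parallel. Here is a sketch of the same loop with the reentrant POSIX strtok_r (assuming a POSIX platform; MSVC's strtok_s takes the same arguments):

/* strtok_r example: same output, but the parser state lives in an
   explicit 'saveptr' instead of hidden static storage */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] = "- This, a sample string.";
  char *saveptr;
  char *pch = strtok_r (str, " ,.-", &saveptr);
  while (pch != NULL)
  {
    printf ("%s\n", pch);
    pch = strtok_r (NULL, " ,.-", &saveptr);
  }
  return 0;
}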
– clyfe
  • strtok is **not** a tokenizer. You still have to figure out the difference between a `class` or a `const` or an identifier that happens to be named something like `calculate`. – T.E.D. Apr 08 '10 at 14:44
  • A tokenizer *identifies* the tokens, and afterwards a *lexical analyzer* categorizes them (i.e. the phrase "joe eats" -> tokenizer -> {joe, eats} -> lexical analyzer -> { (joe, noun), (eats, verb) }). Tokenization is the process of *demarcating* and **possibly** classifying sections of a string of input characters. In the classifying sense, not even the Boost tokenizer does the classification. – clyfe Apr 08 '10 at 14:54
  • http://stackoverflow.com/questions/380455/looking-for-a-clear-definition-of-what-a-tokenizer-parser-and-lexers-are-a – clyfe Apr 08 '10 at 15:31
  • Using `C` for "performance" while in `C++` is usually a case of premature optimization... though I'm not saying streams aren't a bit slower ;) – Matthieu M. Apr 08 '10 at 16:35
  • Hmmm. Back in my day they used to be synonymous. I see from Wikipedia that is no longer the case. Damn kids, killing my grass and changing my language... – T.E.D. Apr 08 '10 at 22:39

A regular expression library might work well if your tokens aren't too difficult to parse.
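For instance, here is a minimal sketch using std::sregex_iterator, where a token is simply a run of word characters (this assumes a C++11 compiler; boost::regex offers the same interface if you're stuck on an older one):

#include <iostream>
#include <regex>
#include <string>

int main()
{
    const std::string text = "This is, a test.";
    const std::regex word_re("\\w+");   // a token is a run of word characters
    const std::sregex_iterator end;     // default-constructed: end of matches
    for (std::sregex_iterator it(text.begin(), text.end(), word_re); it != end; ++it)
    {
        std::cout << it->str() << "\n";
    }
    return 0;
}

Keep in mind that regex engines add per-match overhead, so benchmark before committing to this for gigabytes of input.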

– Jay

I might look into std::stringstream from <sstream>. C-style strtok has a number of usability problems, and C-style strings are just troublesome.

Here's an ultra-trivial example of it tokenizing a sentence into words:

#include <sstream>
#include <iostream>
#include <string>

int main(void)
{
   std::stringstream sentence("This is a sentence with a bunch of words");
   std::string word;
   // using extraction as the loop condition avoids a spurious empty
   // token once the stream runs out
   while (sentence >> word)
   {
      std::cout << "Got token: " << word << std::endl;
   }
}

janks@phoenix:/tmp$ g++ tokenize.cc && ./a.out
Got token: This
Got token: is
Got token: a
Got token: sentence
Got token: with
Got token: a
Got token: bunch
Got token: of
Got token: words

The std::stringstream class is "bi-directional", in that it supports input and output. You'd probably want to do just one or the other, so you'd use std::istringstream or std::ostringstream.

The beauty of them is that they are also a std::istream and a std::ostream respectively, so you can use them as you'd use std::cin or std::cout, which are hopefully familiar to you.
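A quick sketch of both one-directional variants (the strings here are only for illustration):

#include <iostream>
#include <sstream>
#include <string>

int main()
{
    // input-only: extract whitespace-separated tokens from an existing string
    std::istringstream in("one two three");
    std::string word;
    while (in >> word)
        std::cout << word << "\n";

    // output-only: build a string incrementally
    std::ostringstream out;
    out << "tokens seen: " << 3;
    std::cout << out.str() << std::endl;
    return 0;
}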

Some might argue these classes are expensive to use; std::strstream from <strstream> is mostly the same thing, but built on top of cheaper C-style 0-terminated strings (note, though, that it has been deprecated since C++98). It might be faster for you. But anyway, I wouldn't worry about performance right away. Get something working, and then benchmark it. Chances are you can get enough speed by simply writing well-written C++ that minimizes unnecessary object creation and destruction. If it's still not fast enough, then you can look elsewhere. These classes are probably fast enough, though. Your CPU can waste thousands of cycles in the amount of time it takes to read a block of data from a hard disk or network.

– janks
  • This approach does a bad job on punctuation: "This is: a sentence, with a bunch of words" -> ("This" "is:" "a" "sentence," "with" "a" "bunch" "of" "words"), although I believe that can be overcome... it also tokenizes only on whitespace: http://codepad.org/m69UhzKN – clyfe Apr 08 '10 at 15:28
  • Obviously, hence the "ultra-trivial" comment. Of course there are myriad member functions of `std::istream` that will allow you to customize the tokenization to, for example, use different delimiters, etc. I'm not suggesting that the tokenization should literally be built on top of operator>>; that was just for the trivial example. – janks Apr 08 '10 at 15:32

You can use the Ragel State Machine Compiler to create a tokenizer (or a lexical analyzer).

The generated code has no external dependencies.

I suggest you look at the clang.rl sample for a relevant example of the syntax and usage.

– Hasturkun
  • Ragel is a full-blown parser generator (albeit a fast one); I think it's too much for tokenization (index creation) needs, or even completely useless for them. – clyfe Apr 08 '10 at 20:48
  • @clyfe: I don't think so, really... Ragel's own tokenizer is in fact written in ragel, and the output code is very lightweight. – Hasturkun Apr 08 '10 at 23:11
  • This is for programming languages, not for natural languages. Totally different. – Lothar Apr 08 '18 at 03:50

Well, I would begin by searching Boost and... hop: Boost.Tokenizer

The nice thing? By default it breaks on whitespace and punctuation, because it's meant for text, so you won't forget a symbol.

From the introduction:

#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "This is,  a test";
   tokenizer<> tok(s);
   for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

// prints
This
is
a
test

// note how the ',' and ' ' were nicely removed

And there are additional features:

  • it can escape characters
  • it is compatible with Iterators so you can use it with an istream directly... and thus with an ifstream

and a few options (like keeping empty tokens, etc.); a couple of them are sketched below
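
For example, a sketch of those options with boost::char_separator (this particular configuration is my own illustration, not from the Boost introduction): drop spaces, report commas as tokens of their own, and keep empty tokens.

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

int main()
{
    std::string s = "This is,  a test";
    // arguments: dropped delimiters, kept delimiters, empty-token policy
    boost::char_separator<char> sep(" ", ",", boost::keep_empty_tokens);
    boost::tokenizer<boost::char_separator<char> > tok(s, sep);
    for (boost::tokenizer<boost::char_separator<char> >::iterator it = tok.begin();
         it != tok.end(); ++it)
    {
        std::cout << "<" << *it << ">\n";
    }
    return 0;
}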

Check it out!

– Matthieu M.

I wrote my own tokenizer as part of the open-source SWISH++ indexing and search engine.

There's also the ICU tokenizer, which handles Unicode.
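
For the record, here is a minimal sketch of word segmentation with ICU's BreakIterator (my illustration, not code from either project; link against the ICU libraries, e.g. -licuuc):

#include <iostream>
#include <memory>
#include <string>
#include <unicode/brkiter.h>
#include <unicode/unistr.h>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    icu::UnicodeString text =
        icu::UnicodeString::fromUTF8("Tokens aren't an exact science.");

    // a word-boundary iterator for a given locale
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createWordInstance(icu::Locale::getUS(), status));
    if (U_FAILURE(status)) return 1;

    bi->setText(text);
    int32_t start = bi->first();
    for (int32_t end = bi->next(); end != icu::BreakIterator::DONE;
         start = end, end = bi->next())
    {
        icu::UnicodeString piece;
        text.extractBetween(start, end, piece);
        std::string utf8;
        piece.toUTF8String(utf8);
        // pieces between boundaries include spaces and punctuation;
        // a real indexer would filter those out
        std::cout << "<" << utf8 << ">\n";
    }
    return 0;
}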

– Paul J. Lucas
  • I second the use of the ICU tokenizer. Of all the suggestions here, it is the closest answer. There are better ones in NLP packages, but they are usually very slow because they do much more language tagging (verb, noun, adjective analysis) and require huge vocabulary files. – Lothar Apr 08 '18 at 03:53