
Is there a command in Scala to ignore all kinds of numbers, something like `IgnoreNumbers() ~>`?

I'm a Scala newbie and, in fact, I only need to use a single script in this language.

Thanks a lot for any help!

It's for the tokenizer from this example: http://nlp.stanford.edu/software/tmt/tmt-0.4/examples/example-1-dataset.scala

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // Remove punctuation
  CaseFolder() ~>                        // Lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // Ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // Take terms with >=3 characters
}
user2864740
MarkF6

1 Answer


I've never used ScalaNLP, but it looks trivial to create a new type based on WordsAndNumbersOnlyFilter (better than modifying the original in place) by simply removing the Number usage, e.g.

case class WordsOnlyFilter() extends Transformer {
  // original from WordsAndNumbersOnlyFilter
  // override def apply(terms : Iterable[String]) =
  //   terms.filter(term => TokenType.Word.matches(term) || TokenType.Number.matches(term));

  // Modification that doesn't use/accept TokenType.Number
  override def apply(terms : Iterable[String]) =
    terms.filter(term => TokenType.Word.matches(term));
}

Then:

val tokenizer = {
  // ..
  WordsOnlyFilter() ~>         // Ignore non-words
  // ..
}
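
If it helps to see the filtering idea in isolation, here is a minimal standalone sketch that does not depend on the TMT jar at all. Everything in it (`WordsOnlySketch`, `isNumber`, `dropNumbers`, and the regex) is a hypothetical name chosen for illustration, not part of ScalaNLP, and its notion of a "number" is an assumption that may differ from what `TokenType.Number` accepts.

```scala
// Standalone sketch: none of these names come from TMT/ScalaNLP.
object WordsOnlySketch {
  // Assumption: a "number" is a run of digits with an optional sign and
  // optional "."/"," separated digit groups, e.g. "42", "-3.5", "1,000".
  private val NumberPattern = """[+-]?\d+([.,]\d+)*"""

  // String.matches requires the whole term to match the pattern.
  def isNumber(term: String): Boolean = term.matches(NumberPattern)

  // Same shape as the apply method above: terms in, filtered terms out.
  def dropNumbers(terms: Iterable[String]): Iterable[String] =
    terms.filterNot(isNumber)
}
```

For example, `WordsOnlySketch.dropNumbers(List("topic", "42", "model", "3.14", "b2b"))` keeps `topic`, `model` and `b2b`. Note that mixed tokens such as `b2b` survive here because only purely numeric terms are dropped; whether `TokenType.Word.matches` treats them the same way is something you would have to check against the library.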
user2864740
  • Thanks a lot for this proposition! :) I've got a stupid question: Where can / should I modify this? I used the tmt-0.4.0.jar, downloaded from here: http://nlp.stanford.edu/software/tmt/tmt-0.4/ – MarkF6 May 02 '14 at 09:21
  • @MarkF6 Just include it (WordsOnlyFilter) in your code. Make sure that the correct import statements (probably `chalk.text.transform`) are applied so that Transformer/TokenType.Word can be found. – user2864740 May 02 '14 at 09:27
  • Hm it gives me this error: "scalanlp.serialization.TypedCompanionException: No registered handler supports value type class Main$$anon$1$WordsOnlyFilter" – MarkF6 May 02 '14 at 10:05
  • But I didn't add "import chalk.text.transform", because it could not be found :( – MarkF6 May 02 '14 at 10:05
  • @MarkF6 Maybe that's not what the ScalaNLP library you have is using. In any case, simply find and modify/duplicate the `WordsAndNumbersOnlyFilter` class as appropriate. – user2864740 May 02 '14 at 20:04