1

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.

What I'm trying to do is split text into chunks of, say, two words each.
For example, "That was a good movie." should result in "That was", "was a", "a good", "good movie".

What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.

Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.

So, my question is: how can I force the expression to skip every other whitespace?

Christos Karapapas

3 Answers

1

First of all, we can use \W to identify the characters that separate the words. To treat a run of consecutive separators as a single separator, we will use:

\W+

With that in mind, you want to split at every second run of characters matched by "\W+". The resulting tokens must therefore have the following form:

<a "word"> <separators that are matched by the pattern "\W+"> <another "word">

This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.

For the first split, you can try this pattern:

\w+\W+\w+\K\W+

Then tokenize each resulting token again using:

\W+

For getting tokens of 3 "words", you can use the following pattern for the initial split:

\w+\W+\w+\W+\w+\K\W+

This approach relies on \K, which discards everything matched so far from the reported match and reports only what is matched after it. So essentially, we match a word, match separators, match another word, forget all of that, then match separators and return only those.

In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
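
To see what the two tokenizers produce, the two-step split can be sketched outside RapidMiner. This is only an illustration in Python: the built-in re module does not support \K, so the sketch emulates it with a capture group around the trailing separators and cuts the text at that group. The function names are hypothetical.

```python
import re

# Python's re has no \K, so the capture group marks the separator run
# that \K would have isolated in \w+\W+\w+\K\W+.
PAIR_THEN_SEPARATOR = re.compile(r"\w+\W+\w+(\W+)")


def split_after_every_second_word(text):
    """First tokenizer: cut the text at every second run of separators."""
    tokens, start = [], 0
    for match in PAIR_THEN_SEPARATOR.finditer(text):
        tokens.append(text[start:match.start(1)])  # keep the two words
        start = match.end(1)                       # skip the separators
    if start < len(text):
        tokens.append(text[start:])                # whatever is left over
    return tokens


def split_words(token):
    """Second tokenizer: split one token on \\W+ and drop empty strings."""
    return [word for word in re.split(r"\W+", token) if word]


if __name__ == "__main__":
    for token in split_after_every_second_word("That was a good movie."):
        print(repr(token), split_words(token))
```

On the example sentence, the first step yields "That was", "a good", and the leftover "movie.", so the pairs produced this way do not overlap.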

Also note that, by default, \w typically matches only ASCII letters, digits, and the underscore, so if your documents contain text in a different script, those characters will be consumed by \W, which is supposed to match only the separators. If you want to capture text in non-Latin scripts, such as Greek, you need to change the pattern like this:

\p{L}+\P{L}+\p{L}+\K\P{L}+

Furthermore, if you want the pattern to capture text in one language but not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you would use "{Greek}", or "{InGreek}", which is what RapidMiner supports.

Manthos
0

What you can do is use a zero-width group, such as the positive lookahead in the example below. A regex usually "consumes" the characters it matches, but with a positive lookahead or lookbehind you assert that certain characters are present without consuming them, so later matches can still use those characters.

This should work for your purposes:

(\w+)(?=(\W+\w+))

This pattern matches once for each pair of consecutive words (note that it won't match at the last word, since no word follows it). The first word is in the first capture group, (\w+). The positive lookahead then requires a run of non-word characters, \W+, followed by another run of word characters, \w+. Because this part sits inside the lookahead (?=...), the second word is not "consumed" and can start the next match.

Here is a link to a demo on Regex101

Note that for each match, the first word is in capture group 1, and group 2 holds the separators followed by the second word.
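
As a rough illustration of how the two groups can be read back, here is a minimal sketch using Python's built-in re module rather than RapidMiner. The function name is hypothetical; only the pattern comes from the answer above.

```python
import re

# Pattern from the answer: a word, then a lookahead for separators + next word.
WORD_THEN_LOOKAHEAD = re.compile(r"(\w+)(?=(\W+\w+))")


def pairs_via_lookahead(text):
    """Return (group 1, group 2) for every match of the pattern."""
    return [(m.group(1), m.group(2)) for m in WORD_THEN_LOOKAHEAD.finditer(text)]


if __name__ == "__main__":
    for first, rest in pairs_via_lookahead("That was a good movie."):
        # Joining the groups gives the two-word chunk, separators included.
        print(first + rest)
```

Because the match itself covers only the first word, using this pattern directly as a tokenizer delimiter would not produce the pairs; they have to be assembled from the groups.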

anerisgreat
  • This doesn't work as I described. I found out that the solution for this problem is related to n-grams, take a look at this https://regex101.com/r/UIN8mM/3 – Christos Karapapas Nov 03 '19 at 20:19
0

Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)), inspired by this SO question.
In hindsight my question was poorly framed: this is really a problem of overlapping regex matches.
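
The same lookahead idea can be sketched for chunks of any length. This is a minimal Python illustration, not RapidMiner configuration; the function name and the n parameter are assumptions added for the example, and the word class [A-Za-z]+ is the one from the pattern above.

```python
import re


def word_ngrams(text, n=2):
    """Collect overlapping n-word chunks with a capturing lookahead."""
    if n < 1:
        raise ValueError("n must be at least 1")
    word = r"[A-Za-z]+"
    # For n=2 this builds (?=(\b[A-Za-z]+(?:\s[A-Za-z]+){1})),
    # equivalent to the pattern above.
    pattern = r"(?=(\b" + word + r"(?:\s" + word + r"){" + str(n - 1) + r"}))"
    # The match itself is empty; findall returns the captured group.
    return re.findall(pattern, text)


if __name__ == "__main__":
    print(word_ngrams("That was a good movie.", 2))
    print(word_ngrams("That was a good movie.", 3))
```

Since the whole chunk is captured inside the lookahead, nothing is consumed and the next match can begin at the following word, which is what makes the chunks overlap.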

Christos Karapapas