Suppose I have a sentence like this:

Aujourd'hui séparer l'élément en deux

And I want to split it into individual words, like this:

Aujourd'hui | séparer | l' | élément | en | deux

Note: as you can see, « aujourd'hui » is a single word.

What would be the best regex to use here?


With my current knowledge, all I can achieve is this basic operation:

QString sentence("Aujourd'hui séparer l'élément en deux");
QStringList list = sentence.split(" ");

Output:

Aujourd'hui / séparer / l'élément / en / deux

Here are the two questions closest to mine: this and this.

Bo Halim

4 Answers

I'm not sure I understood what you are asking, but this might help:

QString sentence("Aujourd'hui séparer l'élément en deux");
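// split(" '") would only match the literal two-character string " ' ";
// a character class splits on either a space or an apostrophe instead.
QStringList list = sentence.split(QRegularExpression("[ ']"));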
Astro

I don't know C++, but I guess it supports negative lookbehind.

Have a try with:

(?: |(?<!\w{2})')

This will split on a space, or on an apostrophe that is not preceded by two word characters.

Demo & explanation

Toto

Well, you're dealing with a natural language here, and the first (and toughest) problem to answer is: can you actually come up with a fixed rule for when splits should happen? In this particular case, there is really no logical reason why French treats "aujourd'hui" as a single word (logically, it could be parsed as "au jour de hui").

I'm not familiar with all the possible pitfalls in French, but if you really want to make sure to cover all obscure cases, you'll have to look for a natural language tokenizer.

Anyway, for the example you give, it may be good enough to use a QRegularExpression with negative lookbehind to omit splits when more than one letter precedes the apostrophe:

sentence.split(QRegularExpression("(?<![\\w][\\w])'"));
Tfry

Since the contractions that you want to treat as separate words are usually a single letter plus an apostrophe in French (as in l'huile, n'en, d'accord), you can use a pattern that matches either one or more whitespace characters, or a position that is immediately preceded by the start of a word, one letter, and an apostrophe.

I also suggest taking into account curly apostrophes. So, use

 \s+|(?<=\b\p{L}['’])\b

See the regex demo.

Details

  • \s+ - 1+ whitespaces
  • | - or
  • (?<=\b\p{L}['’])\b - a word boundary (\b) position that is preceded by the start of a word (\b), a letter (\p{L}), and a ' or ’.

In Qt, you may use

QStringList result = text.split(
     QRegularExpression(R"(\s+|(?<=\b\p{L}['’])\b)", 
        QRegularExpression::PatternOption::UseUnicodePropertiesOption)
);

The R"(...)" is raw string literal notation; if your C++ environment does not support raw string literals, you may use "\\s+|(?<=\\b\\p{L}['’])\\b" instead.
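
If Qt's regex engine is not available, note that std::regex has no lookbehind, so the pattern above cannot be reused there directly. The same rule can be approximated by hand in standard C++. What follows is only a rough sketch and not part of the answer above: the names splitFrenchWords and apostropheLength are made up for illustration, the input is assumed to be UTF-8, and the single leading letter is assumed to be ASCII (l, d, n, j, ...). Unlike the regex, it only checks that some text follows the apostrophe, not that a word character does.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: byte length of an apostrophe starting at byte `pos`
// of a UTF-8 string, or 0 if there is none there.
// Recognises the ASCII ' and the curly apostrophe U+2019 (bytes E2 80 99).
static std::size_t apostropheLength(const std::string& s, std::size_t pos)
{
    if (pos < s.size() && s[pos] == '\'')
        return 1;
    if (pos <= s.size() && s.compare(pos, 3, "\xE2\x80\x99") == 0)
        return 3;
    return 0;
}

// Hypothetical helper: splits on runs of whitespace, then splits a chunk
// once more right after a leading "single ASCII letter + apostrophe".
std::vector<std::string> splitFrenchWords(const std::string& sentence)
{
    std::vector<std::string> words;
    std::istringstream stream(sentence);
    std::string chunk;
    while (stream >> chunk) {  // operator>> skips runs of whitespace
        const bool startsWithLetter =
            std::isalpha(static_cast<unsigned char>(chunk[0])) != 0;
        const std::size_t n = startsWithLetter ? apostropheLength(chunk, 1) : 0;
        if (n > 0 && chunk.size() > 1 + n) {
            words.push_back(chunk.substr(0, 1 + n));  // e.g. "l'"
            words.push_back(chunk.substr(1 + n));     // e.g. "élément"
        } else {
            words.push_back(chunk);  // e.g. "Aujourd'hui" stays whole
        }
    }
    return words;
}
```

Calling splitFrenchWords("Aujourd'hui séparer l'élément en deux") should give the six pieces from the question, with the apostrophe kept on "l'". Elisions with more than one letter before the apostrophe, such as "qu'", are not split by this sketch (nor by the single-letter regex rule).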

Wiktor Stribiżew