Split sentence to words containing apostrophe

Question

Suppose I have a sentence like this:

Aujourd'hui séparer l'élément en deux

and I want the result, after the split, to be individual words:

Aujourd'hui | séparer | l' | élément | en | deux

Note: as you can see, "aujourd'hui" is a single word and must not be split.

What would be the best regex to use here?

With my current knowledge, all I can achieve is this basic operation:

QString sentence("Aujourd'hui séparer l'élément en deux");
QStringList list = sentence.split(" ");

Output:

Aujourd'hui | séparer | l'élément | en | deux

Here are the two questions closest to mine : this and this.

score 1 · Answer 1 · answered Apr 18 '20 at 12:51


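I'm not sure I understood the question, but this might help:

QString sentence("Aujourd'hui séparer l'élément en deux");
// A QString separator is matched whole, so use a character class instead.
// Caveat: this also splits "Aujourd'hui" and drops the apostrophe of "l'".
QStringList list = sentence.split(QRegularExpression("[ ']"));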


Astro


score 1 · Answer 2 · answered Apr 18 '20 at 13:12


I don't know C++, but I guess it supports negative lookbehind.

Have a try with:

(?: |(?<!\w{2})')

This will split on a space, or on an apostrophe that is not preceded by two word characters.

Demo & explanation
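For readers who want the rule spelled out without a regex engine, here is a minimal sketch in plain C++ (the default ECMAScript grammar of std::regex has no lookbehind, so the pattern cannot be handed to it unchanged). It is only an illustration of the rule described above: the names splitDroppingApostrophe and isWordChar are invented for this example, and treating every non-ASCII code point as a word character is a simplifying assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: rough stand-in for \w.
// Assumption: every non-ASCII code point counts as a word character.
static bool isWordChar(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
           (c >= U'0' && c <= U'9') || c == U'_' || c >= 0x80;
}

// Splits on a space, or on an apostrophe that is NOT preceded by two word
// characters. The separator itself is dropped, as with a regex split.
std::vector<std::u32string> splitDroppingApostrophe(const std::u32string& s) {
    std::vector<std::u32string> out;
    std::u32string cur;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char32_t c = s[i];
        // Mirrors the lookbehind: are the two previous characters both word chars?
        const bool twoWordCharsBefore =
            i >= 2 && isWordChar(s[i - 1]) && isWordChar(s[i - 2]);
        if (c == U' ' || (c == U'\'' && !twoWordCharsBefore)) {
            out.push_back(cur);  // empty parts are kept
            cur.clear();
        } else {
            cur.push_back(c);
        }
    }
    out.push_back(cur);
    return out;
}
```

Called on the sentence from the question, this should return Aujourd'hui, séparer, l, élément, en, deux. Note that the apostrophe of "l'" is consumed as part of the separator, just as it is with the regex.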


Toto


score 1 · Answer 3 · answered Apr 18 '20 at 13:24

Well, you're dealing with a natural language here, and the first (and toughest) question to answer is: can you actually come up with a fixed rule for when splits should happen? In this particular case, there is no logical reason why French treats "aujourd'hui" as a single word (when, logically, it could be parsed as "au jour de hui").

I'm not familiar with all the possible pitfalls in French, but if you really want to make sure to cover all obscure cases, you'll have to look for a natural language tokenizer.

Anyway, for the example you give, it may be good enough to use a QRegularExpression that splits on whitespace and also on an apostrophe, with a negative lookbehind to omit the split when more than one letter precedes the apostrophe:

sentence.split(QRegularExpression("\\s+|(?<!\\w\\w)'"));

score 1 · Accepted Answer · answered Apr 18 '20 at 13:42

Since the French contractions you want to treat as separate words are usually a single letter plus an apostrophe (as in l'huile, n'en, d'accord), you can use a pattern that matches either one or more whitespace characters, or the position immediately preceded by the start of a word, exactly one letter, and an apostrophe.

I also suggest taking into account curly apostrophes. So, use

 \s+|(?<=\b\p{L}['’])\b

See the regex demo.

Details

\s+ - one or more whitespace characters
| - or
(?<=\b\p{L}['’])\b - a word boundary (\b) position that is preceded by the start of a word (\b), a single letter (\p{L}) and a ' or ’.

In Qt, you may use

QStringList result = text.split(
     QRegularExpression(R"(\s+|(?<=\b\p{L}['’])\b)", 
        QRegularExpression::PatternOption::UseUnicodePropertiesOption)
);

R"(...)" is raw string literal notation. If your C++ environment does not support raw string literals, use "\\s+|(?<=\\b\\p{L}['’])\\b" instead.
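If Qt's regex engine is not available, the same idea can be approximated by hand. The following is a rough sketch in standard C++, not a drop-in replacement for the pattern: the function splitKeepingElision and its helpers are hypothetical names, the "letter" test is a crude assumption (ASCII letters plus any non-ASCII code point other than the curly apostrophe), and unlike a regex split it silently drops empty parts.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helpers, named for this sketch only.
static bool isElisionApostrophe(char32_t c) { return c == U'\'' || c == U'\u2019'; }
static bool isBlank(char32_t c) {
    return c == U' ' || c == U'\t' || c == U'\n' || c == U'\r';
}
// Assumption: any non-ASCII code point except the curly apostrophe is a letter.
static bool isLetterLike(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
           (c >= 0x80 && !isElisionApostrophe(c));
}
static bool isWordLike(char32_t c) {
    return isLetterLike(c) || (c >= U'0' && c <= U'9') || c == U'_';
}

// Splits on whitespace runs, and right after "<one letter><apostrophe>" when
// that letter starts a word and a word character follows the apostrophe.
std::vector<std::u32string> splitKeepingElision(const std::u32string& s) {
    std::vector<std::u32string> out;
    std::u32string cur;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isBlank(s[i])) {
            // Whitespace ends the current word; empty parts are skipped.
            if (!cur.empty()) { out.push_back(cur); cur.clear(); }
            continue;
        }
        cur.push_back(s[i]);
        const bool elision =
            isElisionApostrophe(s[i]) && i >= 1 && isLetterLike(s[i - 1]) &&
            (i == 1 || !isWordLike(s[i - 2])) &&      // the letter starts a word
            i + 1 < s.size() && isWordLike(s[i + 1]);  // a word continues after it
        if (elision) { out.push_back(cur); cur.clear(); }
    }
    if (!cur.empty()) out.push_back(cur);
    return out;
}
```

On the sentence from the question this should yield Aujourd'hui, séparer, l', élément, en, deux, with the apostrophe kept on the contraction. Working on std::u32string keeps each accented letter and the curly apostrophe as a single element, which avoids having to decode UTF-8 bytes by hand.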
