
I have one sentence:

sentence1 = "Vincennes Confirmation des privilèges de la villed'Aire au bailliage d'Amiens Mai 1498 Aire-sur-la-Lys, Pas-de-Calais, arrondissement de Saint-Omer."

My script below returns the start offset, end offset, and text of each token:

import re

# Match either a run of word characters, apostrophes and hyphens,
# or a single punctuation mark.
for element in re.finditer(r"[\w'-]+|[.,!?;]", sentence1):
    start = element.start()
    end = element.end()
    value = sentence1[start:end]
    print(start, end, value)

I get the following output:

0 9 Vincennes
10 22 Confirmation
23 26 des
27 37 privilèges
38 40 de
41 43 la
44 55 villed'Aire
56 58 au
59 68 bailliage
69 77 d'Amiens
78 81 Mai
82 86 1498
87 102 Aire-sur-la-Lys
102 103 ,
104 117 Pas-de-Calais
117 118 ,
119 133 arrondissement
134 136 de
137 147 Saint-Omer

...

This output is what I want, but I'm looking for a better regex than [\w'-]+|[.,!?;], one that splits words at the apostrophe, for example:

d'Amiens => ["d'", "Amiens"]
d'Abrimcourt => ["d'", "Abrimcourt"]
...

but not :

villed'Aire => ["villed'Aire"]
...

Does anyone have an idea? Thanks.

Lter
    Sounds like you might want a more refined tool than a regex – maybe nltk? https://stackoverflow.com/a/42621812/51685 – AKX Mar 24 '21 at 10:24

1 Answer

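You can use either of these patterns (the second matches any non-word, non-whitespace character rather than only . , ! ? ;):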

\b[dlnmtsj]'|\w+(?:['-]\w+)*|[.,!?;]
\b[dlnmtsj]'|\w+(?:['-]\w+)*|[^\w\s]

See the regex demo.

Details:

  • \b[dlnmtsj]' - a word boundary, then d (e.g. d'argent), l (e.g. l'huile), n (e.g. n'en), m (e.g. m'appelle), t (e.g. t'appelles), s (e.g. s'appelle) or j (e.g. j'ai), followed by '
  • | - or
  • \w+(?:['-]\w+)* - one or more word chars, followed by zero or more repetitions of ' or - and then one or more word chars
  • | - or
  • [.,!?;] - ., ,, !, ? or ;. Replace it with [^\w\s] to match any character that is neither a word character nor whitespace.
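As a quick check, here is a minimal sketch that plugs the first pattern into the loop from the question. The names TOKEN_RE and tokens are only illustrative. Note that the pattern is case-sensitive, so an uppercase D' at the start of a word would not be split unless you add a flag such as re.IGNORECASE.

```python
import re

sentence1 = ("Vincennes Confirmation des privilèges de la villed'Aire "
             "au bailliage d'Amiens Mai 1498 Aire-sur-la-Lys, Pas-de-Calais, "
             "arrondissement de Saint-Omer.")

# First alternative: an elided d', l', n', m', t', s' or j' at the start of a word.
# Second alternative: a word, optionally continued by '-parts or hyphen-parts.
# Third alternative: a single punctuation mark.
TOKEN_RE = re.compile(r"\b[dlnmtsj]'|\w+(?:['-]\w+)*|[.,!?;]")

# Collect (start, end, text) for every token, as in the question.
tokens = [(m.start(), m.end(), m.group()) for m in TOKEN_RE.finditer(sentence1)]

for start, end, value in tokens:
    print(start, end, value)
```

With this pattern, d'Amiens comes out as two tokens (69 71 d' and 71 77 Amiens), while villed'Aire stays a single token at 44 55, because the d there is not at a word boundary.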
Wiktor Stribiżew
  • @LucasTerriel Yes, if you need to treat a `_` char as a special one that breaks words, you should use `\b[dlnmtsj]'|[^\W_]+(?:['-][^\W_]+)*|[^\w\s]|_` – Wiktor Stribiżew Mar 24 '21 at 10:37
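To illustrate the difference the underscore variant makes, here is a small sketch comparing the two patterns. The sample string "d'un_mot" and the variable names are made up for the example and do not come from the question.

```python
import re

# Pattern from the answer: "_" counts as a word character under \w.
DEFAULT_RE = r"\b[dlnmtsj]'|\w+(?:['-]\w+)*|[^\w\s]"
# Variant from the comment: "_" breaks words and is emitted as its own token.
UNDERSCORE_RE = r"\b[dlnmtsj]'|[^\W_]+(?:['-][^\W_]+)*|[^\w\s]|_"

sample = "d'un_mot"  # hypothetical input

default_tokens = re.findall(DEFAULT_RE, sample)
underscore_tokens = re.findall(UNDERSCORE_RE, sample)

print(default_tokens)     # ["d'", 'un_mot']
print(underscore_tokens)  # ["d'", 'un', '_', 'mot']
```

Both patterns use only non-capturing groups, so re.findall returns the full matches.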