I have one sentence:
sentence1 = "Vincennes Confirmation des privilèges de la villed'Aire au bailliage d'Amiens Mai 1498 Aire-sur-la-Lys, Pas-de-Calais, arrondissement de Saint-Omer."
My script below prints the start offset, end offset, and text of each token:
import re

for element in re.finditer(r"[\w'-]+|[.,!?;]", sentence1):
    start = element.start()
    end = element.end()
    value = sentence1[start:end]
    print(start, end, value)
I get the following output:
0 9 Vincennes
10 22 Confirmation
23 26 des
27 37 privilèges
38 40 de
41 43 la
44 55 villed'Aire
56 58 au
59 68 bailliage
69 77 d'Amiens
78 81 Mai
82 86 1498
87 102 Aire-sur-la-Lys
102 103 ,
104 117 Pas-de-Calais
117 118 ,
119 133 arrondissement
134 136 de
137 147 Saint-Omer
...
This output is mostly what I want, but I'm looking for a better regex than [\w'-]+|[.,!?;]
so that words beginning with an elided article or preposition are split after the apostrophe, for example:
d'Amiens => ["d'", "Amiens"]
d'Abrimcourt => ["d'", "Abrimcourt"]
...
but other words that contain an apostrophe should stay in one piece:
villed'Aire => ["villed'Aire"]
...
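One direction that might work, as a rough sketch rather than a definitive answer: put an extra alternative in front of the existing pattern that matches a short elided prefix (d', l', qu', ...) only at the start of a word. Because the regex engine tries alternatives from left to right, "d'Amiens" is split, while "villed'Aire" does not start with an elided prefix and falls through to the original [\w'-]+ branch. The ELISIONS list and the tokenize helper are hypothetical names introduced here for illustration, and the list of prefixes is an assumption that would need extending for real data.

```python
import re

# Assumed (hypothetical) list of French elided prefixes; extend as needed.
ELISIONS = r"(?:jusqu|lorsqu|puisqu|qu|[cdjlmnst])"

TOKEN_RE = re.compile(
    rf"\b{ELISIONS}'"   # elided prefix such as d' or l', only at the start of a word
    r"|[\w'-]+"         # any other word, keeping internal apostrophes and hyphens
    r"|[.,!?;]",        # a single punctuation mark
    re.IGNORECASE,
)

def tokenize(text):
    """Return a list of (start, end, value) tuples, one per token."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN_RE.finditer(text)]

for word in ("d'Amiens", "d'Abrimcourt", "villed'Aire"):
    print(word, "=>", [value for _, _, value in tokenize(word)])
```

The offsets still refer to positions in the original string, so they can be used the same way as in the script above. Note that this only handles the straight apostrophe ('), as the original pattern does.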
Does anyone have an idea? Thanks.