I'm quite new to regular expressions and cannot figure out how to do what I want. I have a text file as input and want to extract "articles" from it. The problem is that if I read the text line by line, a match cannot cover an entire article, since it stops at the first line break.
What I would like to do is extract everything from one occurrence of a specific pattern up to its next occurrence, regardless of any line breaks in between (Python).
Example input:
Article ler — NOM
Latius iam disseminata licentia onerosus bonis omnibus Caesar nullum post haec adhibens modum orientis latera cuncta vexabat nec honoratis parcens nec urbium primatibus nec plebeiis.
Article 2 — ANNEE
Nemo quaeso miretur, si post exsudatos labores itinerum longos congestosque adfatim commeatus fiducia vestri ductante barbaricos pagos adventans velut mutato repente consilio ad placidiora deverti.
Article 3 — DATE Ego vero sic intellego, Patres conscripti, nos hoc tempore in provinciis decernendis perpetuae pacis habere oportere rationem.
This is the regular expression I came up with: "^(.*(?=((?i)article(\s\d{1,2})*)).*)"
As output, I get something like this:
Article ler — NOM
Article 2 — ANNEE
Article 3 — DATE Ego vero sic intellego, Patres conscripti, nos hoc tempore in provinciis decernendis perpetuae pacis habere oportere rationem.
The first two don't cover the entire article (title + content): this is my problem. Does anybody know how to solve it?
Thanks!
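import re

if __name__ == "__main__":
    # Same pattern as above; the inline (?i) is passed as re.IGNORECASE instead,
    # because Python 3.11+ rejects global flags that are not at the start.
    label_pattern = r"^(.*(?=(article(\s\d{1,2})*)).*)"
    pattern = re.compile(label_pattern, re.IGNORECASE)
    with open('texte.txt') as f:
        for i, line in enumerate(f):
            for match in pattern.finditer(line):
                print(i + 1, match.group(1))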
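
For reference, here is a minimal sketch of the behaviour I am after, not a definitive solution. It assumes that every article begins on a line starting with the word "Article" (in any case) and that the whole file fits in memory. The names ARTICLE_START and split_articles are hypothetical, chosen only for this illustration. Instead of matching line by line, it finds where each article starts in the full text and slices from one start to the next.

```python
import re

# Assumption: an article begins wherever a line starts with "article"
# (case-insensitive). re.MULTILINE lets ^ match at the start of every line.
ARTICLE_START = re.compile(r"^article\b", re.IGNORECASE | re.MULTILINE)


def split_articles(text):
    """Return each article (title + content) as one string."""
    starts = [m.start() for m in ARTICLE_START.finditer(text)]
    # Each article runs from its own start to the start of the next one,
    # or to the end of the text for the last article.
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]
```

With the file from the question, this would be called on the whole content at once, for example split_articles(open('texte.txt').read()), rather than on each line. A content line that itself starts with the word "Article" would be treated as a new article, so the start pattern may need to be made stricter for real data.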