I'm quite new to regular expressions and cannot figure out how to do what I want. I have a text file as input and want to extract "articles" from it. The problem is that if I read the text line by line, a match cannot cover an entire article, since each line stops at the line break.

What I would like to do is extract everything from a specific pattern up to its next occurrence, regardless of any line breaks in between (in Python).

Example input:

Article ler — NOM
Latius iam disseminata licentia onerosus bonis omnibus Caesar nullum post haec adhibens modum orientis latera cuncta vexabat nec honoratis parcens nec urbium primatibus nec plebeiis.
Article 2 — ANNEE
Nemo quaeso miretur, si post exsudatos labores itinerum longos congestosque adfatim commeatus fiducia vestri ductante barbaricos pagos adventans velut mutato repente consilio ad placidiora deverti.
Article 3 — DATE Ego vero sic intellego, Patres conscripti, nos hoc tempore in provinciis decernendis perpetuae pacis habere oportere rationem.

And this is the regular expression I have come up with: "^(.*(?=((?i)article(\s\d{1,2})*)).*)"

As output, I get something like this:

Article ler — NOM
Article 2 — ANNEE
Article 3 — DATE Ego vero sic intellego, Patres conscripti, nos hoc tempore in provinciis decernendis perpetuae pacis habere oportere rationem.

The first two don't cover the entire article (title + content): this is my problem. Does anybody know how to solve it?

Thanks!


import re

if __name__ == "__main__":

    label_pattern = r"^(.*(?=((?i)article(\s\d{1,2})*)).*)"

    pattern = re.compile(label_pattern)

    # the file is read one line at a time, so no match can span several lines
    with open('texte.txt') as f:
        for i, line in enumerate(f):
            for match in pattern.finditer(line):
                print(i+1, match.group(1))

Sol

2 Answers

If possible, read the whole file into a single string and apply the following regex to that string:

(?<=Article)[\s\S]*?(?=Article|$)

Explanation:

  • (?<=Article) - positive lookbehind to find the position immediately preceded by the text Article
  • [\s\S]*? - matches 0+ occurrences of any character (including newlines). The trailing ? makes the match lazy.
  • (?=Article|$) - Positive lookahead to find the position that is immediately followed by either another Article or end-of-full-string represented by $
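
As a rough illustration of the explanation above, here is a minimal sketch that applies the pattern to an in-memory string with `re.findall`. The sample text and the variable names (`text`, `bodies`) are assumptions made for the example; in practice the string would come from reading the whole file. Note that without the `re.MULTILINE` flag, `$` in Python can also match just before a trailing newline, so the last body may lose its final `\n`.

```python
import re

# Hypothetical sample standing in for the contents of the text file.
text = (
    "Article 1 - NOM\n"
    "Latius iam disseminata licentia.\n"
    "Article 2 - ANNEE\n"
    "Nemo quaeso miretur.\n"
)

# Lookbehind for "Article", lazy match across newlines,
# stopping right before the next "Article" or the end of the string.
pattern = re.compile(r"(?<=Article)[\s\S]*?(?=Article|$)")

bodies = pattern.findall(text)
for number, body in enumerate(bodies, start=1):
    print(number, repr(body))
```

Each element of `bodies` holds the text that follows the word "Article", newlines included, up to the next occurrence.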
Gurmanjot Singh
  • I got my code to work with this pattern, but I had to change the `$` to a `\Z`. Could it be something with the way strings are read in from a file? – pault Dec 26 '17 at 15:36
  • For an explanation on that, you can see [this question](https://stackoverflow.com/questions/22519318/regex-differences-between-and-a-z) – Gurmanjot Singh Dec 26 '17 at 15:38
  • Just, is it possible to include the word "Article" in the match? :) – Sol Dec 26 '17 at 15:45
  • @Sol Yes, just rewrite the regex as [`Article[\s\S]*?(?=Article|$)`](https://regex101.com/r/quwApD/2) – Gurmanjot Singh Dec 26 '17 at 15:46
  • Haha, just perfect! However, the "\n" symbol appears in the matches; maybe I can remove it afterwards? – Sol Dec 26 '17 at 15:48
  • Hmmm...I don't see any `\n` in the matches. Anyways, `[\s\S]` is going to match anything. – Gurmanjot Singh Dec 26 '17 at 15:49
  • With "matches = re.findall(label_pattern, file_text)", I have \n in my results; weird... – Sol Dec 26 '17 at 15:52
  • @Sol The original text file has `\n` in it, hence you see those in the matches. If you want to remove or replace them, feel free to do so as per your requirement. Maybe you can replace them with a space or even an empty string – Gurmanjot Singh Dec 26 '17 at 15:52
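
The points raised in these comments can be put together in a short sketch: keep the word "Article" in the match, anchor on `\Z` instead of `$`, and collapse the newlines afterwards. The sample text and the names `articles`, `multiline_dollar` and `multiline_first` are assumptions made for the example. The thread does not say why `$` had to be replaced; one possible cause, shown in the last lines, is that with `re.MULTILINE` the `$` also matches at the end of every line, so the lazy match stops after the title.

```python
import re

# Hypothetical sample standing in for the contents of the text file.
text = (
    "Article 1 - NOM\n"
    "Latius iam disseminata licentia.\n"
    "Article 2 - ANNEE\n"
    "Nemo quaeso miretur.\n"
)

# "Article" is part of the match; \Z only matches at the very end of the string.
pattern = re.compile(r"Article[\s\S]*?(?=Article|\Z)")

# Replace every run of whitespace (newlines included) with a single space.
articles = [" ".join(match.split()) for match in pattern.findall(text)]
for article in articles:
    print(article)

# With re.MULTILINE, "$" matches at each line end, so the lazy match
# ends at the first line break instead of at the next article.
multiline_dollar = re.compile(r"Article[\s\S]*?(?=Article|$)", flags=re.MULTILINE)
multiline_first = multiline_dollar.search(text).group()
print(repr(multiline_first))
```

Be aware that this pattern would also cut an article short if its body happened to contain the word "Article".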

Your problem is the way you are reading the file. If you iterate through the lines in the file, then you won't be able to get multi-line matches. Instead, if you want to use regex, read the whole file in at once.

(Side Note: There may be better ways to achieve this result without using regex.)

import re

label_pattern = r"(?<=^)(article )(\d{1,2})((.)|(\n))+?(?=(^(article)|(\Z)))"

# the flags are set once here, so the compiled pattern can be used directly
pattern = re.compile(label_pattern, flags=re.IGNORECASE | re.MULTILINE)

with open('texte.txt') as f:
    file_text = f.read()  # read the whole file

for i, match in enumerate(pattern.finditer(file_text)):
    print("MATCH %d:\n%s" % (i+1, match.group()))

The output is:

MATCH 1:
Article 1er - NOM
Latius iam disseminata licentia onerosus bonis omnibus Caesar nullum post haec adhibens modum orientis latera cuncta vexabat nec honoratis parcens nec urbium primatibus nec plebeiis.

MATCH 2:
Article 2 - ANNEE
Nemo quaeso miretur, si post exsudatos labores itinerum longos congestosque adfatim commeatus fiducia vestri ductante barbaricos pagos adventans velut mutato repente consilio ad placidiora deverti.

MATCH 3:
Article 3 - DATE Ego vero sic intellego, Patres conscripti, nos hoc tempore in provinciis decernendis perpetuae pacis habere oportere rationem.

Also, I assumed that there is a typo in your example text on the first line. You wrote "Article ler" but I think you meant "Article 1er" (the digit 1 instead of the letter l). Without this change, the first article is not matched, since the pattern looks for "article" followed by 1 or 2 digits.
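
Regarding the side note about avoiding regex, one possible approach is to group lines as they are read, starting a new article whenever a line begins with the marker. This is only a sketch under that assumption; `split_articles`, its `marker` parameter and the sample lines are hypothetical names introduced for the example. Since it does not require a digit after "article", it also accepts the "Article ler" spelling.

```python
def split_articles(lines, marker="article "):
    """Group lines into articles; each line starting with `marker`
    (case-insensitive) opens a new article."""
    articles = []
    for line in lines:
        stripped = line.rstrip("\n")
        if line.lower().startswith(marker):
            articles.append([stripped])      # title line of a new article
        elif articles:
            articles[-1].append(stripped)    # continuation of the current article
        # lines before the first marker are ignored
    return ["\n".join(parts) for parts in articles]


# Hypothetical sample standing in for the lines of the text file.
sample = [
    "Article ler - NOM\n",
    "Latius iam disseminata licentia.\n",
    "Article 2 - ANNEE\n",
    "Nemo quaeso miretur.\n",
]
articles = split_articles(sample)
for number, article in enumerate(articles, start=1):
    print("MATCH %d:\n%s\n" % (number, article))
```

Because the function accepts any iterable of lines, an open file object could be passed to it directly, which avoids loading the whole file into memory at once.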

pault