
I have a txt file that I converted from a pdf that contains a long list of items. These items have a numbering convention as follows:

[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}

This expression would match something between:

A1.1.1

and

ZZ99.99.99

This works just fine. The issue is that I am trying to capture the item number in group 1 and everything between it and the next item number (the item description) in group 2.

I also need these returned as a list or an iterable so that, eventually, the contents captured can be exported to an excel spreadsheet.

This is the regex I have currently:

^([A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}\s)([\w\W]*?)(?:\n)

Follow this link to find a sample of what I have and the issues I am facing:

Debuggex Demo

Is anyone able to help me figure out how to capture everything between each number no matter how many paragraphs?

Any input would be greatly appreciated, thanks!

  • I don't know Python, but I had a similar [question](https://stackoverflow.com/questions/46331543/use-regex-to-split-numbered-list-array-into-numbered-list-multiline) recently, and this is the [regex101 demo](https://regex101.com/r/WpiKin/3). Hope it helps. – danieltakeshi Oct 10 '17 at 19:54

1 Answer

You are very close:

import re

s = """
A1.2.1 This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.ZZ99.99.99
"""
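# Raw string so the backslashes reach the regex engine; re.DOTALL lets "." span newlines.
final_data = re.findall(r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}(.*?)[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}", s, re.DOTALL)
print(final_data)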

Output:

[' This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.']

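By using the lazy group (.*?) you can match any text between two item numbers as defined by your first regex. Keep in mind that . does not match newlines unless the re.DOTALL flag is passed. Also, this pattern consumes the closing item number, so in a list of consecutive items only every other description is returned; a lookahead for the next item number avoids that.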
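To get both the item number (group 1) and a multi-paragraph description (group 2) for every item, one option is to end the lazy group with a lookahead instead of consuming the next item number. The following is a minimal sketch, not a tested solution for the original file: the names `ITEM`, `PATTERN`, `parse_items`, and the sample text are hypothetical, and it assumes each item number starts at the beginning of a line.

```python
import re

# Item numbers such as A1.1.1 or ZZ99.99.99, as defined in the question.
ITEM = r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}"

# Group 1: an item number at the start of a line.
# Group 2: everything up to the next line that starts with an item number,
#          or up to the end of the text. The lookahead does not consume
#          the next item number, so no item is skipped.
PATTERN = re.compile(
    rf"^({ITEM})\s+(.*?)(?=^{ITEM}\s|\Z)",
    re.MULTILINE | re.DOTALL,
)


def parse_items(text):
    """Return a list of (item_number, description) tuples."""
    return [(m.group(1), m.group(2).strip()) for m in PATTERN.finditer(text)]


# Hypothetical sample text, not taken from the original file.
sample = """A1.1.1 First description.

Second paragraph of the first description.
B2.10.3 Another description
that spans two lines.
ZZ99.99.99 Last item.
"""

for number, description in parse_items(sample):
    print(number, repr(description))
```

The result is a plain list of tuples, so each tuple can be written out as one spreadsheet row later. A description line that itself begins with something shaped like an item number would be treated as a new item, which may or may not be what you want.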

Ajax1234