I'm working with giant text files that each contain more than one document. The documents share a very similar structure, with fixed fields and dynamic values. I need to split them into an array of arrays, one inner array per document.
Example:
[
[] <- Doc1
[] <- Doc2
[] <- Doc3
[] <- Doc4
...
...
...
]
For this, I need a regular expression that defines the delimiters: where each document starts and where it ends.
Example:
DOC_START
TEXT
TEXT
TEXT
TEXT
DOC_FINAL
DOC_START
TEXT
TEXT
TEXT
TEXT
DOC_FINAL
REGEX: ((?:DOC_START)(?:[\s\S]+)(?:DOC_FINAL)?)
The problem is that some documents have peculiarities: they may start or end with a slightly different marker, so I need to be able to pass the start and end markers as options.
My question: how can I do this? And how can I improve the regex?
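On the regex itself, one possible improvement is sketched below in Python (the language is an assumption, since none is named here). It swaps the greedy `[\s\S]+` for a lazy `[\s\S]*?` and makes the closing delimiter mandatory, so each match stops at the first `DOC_FINAL` rather than running on to the last one in the file. The names `DOC_PATTERN` and `split_docs` are hypothetical.

```python
import re

# Lazy quantifier (*?) ends each match at the first DOC_FINAL
# that follows a DOC_START, instead of the last one in the text.
DOC_PATTERN = re.compile(r"DOC_START[\s\S]*?DOC_FINAL")

def split_docs(text):
    """Return one list of lines per document found in text."""
    return [match.splitlines() for match in DOC_PATTERN.findall(text)]

sample = "DOC_START\nTEXT\nTEXT\nDOC_FINAL\nDOC_START\nTEXT\nDOC_FINAL\n"
docs = split_docs(sample)
print(len(docs))  # 2
```

With the greedy version, the same sample would come back as a single match spanning both documents.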
Just to be clear, a document's start or end marker is sometimes a bit different. Example:
DOC_START
TEXT
TEXT
TEXT
TEXT
DOC_FINAL
DOC_START
TEXT
TEXT
TEXT
TEXT
DOC_FINAL
OTHER_START
TEXT
TEXT
TEXT
TEXT
DOC_FINAL
DOC_START
TEXT
TEXT
TEXT
TEXT
OTHER_FINAL
OTHER_START
TEXT
TEXT
TEXT
TEXT
OTHER_FINAL
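For input like the above, a minimal sketch of passing start and end options could build the pattern from lists of accepted markers, escaping each one and joining them with `|`. It is written in Python as an assumption. `build_doc_pattern`, `split_documents` and their parameters are hypothetical names. It also assumes each marker sits alone on its own line with `\n` line endings.

```python
import re

def build_doc_pattern(starts, ends):
    """Compile a regex matching one document delimited by any start/end marker."""
    start_alt = "|".join(re.escape(s) for s in starts)
    end_alt = "|".join(re.escape(e) for e in ends)
    # With MULTILINE, ^ and $ anchor each marker to a whole line.
    # [\s\S]*? is lazy, so a match ends at the first end marker.
    return re.compile(
        rf"^(?:{start_alt})$[\s\S]*?^(?:{end_alt})$",
        re.MULTILINE,
    )

def split_documents(text, starts=("DOC_START",), ends=("DOC_FINAL",)):
    """Return one list of lines per document; markers are configurable."""
    pattern = build_doc_pattern(starts, ends)
    return [m.group(0).splitlines() for m in pattern.finditer(text)]

text = "DOC_START\nTEXT\nDOC_FINAL\nOTHER_START\nTEXT\nOTHER_FINAL\n"
docs = split_documents(
    text,
    starts=("DOC_START", "OTHER_START"),
    ends=("DOC_FINAL", "OTHER_FINAL"),
)
print(len(docs))  # 2
```

This version needs the whole text in memory. For very large files, a line-by-line scan that tests each line against the same marker lists may be a better fit.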