Split a huge text using regex delimiters

Question

I'm working with giant text files that each contain more than one document. These documents have a very similar layout, with fixed fields and dynamic values. I need to separate these documents into arrays.

Example:

[
   [] <- Doc1
   [] <- Doc2
   [] <- Doc3
   [] <- Doc4
   ...
   ...
   ...
]

For this, I need to create a regular expression that defines the delimiters: where a doc starts and where it ends.

Example:

DOC_START
TEXT
TEXT
TEXT
TEXT
DOC_FINAL
DOC_START
TEXT
TEXT
TEXT
TEXT
DOC_FINAL

REGEX: ((?:DOC_START)(?:[\S\S]+)(?:DOC_FINAL)?)

The catch is that some documents may have peculiarities, starting or ending with something a bit different, so I need to be able to pass start and end options.

My question: how can I do this? And how can I also improve the regex?

Just to be clear, sometimes the beginning or the ending of a document may be a bit different. Example:

DOC_START
TEXT
TEXT
TEXT
TEXT
DOC_FINAL
DOC_START
TEXT
TEXT
TEXT
TEXT
DOC_FINAL
OTHER_START
TEXT
TEXT
TEXT
TEXT
DOC_FINAL
DOC_START
TEXT
TEXT
TEXT
TEXT
OTHER_FINAL
OTHER_START
TEXT
TEXT
TEXT
TEXT
OTHER_FINAL

Just a remark, not a solution: your expression is likely to be [`[\s\S]+?`](https://regex101.com/r/mS9uD1/1) - mind the lowercase `\s` and the lazy operator (`+?`). — Jan, Jun 08 '16 at 15:36
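
To illustrate the remark above, here is a small sketch of how the character class and the lazy quantifier change the result. The sample strings are made up for illustration and are not taken from the asker's files.

```javascript
// Made-up sample: two documents on separate lines.
const multiLine = 'DOC_START\nTEXT\nDOC_FINAL\nDOC_START\nTEXT\nDOC_FINAL';

// [\S\S] only matches non-whitespace, so it cannot cross the newline after DOC_START.
console.log(/DOC_START[\S\S]+DOC_FINAL/.test(multiLine)); // false
// [\s\S] matches any character, including newlines.
console.log(/DOC_START[\s\S]+DOC_FINAL/.test(multiLine)); // true

// Greedy + runs to the LAST DOC_FINAL, swallowing both documents in one match.
const greedy = multiLine.match(/DOC_START[\s\S]+DOC_FINAL/g);
// Lazy +? stops at the FIRST DOC_FINAL after each DOC_START.
const lazy = multiLine.match(/DOC_START[\s\S]+?DOC_FINAL/g);

console.log(greedy.length); // 1
console.log(lazy.length);   // 2
```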

score 0 · Answer 1 · answered Jun 08 '16 at 15:37

It would be better not to use regex, especially with large documents. Use indexOf():

    var hugeDoc = 'DOC_STARTxxDOC_ENDOTHER_STARTyyOTHER_END';
    var result = [];
    var start = 0;

    var possibleDelimiters = [
        {'start': 'OTHER_START', 'end': 'OTHER_END'},
        {'start': 'DOC_START', 'end': 'DOC_END'}
    ];

    function parseDoc(delimiter) {
        // look for the end marker only after the start marker
        var end = hugeDoc.indexOf(delimiter.end, start + delimiter.start.length);
        // indexOf returns -1 (which is truthy) when nothing is found
        if (end === -1) return false;
        result.push(hugeDoc.slice(start + delimiter.start.length, end));
        //add +1 here, if you have a new line after DOC_END
        start = end + delimiter.end.length;
        return true;
    }

    do {
        var found = false;
        // plain index loop: avoids the implicit global and for...in on an array
        for (var ix = 0; ix < possibleDelimiters.length; ix++) {
            var delimiter = possibleDelimiters[ix];
            // only accept a start marker located exactly at the current position
            if (hugeDoc.indexOf(delimiter.start, start) === start) {
                found = parseDoc(delimiter) || found;
            }
        }
    } while (found);

    var node = document.getElementById('result');
    node.innerHTML = JSON.stringify(result);

<html>
  <body>
    <div id="result"></div>
    </body>
</html>

score 0 · Answer 2 · edited May 23 '17 at 10:29

First, I believe you have a typo in your regex: it should be [\s\S] instead of [\S\S] (notice the lowercase s). This correctly matches across lines.

This regex could accomplish what you need for matching such a document; someone could probably make a more optimized version:

/(?:DOC_START|OTHER_START)([\s\S]*?)(?:DOC_FINAL|OTHER_FINAL)/g
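
Since the question asks for start and end options, here is a minimal sketch of building that same pattern from lists of delimiters instead of hard-coding them. The helper names `escapeRegExp` and `splitDocuments` are hypothetical, and the sketch assumes every delimiter is a non-empty literal string and that the whole text fits in memory.

```javascript
// Hypothetical helper: escape regex metacharacters so a delimiter is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: return the body of every document found in `text`.
// `starts` and `ends` are arrays of accepted start and end markers.
function splitDocuments(text, starts, ends) {
  const startAlt = starts.map(escapeRegExp).join('|');
  const endAlt = ends.map(escapeRegExp).join('|');
  // Same shape as the regex above: lazy [\s\S]*? stops at the first end marker.
  const pattern = new RegExp(
    '(?:' + startAlt + ')([\\s\\S]*?)(?:' + endAlt + ')',
    'g'
  );
  // matchAll yields one match per document; capture group 1 is the body.
  return Array.from(text.matchAll(pattern), (match) => match[1].trim());
}

const sample = 'DOC_START\nA\nB\nDOC_FINAL\nOTHER_START\nC\nOTHER_FINAL\n';
const docs = splitDocuments(
  sample,
  ['DOC_START', 'OTHER_START'],
  ['DOC_FINAL', 'OTHER_FINAL']
);
console.log(docs); // [ 'A\nB', 'C' ]
```

Note that any start marker can pair with any end marker here, which matches the mixed DOC_START ... OTHER_FINAL case in the question.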

On the other hand, I would suggest a different approach if possible. For example, if you're doing this within Node.js, I'd strongly suggest checking each line for the DOC_START or DOC_FINAL delimiters, then filling the array with lines until the ending delimiter.

Assuming that you want an array of lines in each document, loose pseudo code following:

create resulting object ({ doc1: null })
read line
if start delimiter
  if current object property is null
    create array (doc#: [])
else if end delimiter
  create new doc property (doc2: null)
else
  add line to array
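
The pseudo code above could look roughly like this in JavaScript. This is only a sketch: the function name `splitByLines` is hypothetical, it returns a plain array of documents rather than an object with doc# properties, and it assumes each delimiter sits alone on its own line. For truly huge files in Node.js you would probably feed lines from a stream rather than calling split() on the whole text.

```javascript
// Hypothetical helper: group lines into documents, one array of lines per document.
function splitByLines(text, starts, ends) {
  const startSet = new Set(starts);
  const endSet = new Set(ends);
  const docs = [];
  let current = null; // null means "not inside a document"

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (startSet.has(line)) {
      // start delimiter: open a new document
      current = [];
    } else if (endSet.has(line)) {
      // end delimiter: close the open document, if any
      if (current !== null) docs.push(current);
      current = null;
    } else if (current !== null) {
      // ordinary line inside a document
      current.push(rawLine);
    }
  }
  return docs;
}

const input = 'DOC_START\nA\nB\nDOC_FINAL\nOTHER_START\nC\nOTHER_FINAL';
const lineDocs = splitByLines(
  input,
  ['DOC_START', 'OTHER_START'],
  ['DOC_FINAL', 'OTHER_FINAL']
);
console.log(JSON.stringify(lineDocs)); // [["A","B"],["C"]]
```

Lines outside any document are ignored, and a document that never reaches an end delimiter is dropped; both are choices made for this sketch, not requirements from the question.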

Another note: if you're doing this with HTML, I'd strongly suggest not using regex at all, as HTML is not a regular language :) You'll find many links on SO pointing to evil.
