
The file I am trying to process looks like this:

...
...
15 Apr 2014 22:05 - id: content
15 Apr 2014 22:09 - id: content
15 Apr 2014 22:09 - id: content
with new line
16 Apr 2014 06:56 - id: content
with new line
with new line
16 Apr 2014 06:57 - id: content

16 Apr 2014 06:58 - id: content
...
...

The regex I have come up with is this: \d{1,}[ ][A-Z][a-z]{2}[ ](?:\d{4}[ ]\d{2}[:]\d{2}|\d{2}[:]\d{2}).*

This matches each line that starts with a date, but not the continuation lines that follow it.

This is almost right; I just need to include newline characters. But if I use [\s\S]* instead of .*, only one match is returned.

What I would like to extract is a set of substrings, where each substring starts at a date sequence and ends just before the next date sequence, like so:

...
...
15 Apr 2014 22:05 - id: content //substring 1
15 Apr 2014 22:09 - id: content //substring 2
15 Apr 2014 22:09 - id: content //substring 3
with new line                   //substring 3
16 Apr 2014 06:56 - id: content //substring 4
with new line                   //substring 4
with new line                   //substring 4
16 Apr 2014 06:57 - id: content //substring 5

16 Apr 2014 06:58 - id: content //substring 6
...
...

Any idea what I'm missing?

Ivan Bacher
  • If you're trying to get groups of dates and content, why use such a complicated regex? Splitting on two newlines to get the groups, then on single newlines to get each line, seems a lot easier. – adeneo Feb 22 '15 at 15:47
  • Yes, but there could still be content on the next line that belongs to the previous one. – Ivan Bacher Feb 22 '15 at 15:51

2 Answers


You need to use a positive lookahead assertion.

\d{1,}[ ][A-Z][a-z]{2}[ ](?:\d{4}[ ]\d{2}[:]\d{2}|\d{2}[:]\d{2})[\s\S]*?(?:(?!\n\n)[\s\S])*?(?=\n\d{1,}[ ])|\d{1,}[ ][A-Z][a-z]{2}[ ](?:\d{4}[ ]\d{2}[:]\d{2}|\d{2}[:]\d{2}).*


> var str = '...\n...\n15 Apr 2014 22:05 - id: content\n15 Apr 2014 22:09 - id: content\n15 Apr 2014 22:09 - id: content\nwith new line\n16 Apr 2014 06:56 - id: content\nwith new line\nwith new line\n16 Apr 2014 06:57 - id: content\n\n16 Apr 2014 06:58 - id: content\n...\n...';
undefined
> var re = /\d{1,}[ ][A-Z][a-z]{2}[ ](?:\d{4}[ ]\d{2}[:]\d{2}|\d{2}[:]\d{2})[\s\S]*?(?:(?!\n\n)[\s\S])*?(?=\n\d{1,}[ ])|\d{1,}[ ][A-Z][a-z]{2}[ ](?:\d{4}[ ]\d{2}[:]\d{2}|\d{2}[:]\d{2}).*/gm;
undefined
> str.match(re)
[ '15 Apr 2014 22:05 - id: content',
  '15 Apr 2014 22:09 - id: content',
  '15 Apr 2014 22:09 - id: content\nwith new line',
  '16 Apr 2014 06:56 - id: content\nwith new line\nwith new line',
  '16 Apr 2014 06:57 - id: content\n',
  '16 Apr 2014 06:58 - id: content' ]
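
The same lookahead idea can be written more compactly. The following is only a sketch under some assumptions: the names stamp, entryRe and entries are illustrative, the year is treated as optional (as in the original pattern), and the input is assumed to end with the last entry, since anything after it would be absorbed into that entry.

```javascript
// Sample input mirroring the layout shown in the question
// (without the trailing "..." lines; see the assumption above).
const input = [
  '...',
  '...',
  '15 Apr 2014 22:05 - id: content',
  '15 Apr 2014 22:09 - id: content',
  '15 Apr 2014 22:09 - id: content',
  'with new line',
  '16 Apr 2014 06:56 - id: content',
  'with new line',
  'with new line',
  '16 Apr 2014 06:57 - id: content',
  '',
  '16 Apr 2014 06:58 - id: content',
].join('\n');

// Date stamp such as "15 Apr 2014 22:05"; the year part is optional.
const stamp = '\\d+ [A-Z][a-z]{2} (?:\\d{4} )?\\d{2}:\\d{2}';

// Start at a stamp at the beginning of a line, then lazily consume any
// characters (newlines included) until the lookahead sees either a newline
// followed by another stamp, or the end of the input.
const entryRe = new RegExp(
  `^${stamp}[\\s\\S]*?(?=\\n${stamp}|(?![\\s\\S]))`,
  'gm'
);

const entries = input.match(entryRe) || [];
console.log(entries);
```

The lazy [\s\S]*? is what keeps each match from running to the end of the file, and the lookahead gives it a place to stop without consuming the next date.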
Avinash Raj
  • Thx, just a slight change: `(\d{1,}[ ][A-Z][a-z]{2}[ ](?:\d{4}[ ]\d{2}[:]\d{2}|\d{2}[:]\d{2})[\s\S]*?(?:(?!)[\s\S])*?(?=\d{1,}[ ])|\d{1,}[ ][A-Z][a-z]{2}[ ](?:\d{4}[ ]\d{2}[:]\d{2}|\d{2}[:]\d{2})[\s\S]*)` Demo: https://regex101.com/r/cS7sB7/1 – Ivan Bacher Feb 22 '15 at 16:05

See the second answer here: How to use JavaScript regex over multiple lines?

Try using the non-greedy quantifier [\s\S]*? in place of [\s\S]* and see what it returns. Alternatively, take the whole block as a single match and split it on newlines afterwards.
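
The split-afterwards route could look roughly like the sketch below. It assumes that every entry starts with a date stamp at the beginning of a line and that any other line belongs to the entry before it; groupByStamp and STAMP_AT_START are hypothetical names used only for illustration.

```javascript
// Matches a line that begins with a date stamp such as "15 Apr 2014 22:05".
// No "g" flag, so test() keeps no state between calls.
const STAMP_AT_START = /^\d+ [A-Z][a-z]{2} (?:\d{4} )?\d{2}:\d{2}/;

// Split the text into lines, then fold every line that does not start
// with a stamp into the most recent entry.
function groupByStamp(text) {
  const records = [];
  for (const line of text.split('\n')) {
    if (STAMP_AT_START.test(line)) {
      records.push(line);                          // a new entry begins
    } else if (records.length > 0) {
      records[records.length - 1] += '\n' + line;  // continuation line
    }
    // Lines before the first stamp are ignored.
  }
  return records;
}

const sample =
  '...\n15 Apr 2014 22:05 - id: a\n15 Apr 2014 22:09 - id: b\nmore\n\n16 Apr 2014 06:58 - id: c';
console.log(groupByStamp(sample));
```

This avoids the lookahead entirely, at the cost of a loop. Blank lines between entries end up attached to the preceding entry and can be trimmed afterwards if they are not wanted.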

Community
  • It's better to post the answer itself and add the link as a reference; the linked content may be removed. – Razib Feb 22 '15 at 15:59