Regexp to read to a plus sign

Question

I am using the below regexp successfully to read between my tags until I reach a case where there is a < sign embedded in my data between the tags. To fix this I want to read between a +> and a </+. There is no way that combination would be used in the database I'm pulling from. When I try to change the code below to do this I get stuck. Have any ideas?

Code:

@fieldValues =  $inFileLine =~ m(>([^<]+)<)g;

My sorry attempt at modifying the code:

@fieldValues =  $inFileLine =~ m(\+>([^<\/\+]+)<\/\+)g;

Data:

<+RecordID+>SWCR000111</+RecordID+><+Title+>My Title Is < Than Yours</+Title+>

Is there supposed to be an embedded `<` somewhere in the sample data? The traditional way to handle this problem is to encode your entities: `&lt;` — Matt Jacob, Nov 19 '15 at 21:43
@Matt, my description of the problem is getting cut off. Sorry about that. I am hoping to fix my "sorry attempt" so it reads between a +> and a </+. — Matt, Nov 19 '15 at 21:45
Right now my code is just reading between a > and a < which works 99.9 % of the time except in the instance where a title is entered like this "My title is < than yours". — Matt, Nov 19 '15 at 21:47
Please check [this demo](http://ideone.com/4FbuRt) - does it work for you? Also, this can work, too: [`\+>(?!<\+)([^<]*(?:<(?!\/\+)[^<]*)*)<\/\+`](https://regex101.com/r/kC4hP8/2). — Wiktor Stribiżew, Nov 19 '15 at 22:04
... yurk, this is why in `XML` embedding `<` is disallowed. Of course, neither is this actually `XML` either, so you're getting the worst of both worlds. — Sobrique, Nov 19 '15 at 22:11
Sobrique - I know this isn't ideal. I guess I could come up with another tagging scheme... but I'm sure it wouldn't be much prettier. — Matt, Nov 20 '15 at 00:20
@stribizhev - thank you for the code and awesome website. I was poking around on that site and it's filled with great information. As a new Perl coder my head is spinning. I had no idea the regexp for what I needed was so complicated. Sobrique is right that I need to simplify the data output so I can parse it more easily. Thanks for the lesson and help! -Matt — Matt, Nov 20 '15 at 00:33
@stribizhev - the 2nd one actually worked better. I had the first one fail on a record for a reason I haven't yet determined. But the 2nd one seems to be getting through fine. Thanks again! — Matt, Nov 20 '15 at 01:54
So shall I post it, or is it sln's answer that works best? I ask because you accepted his answer (which is based on a tempered greedy token). — Wiktor Stribiżew, Nov 20 '15 at 06:40
@stribizhev - Right now I am using your second suggestion (\+>(?!<\+)([^<]*(?:<(?!\/\+)[^<]*)*)<\/\+). Your first suggestion worked on a small dataset but failed on a larger one, probably due to another embedded > or +. I haven't root-caused it yet. However, your second suggestion has worked without failure so far. sln's suggestion worked on a small dataset, but I have not yet had a chance to push a large dataset through it. — Matt, Nov 20 '15 at 13:23
Thank you for the update. `(?!<\+)` in the 2nd suggestion just makes sure the `+>` is not followed with `<+`. If this rule is universal in your base, I will post with all explanations. — Wiktor Stribiżew, Nov 20 '15 at 13:27
@stribizhev - That rule is universal. The only way that could happen is if the field I am pulling is blank and I am protecting against that. I'm not pulling any fields that are blank. I really appreciate you taking the time out of your busy day to answer my question. Thank you so much! Also thank you for that site where I can play with the regular expressions and see the results in real time. Totally cool! — Matt, Nov 20 '15 at 13:37

score 1 · Accepted Answer · answered Nov 20 '15 at 13:45

Since it works for you as the +> cannot be followed with <+, I am posting my comment as an answer.

This regex should be safe to use even with very large files:

\+>(?!<\+)([^<]*(?:<(?!\/\+)[^<]*)*)<\/\+

See regex demo

Here is what it is doing:

\+>(?!<\+) - matches +> (with \+>) that is not followed with <+ (due to the negative lookahead (?!<\+))
([^<]*(?:<(?!\/\+)[^<]*)*) - matches and stores in Group 1
- [^<]* - 0 or more characters other than < followed by...
- (?:<(?!\/\+)[^<]*)* - 0 or more sequences of...
  - <(?!\/\+) - < that is not followed by /+ and then
  - [^<]* - 0 or more characters other than <
<\/\+ - matches the final </+

In short, this is the same as \+>(?!<\+)([\s\S]*?)<\/\+, but "unwrapped" using the unrolling-the-loop technique to allow large portions of text in-between the delimiters (that is, between +> and the closest </+).
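
As a minimal sketch of how this could be used, the following Perl script runs both the unrolled pattern and the lazy `[\s\S]*?` form against the sample line from the question. The variable names `$unrolled`, `$lazy` and `@lazyValues` are made up for illustration; only `$inFileLine` and `@fieldValues` come from the question.

```perl
use strict;
use warnings;

# Sample line from the question.
my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+>'
               . '<+Title+>My Title Is < Than Yours</+Title+>';

# Unrolled pattern from this answer. With qr{} delimiters the "/"
# does not need a backslash.
my $unrolled = qr{\+>(?!<\+)([^<]*(?:<(?!/\+)[^<]*)*)</\+};

# Lazy form that the answer describes as equivalent.
my $lazy = qr{\+>(?!<\+)([\s\S]*?)</\+};

# In list context, m//g with one capture group returns every Group 1.
my @fieldValues = $inFileLine =~ /$unrolled/g;
my @lazyValues  = $inFileLine =~ /$lazy/g;

print "$_\n" for @fieldValues;
# prints:
# SWCR000111
# My Title Is < Than Yours
```

On this sample both patterns return the same two values; the difference the answer points to is how they behave on long stretches of text between the delimiters.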

Awesome. I'm going to get a cup of coffee and study this. You as well as all others that responded are genius! My script is processing very large amounts of text between the delimiters for a few of the fields in my record. Probably explains why the second option worked better. Thank you! — Matt, Nov 20 '15 at 13:48
If you want to study unroll the loop technique, here is [one link](http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop) and an [SO post](http://stackoverflow.com/questions/17043454/using-regexes-how-to-efficiently-match-strings-between-double-quotes-with-embed). — Wiktor Stribiżew, Nov 20 '15 at 13:54

score 0 · Answer 2 · 2015-11-20T17:15:52.103

Update: Since you are just looking for something simple, you don't have to
go beyond the definition of the tag delimiters,
because you aren't parsing with a definition of a tag at all.

The solution boils down to this very simple regex -

Find: <(?!/?\+)
Replace: &lt;
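
A rough sketch of that find/replace step in Perl, under the assumption that the intended replacement text is the entity `&lt;` (as suggested in the comments on the question). The names `$escaped` and the decoding step at the end are illustrative additions, not part of the answer.

```perl
use strict;
use warnings;

# Sample line from the question.
my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+>'
               . '<+Title+>My Title Is < Than Yours</+Title+>';

# Escape every "<" that does not start "<+" or "</+".
# Assumption: the replacement is the entity "&lt;".
(my $escaped = $inFileLine) =~ s{<(?!/?\+)}{&lt;}g;

# The original simple pattern from the question works again,
# because every remaining "<" now begins a tag.
my @fieldValues = $escaped =~ m(>([^<]+)<)g;

# Turn the entity back into a literal "<" in the extracted values.
s/&lt;/</g for @fieldValues;

print "$_\n" for @fieldValues;
# prints:
# SWCR000111
# My Title Is < Than Yours
```

One caveat with this sketch: a field that already contains the literal text `&lt;` would also be decoded to `<` in the last step, so the round trip is only safe if that text never appears in the data.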

If you want to proceed with a misconception that +> .. </+ delineates
something between tags, this is the original.

Typically it's done with negative assertions on a character by character basis.

m{\+>((?:(?!\+>|</\+).)*<(?:(?!\+>|</\+).)*)</\+}s

Formatted:

 \+>
 (                             # (1 start)
      (?:
           (?! \+> | </\+ )
           . 
      )*
      <
      (?:
           (?! \+> | </\+ )
           . 
      )*
 )                             # (1 end)
 </\+

Output:

 **  Grp 0 -  ( pos 42 , len 29 ) 
+>My Title Is < Than Yours</+  
 **  Grp 1 -  ( pos 44 , len 24 ) 
My Title Is < Than Yours
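
For reference, here is a small sketch that runs the formatted pattern under the `/x` and `/s` flags and reports the same positions as the output above, using Perl's `@-` and `@+` offset arrays. The variable names (`$tempered`, `$pos0` and so on) are illustrative. Note that, as written, the pattern requires a literal `<` inside the captured text, so on the sample line it matches only the Title field.

```perl
use strict;
use warnings;

# Sample line from the question.
my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+>'
               . '<+Title+>My Title Is < Than Yours</+Title+>';

# Formatted pattern from this answer; /x ignores the layout whitespace,
# /s lets "." match newlines as well.
my $tempered = qr{
    \+>
    (                               # group 1 start
        (?: (?! \+> | </\+ ) . )*   # any char that does not begin a delimiter
        <                           # the embedded "<"
        (?: (?! \+> | </\+ ) . )*
    )                               # group 1 end
    </\+
}xs;

my ($pos0, $len0, $pos1, $len1, $group1);
if ($inFileLine =~ $tempered) {
    ($pos0, $len0) = ($-[0], $+[0] - $-[0]);   # whole match
    ($pos1, $len1) = ($-[1], $+[1] - $-[1]);   # group 1
    $group1 = $1;
    print "Grp 0 - ( pos $pos0 , len $len0 )\n";
    print "Grp 1 - ( pos $pos1 , len $len1 )\n";
    print "$group1\n";
}
```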

edited Nov 20 '15 at 17:15

answered Nov 19 '15 at 21:56

Thanks SLN. This is really cool and a great lesson. A little over my head still but I am googling :) Thank you sir! – Matt Nov 20 '15 at 00:45
With the best will in the world - whilst it is a valid answer to your problem - it will make future maintenance programmers really hate you. – Sobrique Nov 20 '15 at 09:56
I agree that these solutions are terribly complicated but I'm not sure I have an alternative. I'm talking with a ClearQuest database and need to request and push data in a very specific way through their business logic layer. I can't go directly to the SQL Server database. And I'm also not allowed to install Perl packages beyond what is already there due to extreme security standards, red tape, and politics. – Matt Nov 20 '15 at 13:27
@Matt - I posted a terribly _simple_ solution for ya. – Nov 20 '15 at 17:16
