I am using the below regexp successfully to read between my tags until I reach a case where there is a < sign embedded in my data between the tags. To fix this I want to read between a +> and a </+. There is no way that combination would be used in the database I'm pulling from. When I try to change the code below to do this I get stuck. Have any ideas?

Code:

@fieldValues =  $inFileLine =~ m(>([^<]+)<)g;

My sorry attempt at modifying the code:

@fieldValues =  $inFileLine =~ m(\+>([^<\/\+]+)<\/\+)g;

Data:

<+RecordID+>SWCR000111</+RecordID+><+Title+>My Title Is < Than Yours</+Title+>
Matt
  • Is there supposed to be an embedded `<` somewhere in the sample data? The traditional way to handle this problem is to encode your entities: `&lt;` – Matt Jacob Nov 19 '15 at 21:43
  • @Matt, my description of the problem is getting cut off. Sorry about that. I am hoping to fix my "sorry attempt" so it reads between a +> and a </+. – Matt Nov 19 '15 at 21:45
  • Right now my code is just reading between a > and a <, which works 99.9% of the time except in the instance where a title is entered like this: "My title is < than yours". – Matt Nov 19 '15 at 21:47
  • I don't see a `<` embedded in the content between tags. –  Nov 19 '15 at 21:49
  • Sorry about that SLN. I just fixed it. – Matt Nov 19 '15 at 21:50
  • Please check [this demo](http://ideone.com/4FbuRt) - does it work for you? Also, this can work, too: [`\+>(?!<\+)([^<]*(?:<(?!\/\+)[^<]*)*)<\/\+`](https://regex101.com/r/kC4hP8/2). – Wiktor Stribiżew Nov 19 '15 at 22:04
  • ... yurk, this is why embedding `<` is disallowed in `XML`. Of course, this isn't actually `XML` either, so you're getting the worst of both worlds. – Sobrique Nov 19 '15 at 22:11
  • Sobrique - I know this isn't ideal. I guess I could come up with another tagging scheme... but I'm sure it wouldn't be much prettier. – Matt Nov 20 '15 at 00:20
  • @stribizhev - thank you for the code and awesome website. I was poking around on that site and it's filled with great information. As a new Perl coder my head is spinning. I had no idea the regexp for what I needed was so complicated. Sobrique is right that I need to simplify the data output so I can parse it more easily. Thanks for the lesson and help! -Matt – Matt Nov 20 '15 at 00:33
  • @stribizhev - the 2nd one actually worked better. I had the first one fail on a record for a reason I haven't yet determined. But the 2nd one seems to be getting through fine. Thanks again! – Matt Nov 20 '15 at 01:54
  • So shall I post it, or is sln's answer the one that works best? I ask because you accepted his answer (which is based on a tempered greedy token). – Wiktor Stribiżew Nov 20 '15 at 06:40
  • @stribizhev - Right now I am using your second suggestion (`\+>(?!<\+)([^<]*(?:<(?!\/\+)[^<]*)*)<\/\+`). Your first suggestion worked on a small dataset but failed on a larger one, probably due to another embedded > or +. I haven't root-caused it yet. However, your second suggestion has worked without failure so far. SLN's suggestion worked on a small dataset, but I have not yet had a chance to push a large dataset through it. – Matt Nov 20 '15 at 13:23
  • Thank you for the update. `(?!<\+)` in the 2nd suggestion just makes sure the `+>` is not followed with `<+`. If this rule is universal in your base, I will post with all explanations. – Wiktor Stribiżew Nov 20 '15 at 13:27
  • @stribizhev - That rule is universal. The only way that could happen is if the field I am pulling is blank, and I am protecting against that. I'm not pulling any fields that are blank. I really appreciate you taking the time out of your busy day to answer my question. Thank you so much! Also thank you for that site where I can play with the regular expressions and see the results in real time. Totally cool! – Matt Nov 20 '15 at 13:37

2 Answers

Since it works for you as the +> cannot be followed with <+, I am posting my comment as an answer.

This regex should be safe to use even with very large files:

\+>(?!<\+)([^<]*(?:<(?!\/\+)[^<]*)*)<\/\+

See regex demo

Here is what it is doing:

  • \+>(?!<\+) - matches +> (with \+>) that is not followed with <+ (due to the negative lookahead (?!<\+))
  • ([^<]*(?:<(?!\/\+)[^<]*)*) - matches and stores in Group 1
    • [^<]* - 0 or more characters other than < followed by...
    • (?:<(?!\/\+)[^<]*)* - 0 or more sequences of...
      • <(?!\/\+) - < that is not followed by /+ and then
      • [^<]* - 0 or more characters other than <
  • <\/\+ - matches the final </+

In short, this is the same as \+>(?!<\+)([\s\S]*?)<\/\+, but "unrolled" using the unrolling-the-loop technique so that it copes with large portions of text in-between the delimiters (that is, between +> and the closest </+).
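
For illustration, here is a minimal sketch of how the pattern above could be dropped into the code from the question. The sample line is the one posted in the question; the variable name `$field_re` is made up for this example.

```perl
use strict;
use warnings;

# Sample line copied from the question.
my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+>'
               . '<+Title+>My Title Is < Than Yours</+Title+>';

# Unrolled pattern: a +> that is not followed by <+, then everything
# up to the closest </+ (a lone < inside the value is allowed).
# $field_re is a hypothetical name used only in this sketch.
my $field_re = qr{\+>(?!<\+)([^<]*(?:<(?!/\+)[^<]*)*)</\+};

# In list context with /g, the match returns every Group 1 capture.
my @fieldValues = $inFileLine =~ /$field_re/g;

print "$_\n" for @fieldValues;
```

With the sample line this should yield two values, the record ID and the full title including the embedded `<`.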

Wiktor Stribiżew
  • Awesome. I'm going to get a cup of coffee and study this. You as well as all others that responded are genius! My script is processing very large amounts of text between the delimiters for a few of the fields in my record. Probably explains why the second option worked better. Thank you! – Matt Nov 20 '15 at 13:48
  • If you want to study unroll the loop technique, here is [one link](http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop) and an [SO post](http://stackoverflow.com/questions/17043454/using-regexes-how-to-efficiently-match-strings-between-double-quotes-with-embed). – Wiktor Stribiżew Nov 20 '15 at 13:54

Update: Since you are just looking for something simple, you don't have to
go beyond the definition of the tag delimiters.
That is because your original regex doesn't parse with a definition of a tag at all;
it only reads between a > and a <.

The solution boils down to this very simple regex -

Find: <(?!/?\+)
Replace: &lt;
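
A rough sketch of how this find/replace might be wired into the original code, assuming the data never contains a literal `&lt;` of its own (otherwise the decode step at the end would alter it). The variable `$escaped` is a name invented for this example.

```perl
use strict;
use warnings;

# Sample line copied from the question.
my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+>'
               . '<+Title+>My Title Is < Than Yours</+Title+>';

# Work on a copy and encode every < that does not start <+ or </+.
# $escaped is a hypothetical name used only in this sketch.
(my $escaped = $inFileLine) =~ s{<(?!/?\+)}{&lt;}g;

# The simple pattern from the question now works unchanged,
# because the only < characters left belong to delimiters.
my @fieldValues = $escaped =~ m{>([^<]+)<}g;

# Decode the entity again (assumes no genuine "&lt;" in the data).
s/&lt;/</g for @fieldValues;

print "$_\n" for @fieldValues;
```

This uses only core Perl, so nothing extra has to be installed.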


If you still want to proceed with the misconception that +> .. </+ delineates
something between tags, this is the original answer.


Typically it's done with negative assertions on a character by character basis.

m{\+>((?:(?!\+>|</\+).)*<(?:(?!\+>|</\+).)*)</\+}s

Formatted:

 \+>
 (                             # (1 start)
      (?:
           (?! \+> | </\+ )
           . 
      )*
      <
      (?:
           (?! \+> | </\+ )
           . 
      )*
 )                             # (1 end)
 </\+

Output:

 **  Grp 0 -  ( pos 42 , len 29 ) 
+>My Title Is < Than Yours</+  
 **  Grp 1 -  ( pos 44 , len 24 ) 
My Title Is < Than Yours  
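
A small sketch of how output like the above could be produced in Perl, using the built-in `@-` and `@+` arrays for match offsets. Note that this pattern requires a `<` inside the value, so on the sample line it only reports the title field. The `@hits` array is a name invented for this example.

```perl
use strict;
use warnings;

# Sample line copied from the question.
my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+>'
               . '<+Title+>My Title Is < Than Yours</+Title+>';

# Tempered pattern from above: each . is consumed only if it does not
# start a +> or </+ delimiter, and the value must contain a <.
my $re = qr{\+>((?:(?!\+>|</\+).)*<(?:(?!\+>|</\+).)*)</\+}s;

# @hits is a hypothetical name; each entry is [pos0, len0, pos1, len1, text].
my @hits;
while ($inFileLine =~ /$re/g) {
    push @hits, [ $-[0], $+[0] - $-[0], $-[1], $+[1] - $-[1], $1 ];
}

for my $h (@hits) {
    printf "Grp 0 - ( pos %d , len %d )\n", $h->[0], $h->[1];
    printf "Grp 1 - ( pos %d , len %d )\n%s\n", $h->[2], $h->[3], $h->[4];
}
```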
  • Thanks SLN. This is really cool and a great lesson. A little over my head still but I am googling :) Thank you sir! – Matt Nov 20 '15 at 00:45
  • With the best will in the world - whilst it is a valid answer to your problem - it will make future maintenance programmers really hate you. – Sobrique Nov 20 '15 at 09:56
  • I agree that these solutions are terribly complicated, but I'm not sure I have an alternative. I'm talking with a ClearQuest database and need to request and push data in a very specific way through their business logic layer. I can't go directly to the SQL Server database. And I'm also not allowed to install Perl packages beyond what is already there due to extreme security standards, red tape, and politics. – Matt Nov 20 '15 at 13:27
  • @Matt - I posted a terribly _simple_ solution for ya. –  Nov 20 '15 at 17:16