
I've recently been learning about regular expressions. I'm trying to gather FDF objects into individual strings, which I can then parse. The problem is that my code only matches the first occurrence; all other "objects" in the FDF file are ignored.

Each object begins on its own line with two numbers and the string "obj", followed by a carriage return (not a line feed). It ends after a carriage return and the string "endobj".

//testing parsing into objects...
List<String> FDFobjects = new List<String>();

String strRegex = @"^(?<obj>\d+ \d+) obj\r(?<objData>.+?)\rendobj(?=\r)";
Regex useRegex = new Regex(strRegex, RegexOptions.Multiline | RegexOptions.Singleline);

String fdfString;
using (StreamReader reader = new StreamReader(FileName))
    fdfString = reader.ReadToEnd();

foreach (Match useMatch in useRegex.Matches(fdfString))
    FDFobjects.Add(useMatch.Groups["objData"].Value);

if (FDFobjects.Count > 0)
    Console.WriteLine(FDFobjects[0]);

Console.WriteLine(FDFobjects.Count);

(I was using $ at the end of the regex string, but that matches 0 times, whereas using (?=\r) matches once.)

Edit: Some line breaks are CR/LF, and some are just CR. I don't know whether this is consistent across the different parts of the file, so I check for all of them. I've settled on the following, which seems to work perfectly so far (and I'm not using the Multiline option). Adding the lookbehind is what made the biggest difference here.

... = new Regex(@"(?<=^|[^\\](\r\n|\r|\n))(?<objName>\d+ \d+) obj(\r\n|\r|\n)(?<objData>.*?)(?<!\\)(\r\n|\r|\n)endobj(?=\r\n|\r|\n|$)", RegexOptions.Singleline);
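For reference, here is a minimal sketch of how that final pattern could be wrapped so it returns the `objData` of every object. The class name `FdfObjectSplitter`, the method name `Split`, and the sample input are made up for illustration; only the pattern and `RegexOptions.Singleline` come from the edit above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of any FDF library): collects the body of
// every "N N obj ... endobj" block using the pattern from the edit above.
public static class FdfObjectSplitter
{
    private static readonly Regex ObjectRegex = new Regex(
        @"(?<=^|[^\\](\r\n|\r|\n))(?<objName>\d+ \d+) obj(\r\n|\r|\n)(?<objData>.*?)(?<!\\)(\r\n|\r|\n)endobj(?=\r\n|\r|\n|$)",
        RegexOptions.Singleline); // Singleline only: '.' also matches line breaks

    public static List<string> Split(string fdfText)
    {
        var objects = new List<string>();
        foreach (Match match in ObjectRegex.Matches(fdfText))
        {
            // Keep only the text between the "obj" line and the "endobj" line.
            objects.Add(match.Groups["objData"].Value);
        }
        return objects;
    }
}
```

Calling `FdfObjectSplitter.Split("1 0 obj\r<< /A 1 >>\r\nendobj\r2 0 obj\n<< /B 2 >>\rendobj")` on that invented input, which mixes CR, LF and CR/LF, should yield two entries. As the comments below point out, this is still a heuristic and not a full FDF parser.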
someprogrammer
  • Try `@"^(?<obj>\d+ \d+) obj\r?\n(?<objData>.+?)\r?\nendobj(?=\r?\n)"`. Maybe changing `\r` to a more flexible `\r?\n` can help. Without an exact sample string, it is not easy to help you with this pattern. – Wiktor Stribiżew Sep 21 '16 at 20:04
  • @Wiktor: Thanks. It doesn't work. The FDF is using carriage return only, it appears. – someprogrammer Sep 21 '16 at 20:08
  • Then provide the exact input string with exact expected output. – Wiktor Stribiżew Sep 21 '16 at 20:08
  • I cannot convince myself that using a regex to parse FDF data is going to be 100% reliable. What if the data contains the string "endobj" at the end of a line? – Andrew Morton Sep 21 '16 at 20:13
  • @Andrew: That's why I check that the "endobj" string is on its own line. It's preceded by a \r. – someprogrammer Sep 21 '16 at 20:15
  • @someprogrammer That is still not reliable. The objects can contain arbitrary data - that is why there is a data length for the objects. All you have to do is get the specification for the format of FDF files and parse it appropriately. – Andrew Morton Sep 21 '16 at 20:19
  • @Andrew: You might be right. But, I believe the object would not contain an unescaped carriage return. I could be wrong. – someprogrammer Sep 21 '16 at 20:20
  • Even if `objData` does contain an extra `\r`, I would expect it to be adequately (lazily) matched by the `.+?`. – dahlbyk Sep 21 '16 at 20:27
  • Anyway, instead of `(?=\r)`, you need to use `(?=\r?$)`, that is the best way to match the end of the line/string with multiline option on. – Wiktor Stribiżew Sep 21 '16 at 20:28
  • @someprogrammer Please see [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/a/1758162/1115360) and change "HTML" to "FDF". – Andrew Morton Sep 21 '16 at 20:28
  • ...changed the end to (?=\r|$), but it still only matches the first one – someprogrammer Sep 21 '16 at 20:32
  • I've fixed it now by starting the regex string with (?<=^|\r). Thanks for your comments. – someprogrammer Sep 21 '16 at 20:35

2 Answers


The ^ in your pattern is only going to match at the start of the string. Try \b instead.

dahlbyk
  • The first object is not at the start of the string and it matches. The RegexOptions.Multiline option is supposed to change the matching of ^ and $. – someprogrammer Sep 21 '16 at 20:06
  • Good point... I've never tried mixing `Singleline` and `Multiline` before - do you really need both? – dahlbyk Sep 21 '16 at 20:15
  • I hear you. The unfortunately named "Singleline" and "Multiline" options are unrelated. "Singleline" has to do with whether the dot matches new lines or not. – someprogrammer Sep 21 '16 at 20:18
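
To illustrate the point made in the comments above, here is a small sketch showing that the two options act on different parts of a pattern: `Singleline` changes what `.` matches, while `Multiline` changes where `^` and `$` match. The class name `OptionsDemo`, the method name `Count`, and the sample input `"a\nb"` are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: counts matches of a pattern under the given options.
public static class OptionsDemo
{
    public static int Count(string input, string pattern, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options).Count;
    }
}
```

With the input `"a\nb"`, the pattern `a.b` finds nothing under `RegexOptions.None` but matches once under `RegexOptions.Singleline`, because the dot may then consume the `\n`. The pattern `^b` finds nothing under `RegexOptions.None` but matches once under `RegexOptions.Multiline`, because `^` may then match after the `\n`. Combining both options is therefore legitimate when a pattern needs both behaviours.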

It seems that MSDN Regex Web help is lying about what ^ matches:

^  -   Matches the position at the start of the searched string. If the m (multiline search) character is included with the flags, ^ also matches the position following \n or \r.

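In .NET, ^ with the multiline option only matches the position after \n, not after a bare \r. For example, the @"(?m)^\d+" pattern matches 1, 2 and 4 in the "1\r\n2\r3\n4" input; 3 is skipped because it is preceded by \r. That is why only the first object matched in a file that separates lines with \r alone.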
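
The behaviour can be checked with a short sketch like the one below. The class name `CaretDemo` and the method name `LineStartNumbers` are invented for the example; it only relies on `Regex.Matches` and LINQ.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the text of every match of the pattern.
public static class CaretDemo
{
    public static string[] LineStartNumbers(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()          // MatchCollection is non-generic on older frameworks
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For the input `"1\r\n2\r3\n4"`, the pattern `(?m)^\d+` returns 1, 2 and 4, whereas a lookbehind such as `(?<=^|\r|\n)\d+` (no multiline flag needed) returns 1, 2, 3 and 4, because it treats a bare `\r` as a line break too.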

Use (?<=\r|^) at the beginning and (?=\r|$) at the end:

var s = "1 2 obj\rObj1\rendobj\r2 3 obj\rObj2\rendobj\r3 45 obj\rObj3\rendobj";
var matches = Regex.Matches(s, @"(?<=\r|^)(?<obj>\d+ \d+) obj\r(?<objData>.+?)\rendobj(?=\r|$)",
        RegexOptions.Multiline | RegexOptions.Singleline);
foreach (Match m in matches)
{
    Console.WriteLine("___ MATCH ___");
    Console.WriteLine(m.Value);
}

Outputs all 3 matches:

___ MATCH ___
1 2 obj
Obj1
endobj
___ MATCH ___
2 3 obj
Obj2
endobj
___ MATCH ___
3 45 obj
Obj3
endobj


Wiktor Stribiżew