I have a file of close to 800 MB that consists of several records, each a header followed by content. A header looks something like M=013;X=rast;645.jpg, and the content is the binary data of the JPG file.

So the file looks something like this

M=013;X=rast;645.jpgNULœDüŠˆ.....M=217;X=rast;113.jpgNULÿñÿÿ&åbÿås....M=217;X=rast;1108.jpgNUL]_ÿ×ÉcË/...

The header can occur in one line or across two lines.

I need to parse this file and extract the individual JPG images.

Since the file is so large, what would be an efficient way to do this? I was hoping to use StreamReader, but I don't have much experience with regular expressions to use with it.

Blorgbeard
blue piranha
  • Here's something to get you started: http://stackoverflow.com/questions/4273699/how-to-read-a-large-1-gb-txt-file-in-net as for reading the file, that's your job no? Perhaps post what you've tried & what errors you run into and we can help further. Otherwise you can always hire a developer to do it for you! – RandomUs1r Aug 18 '14 at 22:12
  • What do you need out of the header? And does the header always end with ".jpg" ? – James Curran Aug 18 '14 at 22:18
  • I wouldn't use regex for this. Maybe look into the jpg spec to see if you can extract a length from it.. – Blorgbeard Aug 18 '14 at 22:37
  • What do you mean - "across two lines"? Do you mean the header is divided by a carriage return and/or line feed? – barrypicker Aug 18 '14 at 22:52
  • If it were me I'd use something like `[^;]+\.jpg` in EditPadPro, which can handle multi-gigabyte files – zx81 Aug 19 '14 at 03:15

1 Answer

RegEx (uses recursion, which is not supported in .NET):
/(M=.+?;X=.+?;.+?\.jpg)(.+?(?=(?1)|$))/gs

.NET RegEx workaround:
/(M=.+?;X=.+?;.+?\.jpg)(.+?(?=M=.+?;X=.+?;.+?\.jpg|$))/gs
Here the (?1) recursion group is replaced with the contents of the 1st capture group.

Live demo and Explanation of RegExp: http://regex101.com/r/nQ3pE0/1

You'll want to use the 2nd capture group for the binary contents. The 1st group matches the header, and the expression also needs the header pattern in the lookahead to know where each image's content stops.
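
As a rough illustration of how the two capture groups could be used from C#, here is a minimal sketch. The class name HeaderSplitter, the method Split and the sample data are hypothetical and not part of the original answer. The sketch assumes the bytes are decoded as ISO-8859-1, so that every byte maps to exactly one character and the binary content survives the round trip through a string.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class HeaderSplitter
{
    // The .NET workaround pattern: group 1 is the header, group 2 is
    // everything up to the next header or the end of the input.
    const string Pattern =
        @"(M=.+?;X=.+?;.+?\.jpg)(.+?(?=M=.+?;X=.+?;.+?\.jpg|$))";

    // Assumption of this sketch: ISO-8859-1 maps each byte to one char,
    // so binary data is not altered by decoding and re-encoding.
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Returns (header, content bytes) pairs in the order they appear.
    public static List<KeyValuePair<string, byte[]>> Split(byte[] data)
    {
        var result = new List<KeyValuePair<string, byte[]>>();
        string input = Latin1.GetString(data);

        // Singleline lets '.' match newlines inside the binary content.
        foreach (Match m in Regex.Matches(input, Pattern, RegexOptions.Singleline))
        {
            string header = m.Groups[1].Value;
            // Group 2 still starts with the NUL separator shown in the question.
            byte[] body = Latin1.GetBytes(m.Groups[2].Value);
            result.Add(new KeyValuePair<string, byte[]>(header, body));
        }
        return result;
    }

    static void Main()
    {
        // Tiny stand-in for the real file: two headers with fake content.
        byte[] sample = Latin1.GetBytes(
            "M=013;X=rast;645.jpg\0AAAAM=217;X=rast;113.jpg\0BBBB");

        foreach (var part in Split(sample))
            Console.WriteLine("{0}: {1} bytes", part.Key, part.Value.Length);
    }
}
```

Note that this loads the whole input into one string, and .NET strings use two bytes per character, so for an 800 MB file it is illustrative rather than memory-efficient. Also, without RegexOptions.Multiline, $ can match just before a final line feed, so \z may be a safer end anchor if the last image could end with that byte.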

*edited in italic

CSᵠ
  • Thanks for your help. This doesn't seem to be a valid C# regex as it is giving me error - unrecognized grouping constructs. any thoughts? I am using something like this MatchCollection mc = Regex.Matches(input, @"(M=.+?;X=.+?;.+?\.jpg)(.+?(?=(?1)|$))", RegexOptions.Singleline); – blue piranha Aug 19 '14 at 04:35
  • Yes, edited the post, I wasn't aware .NET does not understand recursion... friends help at pointing out stuff :) See now if the new one helps – CSᵠ Aug 21 '14 at 22:05