I'm trying to work out the best way to extract chunks of base64 from a file containing both plain text and base64.

Say I have the string:

Subject: Fwd: Test.
Thread-Topic: Test.
Date: Tue, 5 May 2020 19:02:42 +0000

U3ViamVjdCB0byBiYXNlNjQgZGVjb2Rl

--_000_DB6PR10MB1831AAD962E88A95B21547589EA70DB6PR10MB1831EURP_
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

IExvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0LCBjb25zZWN0ZXR1ciBhZGlwaXNjaW5nIGVsaXQu
IEludGVnZXIgc2VtIG51bGxhLCB0aW5jaWR1bnQgZXUgdmVuZW5hdGlzIHNlZCwgZWdlc3RhcyBz
ZWQgcmlzdXMuIEZ1c2NlIG5vbiBkb2xvciBmZWxpcy4gTnVuYyB2aXRhZSBuaXNsIG1vbGVzdGll
LCBtb2xsaXMgbWFzc2EgZXQsIGVsZWlmZW5kIHB1cnVzLiBQcm9pbiBhIGFsaXF1ZXQgZXJhdC4g
Q3JhcyB2ZWhpY3VsYSBtb2xlc3RpZSBlbGl0IGFjIHByZXRpdW0uIE5hbSBhIGxlbyBmcmluZ2ls
bGEsIGdyYXZpZGEgbGVvIHNpdCBhbWV0LCBvcm5hcmUgYXVndWUuIE51bGxhbSBmYWNpbGlzaXMs
IGxlbyBldCBydXRydW0gaGVuZHJlcml0LA==    

--_000_DB6PR10MB1831AAD962E88A95B21547589EA70DB6PR10MB1831EURP_--

Mail Retrieved

I would expect the output to be the following strings:

U3ViamVjdCB0byBiYXNlNjQgZGVjb2Rl

IExvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0LCBjb25zZWN0ZXR1ciBhZGlwaXNjaW5nIGVsaXQu
IEludGVnZXIgc2VtIG51bGxhLCB0aW5jaWR1bnQgZXUgdmVuZW5hdGlzIHNlZCwgZWdlc3RhcyBz
ZWQgcmlzdXMuIEZ1c2NlIG5vbiBkb2xvciBmZWxpcy4gTnVuYyB2aXRhZSBuaXNsIG1vbGVzdGll
LCBtb2xsaXMgbWFzc2EgZXQsIGVsZWlmZW5kIHB1cnVzLiBQcm9pbiBhIGFsaXF1ZXQgZXJhdC4g
Q3JhcyB2ZWhpY3VsYSBtb2xlc3RpZSBlbGl0IGFjIHByZXRpdW0uIE5hbSBhIGxlbyBmcmluZ2ls
bGEsIGdyYXZpZGEgbGVvIHNpdCBhbWV0LCBvcm5hcmUgYXVndWUuIE51bGxhbSBmYWNpbGlzaXMs
IGxlbyBldCBydXRydW0gaGVuZHJlcml0LA==

I've created a regex that produces the desired match:

^\n([a-zA-Z0-9+\/=\n]*)\n$

But the following in C# returns no matches:

var test1 = Regex.Matches(input, @"^\r\n([a-zA-Z0-9+/=\n]*)\r\n$");    
var test2 = Regex.Matches(input, @"^\n([a-zA-Z0-9+/=\n]*)\n$");

Whilst I can fix the regex, I'm now wondering if there's a more efficient way of achieving this. Additionally, some of the input strings will be rather large.

atoms
  • You appear to be reinventing the wheel. See marked duplicates for a couple of the many existing questions on SO that address parsing MIME-formatted content. If those don't address your need, fix your question so that it isn't so broad and explain exactly why MIME-parsing doesn't address it. The regex approach you've tried looks like a complete non-starter to me, given that there are lots of examples of text that could match a base64-passing regex even though the text isn't itself intended as base64. You need to respect the frame boundaries in the text you're trying to parse. – Peter Duniho May 08 '20 at 20:00
  • Yeah, you're totally right. Thanks @PeterDuniho! I was going to check each string from the loose match to see if its length was divisible by 4 and then try parsing it. Thanks for your time – atoms May 08 '20 at 20:01
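
For illustration only, here is a minimal sketch of the check described in the last comment: take each loose match, test that its length is divisible by 4, then try to decode it. The names `Base64Extractor`, `Extract`, and `Candidate` are hypothetical and not from the question. The pattern is an assumed variant of the one in the question that uses `RegexOptions.Multiline`, so that `^` and `$` anchor at every line rather than only at the start and end of the whole input, and that tolerates `\r\n` line endings and trailing blanks.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Base64Extractor
{
    // Loose match: a run of consecutive lines made up only of base64 characters.
    // Multiline makes ^ and $ match at each line boundary; "\r?" allows CRLF input.
    private static readonly Regex Candidate = new Regex(
        @"^[A-Za-z0-9+/=]+[ \t]*\r?$(?:\n[A-Za-z0-9+/=]+[ \t]*\r?$)*",
        RegexOptions.Multiline);

    public static List<byte[]> Extract(string input)
    {
        var chunks = new List<byte[]>();
        foreach (Match m in Candidate.Matches(input))
        {
            // Remove line breaks and trailing blanks before validating.
            string compact = Regex.Replace(m.Value, @"\s+", "");

            // Padded base64 always has a length that is a multiple of 4.
            if (compact.Length % 4 != 0)
                continue;

            try
            {
                chunks.Add(Convert.FromBase64String(compact));
            }
            catch (FormatException)
            {
                // Looked like base64 but did not decode; skip it.
            }
        }
        return chunks;
    }
}
```

Calling `Base64Extractor.Extract(input)` returns the decoded bytes of every candidate that passes both checks. As the first comment points out, this can still pick up ordinary text: a line such as `Test` consists only of base64 characters and has a length divisible by 4, so it decodes without error. Parsing the MIME structure and honouring `Content-Transfer-Encoding: base64` avoids that class of false positive.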

0 Answers