
I was able to use this question as a starting point for parsing an "mht" file, but the "3D" in the anchor tags (e.g. <a href=3D"[my anchor]">[anchor text]</a>) breaks all the internal links and embedded images. I can have the parser replace "=3D" with just "=" (e.g. <a href="[my anchor]">[anchor text]</a>), and that appears to work fine, but I want to understand the purpose of that "meta markup".

Why does exporting from ".docx" to ".mht" add "3D" to the right-hand side of most (if not all) of the HTML attributes? Is there a better way to handle them, or a better regex to use when replacing them?

Jesse Hautala

1 Answer


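The =3D is a result of quoted-printable encoding, the MIME content transfer encoding used for the HTML part of the ".mht" file. In that encoding "=" is the escape character and is followed by two hexadecimal digits giving a byte value; 3D is the hex code for "=" itself, so every literal "=" in the HTML is written as "=3D". A "=" at the very end of a line is a soft line break that should be removed when decoding, so replacing only "=3D" will miss some cases. It shouldn't be too hard to find a Java library for decoding quoted-printable data.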
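
To illustrate what a decoder has to do beyond swapping "=3D" for "=", here is a minimal sketch in plain Java. The class name `QuotedPrintableSketch` and its `decode` method are made up for this example and are not part of any library; it assumes the encoded text is ASCII (as quoted-printable output should be) and that you know the charset declared for the MIME part. For real use, an existing decoder such as `QuotedPrintableCodec` in Apache Commons Codec or `MimeUtility.decode` in JavaMail is probably the safer choice.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a library API.
final class QuotedPrintableSketch {
    private QuotedPrintableSketch() {}

    // Decodes quoted-printable text into a String using the given charset.
    // Assumption: the input contains only ASCII characters.
    static String decode(String input, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = input.length();
        int i = 0;
        while (i < n) {
            char c = input.charAt(i);
            if (c != '=') {
                // Ordinary character: copy it through unchanged.
                out.write(c);
                i++;
            } else if (i + 1 < n && input.charAt(i + 1) == '\n') {
                // Soft line break ("=" then LF): drop both characters.
                i += 2;
            } else if (i + 2 < n && input.charAt(i + 1) == '\r'
                    && input.charAt(i + 2) == '\n') {
                // Soft line break ("=" then CRLF): drop all three characters.
                i += 3;
            } else if (i + 2 < n
                    && Character.digit(input.charAt(i + 1), 16) >= 0
                    && Character.digit(input.charAt(i + 2), 16) >= 0) {
                // "=XX": two hex digits encode one byte, e.g. "=3D" is 0x3D, i.e. '='.
                int hi = Character.digit(input.charAt(i + 1), 16);
                int lo = Character.digit(input.charAt(i + 2), 16);
                out.write((hi << 4) | lo);
                i += 3;
            } else {
                // Malformed escape: keep the '=' as-is rather than fail.
                out.write('=');
                i++;
            }
        }
        // Bytes are collected first so multi-byte sequences decode correctly.
        return new String(out.toByteArray(), charset);
    }
}
```

With this sketch, `QuotedPrintableSketch.decode("<a href=3D\"#top\">", StandardCharsets.UTF_8)` gives back `<a href="#top">`, and a line ending in "=" is joined to the following line instead of leaving a stray "=" behind.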

Geoff Reedy
  • Specifically, the regex I described in my question replaces a quoted printable encoding of '=' with '='. I'm going to look into decoding other quoted printable encoded characters. Thanks again! – Jesse Hautala Aug 29 '12 at 16:58