
I have an Outlook add-in that takes a MailItem and saves its attachments and HTML content so they can be viewed as a web page. The problem is that Outlook appends two sets of hex codes to every attachment reference. Here's an example:

<img width=700 height=119 id="_x0000_i1032" src="http://somesite/img/didyouknow/image001.jpg@01CD34FA.041E5EE0" alt="diduknow_header.gif">

What would be the cleanest way to remove the @01CD34FA.041E5EE0 suffix from the src above, for all images?

Marshall

2 Answers

0

Simple: since you're getting a full XML document from Outlook, load it into an XmlDocument first:

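XmlDocument xmlDoc = new XmlDocument(); // requires: using System.Xml;
xmlDoc.LoadXml(html); // throws XmlException if the markup is not well-formed XML
// I'm just guessing at the structure here without the full XML
foreach (XmlElement img in xmlDoc.GetElementsByTagName("img"))
{
    string imgsrc = img.GetAttribute("src");
    int at = imgsrc.LastIndexOf('@');
    if (at >= 0) img.SetAttribute("src", imgsrc.Substring(0, at)); // leave srcs without '@' alone
}

The index check matters: without it, Substring throws an ArgumentOutOfRangeException whenever there's no @ sign in the string, because LastIndexOf returns -1.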

Mataniko
  • I was hoping for something that could handle an entire html document. – Marshall Dec 04 '12 at 21:22
  • You can load the html into an XmlDocument and locate it easily with: string imgsrc = xmlDoc["img"].Attributes["src"].InnerText; – Mataniko Dec 04 '12 at 21:23
  • I thought about that but it seems really brittle, should I be trusting Outlook/the user sending these emails to be generating html valid enough to go this route? – Marshall Dec 04 '12 at 21:26
  • It's not clear by your question where the XML is coming from, but if outlook is generating these, you should be pretty confident that they have a rigid scheme that will be consistent. – Mataniko Dec 04 '12 at 21:28
  • You can always use a faster xmlreader and iterate through it to find image elements and process that, but this shouldn't be a major improvement over smaller XML documents – Mataniko Dec 04 '12 at 21:29
  • Yeah, I don't think it's gonna fly. This is a sample of what Outlook generates... http://pastebin.com/R0GEAsLg I really think RegEx is probably the way to go. – Marshall Dec 04 '12 at 21:34
  • You can use the following regex to get all the IMG elements: () I would then just do a simple string replacement otherwise I believe you need lookaheads to isolate just the src="" part. (I'm assuming it could appear somewhere else in the document not just an img tag) – Mataniko Dec 04 '12 at 22:08
  • For some extra info: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Mataniko Dec 04 '12 at 22:12
0

Try searching for this pattern:

(src\=\".*?\.jpg)([^\"]+)(\")

And replace with

$1$3

In code it'd be:

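// requires: using System.IO; using System.Text.RegularExpressions;
string input = File.ReadAllText("path/to/the/outlook.mess");
// Group 1: src=" up to .jpg; group 2: the trailing suffix to drop; group 3: the closing quote
string pattern = @"(src\=\"".*?\.jpg)([^\""]+)(\"")";
string cleanOutput = Regex.Replace(input, pattern, "$1$3");
File.WriteAllText("/path/to/the/outlook.clean", cleanOutput);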

Note that in a verbatim (@-quoted) string, a double quote has to be written twice ("") to produce a single double-quote character.
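
The pattern above is tied to .jpg, and its `.*?` can run past the closing quote of the src attribute into a later attribute that happens to contain ".jpg". As a rough sketch of a variant that anchors on the @ instead, the following assumes the src values are double-quoted and that the suffix always looks like two hex runs separated by a dot, as in the question's example. The names `SrcCleaner` and `StripSuffixes` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SrcCleaner
{
    // Group 1 captures src=" plus everything before the '@'.
    // [^"@]+ cannot cross the closing quote, so the match stays inside one attribute.
    // The lookahead (?=") requires the hex suffix to end right at the closing quote.
    private static readonly Regex HexSuffix = new Regex(
        @"(src\s*=\s*""[^""@]+)@[0-9A-Fa-f]+\.[0-9A-Fa-f]+(?="")",
        RegexOptions.IgnoreCase);

    // Returns the HTML with the @HEX.HEX suffix removed from every matching src.
    public static string StripSuffixes(string html)
    {
        return HexSuffix.Replace(html, "$1");
    }

    public static void Main()
    {
        string html = "<img width=700 src=\"http://somesite/img/didyouknow/image001.jpg@01CD34FA.041E5EE0\" alt=\"diduknow_header.gif\">";
        Console.WriteLine(StripSuffixes(html));
        // prints: <img width=700 src="http://somesite/img/didyouknow/image001.jpg" alt="diduknow_header.gif">
    }
}
```

A src with no @ suffix is left untouched, and so is any src whose suffix does not fit the assumed hex.hex shape, so it is worth checking the output against a few real messages.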

Sina Iravanian