1

I am having an issue with creating an XLSX document with the Open XML SDK 2.0 from MS.

My issue is that I need to display some of these illegal characters in the excel sheet, but if I just add them to the sheet, the document will not load.

I am using this function

    private static string ProcessString(string str)
    {
        return System.Security.SecurityElement.Escape(str);
    }

Which will give me Tom's ball instead of Tom's ball. (Well I haven't figured out how to get the latter as the excel generated won't open.)

Anybody know how to make the illegal XML characters show using OOXML in an Excel sheet?

EDIT:
In function I am using to create a text cell is:

private static Cell CreateTextCell(string header, UInt32 index, string text)
{
    var c = new Cell { DataType = CellValues.String, CellReference = header + index };
    var cellValue = new CellValue(text);
    c.Append(cellValue);
    return c;
}

I know it has to do with illegal characters because when I didn't include a particular field in my text it worked, then when I included it, Excel would give me a parser error and a blank document.

The text that I deal with also happens to have HTML tags in it as well.

ps. lol, I just noticed that the markdown used parsed my HTML escape making my example look ridiculous.

edit 2:

Some example of input:

  • Cancer's Complexity: Are we Looking at the Wrong Levels to Develop Effective Interventions?

  • Prospective study of breast cancer risk in mutation-negative women from <i>BRCA1</i> or <i>BRCA2</i> mutation-positive families in the Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer (kConFab).

  • Germline <em>BRCA2</em> mutations correlate with aggressive prostate cancer and adverse outcome.

The html formatting is basically so it displays on the web page. I should just strip off the basic formatting tags. But more importantly, I want the excel file to load and escaping the values is a sure way of doing just that.

Min
  • 2,975
  • 1
  • 19
  • 24
  • sorry, what is the exact `str` value you are passing in? you can use the `
    your str
    ` tag to edit your post to show what it looks like.
    – Todd Main May 03 '11 at 21:31
  • Please show more of the relevant code. – Samuel Neff May 04 '11 at 15:30
  • can you use post a few examples of the html text you are passing in (what you are passing in, what you are expecting it to be and what is actually coming out the other end of your routine)? I'm having a really hard time understanding what the exact problem is you're dealing with. – Todd Main May 05 '11 at 16:00
  • are you looking to retain any of the HTML formatting like italic or bold or do you just want it gone all together? Another question, for your first example - why aren't you just inserting that directly? It should work fine. – Todd Main May 05 '11 at 16:33
  • I wouldn't even know how to retain any of the html formatting. That would be nice to have, but unnecessary. I'll definitely just try to escape the < and > and see what I get. – Min May 05 '11 at 17:24

3 Answers3

2

Think about that everything contained in e.g. an Excel-file is ultimately persisted as XML. So if you have any invalid XML-chars that you'd like to persist in the file, you need to excape them. It is not really an OOXML issue - it is an XML issue.

So if you have input text like

 Germline <em>BRCA2</em> mutations correlate with aggressive prostate cancer and adverse outcome.

You'd have to convert the angle brackets to

 Germline & lt;em& gt;BRCA2& lt;/em& gt; mutations correlate with aggressive prostate cancer and adverse outcome.

(space added to illustrate the brackets)

See String escape into XML for a few ways to do this.

Also, to see how Microsoft Office does this, add the trouble-some text to an Excel-document and save it. Then use the OpenXml SDK tool (comes with the SDK) to reflect the file and see how it is done.

Community
  • 1
  • 1
Jesper Lund Stocholm
  • 1,973
  • 2
  • 27
  • 49
1

Are you sure this is what is causing the problem? Can you add "normal" strings to the cells and open it?

AFAIK the apostrophe character is not an illegal XML character.

If you look in the OOXML specification in section 22.9.2.19 ST_Xstring (Escaped String) (the data type for strings in cells) you will see the following explanation:

*22.9.2.19 ST_Xstring (Escaped String) String of characters with support for escaped invalid-XML characters. For all characters which cannot be represented in XML as defined by the XML 1.0 specification, the characters are escaped using the Unicode numerical character representation escape character format xHHHH, where H represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it must be escaped as x0008. end example]*

Samuel Neff
  • 73,278
  • 17
  • 138
  • 182
Jesper Lund Stocholm
  • 1,973
  • 2
  • 27
  • 49
  • @Jesper Lund Stocholm: it's Min who asked the question, not Samuel. Good answer though! +1 – Todd Main May 04 '11 at 06:51
  • Hi Otaku, whoops - sorry about that. I just got confirmation about the possible illegality of the apostrophe character, and it is only an invalid XML character if is it put in an attribute value delimited by apostrophe characters themselves. https://twitter.com/#!/al3xbrown/status/65670385496113153 – Jesper Lund Stocholm May 04 '11 at 07:18
  • If he's inserting it proper or as a shared string, it wouldn't be an attribute. I'm just wondering how the `str` value is coming in. If it is actually **Tom's ball**, then I'm not sure why it's even getting escaped at all, it can just go in directly. – Todd Main May 04 '11 at 07:28
  • Hi Otaku, I am thinking the exact same thing. The string should go in without any trouble as it is. – Jesper Lund Stocholm May 04 '11 at 20:17
  • I added some to my posting. The text also happens to use html characters so it's not _just_ an apostrophe. – Min May 05 '11 at 15:31
  • How exactly do I escape a character within a string? \xHHHH? – Min May 05 '11 at 16:28
  • The apostrophe worked fine in the text. The < and > were the ones causing it to die. – Min May 05 '11 at 17:35
1

Another thing to be careful of. XML does not have built-in support for all of the character named entities that we are accustomed to using in HTML. There are a small number of always-understood ones in XML (i.e., &lt; &gt; &amp;). Since there are no DTDs in OOXML, you can't define more named entities. Instead you have to introduce anything else into a string using a character entity (e.g., &#x20; for a space, &#160; for &nbsp; etc.) or else directly enter the Unicode character into the string.

You can use &#034; for &quote; and &#039; for apostrophe if you need to get one of those into an attribute value that is surrounded by the same kind of quote symbol.

There are some Unicode code points that are flat-out forbidden in XML data streams. To insert those codes into the value of a string, OOXML has its own escape mechanism for those codes and those codes only. I don't think that provision is involved in this question.

orcmid
  • 2,618
  • 19
  • 20