Replacing Invalid XML characters from an excel file and writing it back to disk causes file is corrupted error on opening in MS Excel

Question

A little background on problem:

We have an ASP.NET MVC5 Application where we use FlexMonster to show the data in grid. The data source is a stored procedure that brings all the data into the UI grid, and once user clicks on export button, it exports the report to Excel. However, in some cases export to excel is failing. Some of the data has some invalid characters, and it is not possible/feasible to fix the source as suggested here

My approach so far:

EPPlus library fails on initializing the workbook as the input excel file contains some invalid XML characters. I could find that the file is dumped with some invalid character in it. I looked into the possible approaches .

Firstly, I identified the problematic character in the excel file. I first tried to replace the invalid character with blank space manually using Notepad++ and the EPPlus could successfully read the file.

Now using the approaches given in other SO thread here and here, I replaced all possible occurrences of invalid chars. I am using at the moment

XmlConvert.IsXmlChar

method to find out the problematic XML character and replacing with blank space.

I created a sample program where I am trying to work on the problematic excel sheet.

//in main method 
String readFile = File.ReadAllText(filePath);
string content = RemoveInvalidXmlChars(readFile);
File.WriteAllText(filePath, content);

//removal of invalid characters
        static string RemoveInvalidXmlChars(string inputText)  
        {
            StringBuilder withoutInvalidXmlCharsBuilder = new StringBuilder();
            int firstOccurenceOfRealData = inputText.IndexOf("<t>");
            int lastOccurenceOfRealData = inputText.LastIndexOf("</t>");

            if (firstOccurenceOfRealData < 0 ||
                lastOccurenceOfRealData < 0 ||
                firstOccurenceOfRealData > lastOccurenceOfRealData)
                return inputText;

            withoutInvalidXmlCharsBuilder.Append(inputText.Substring(0, firstOccurenceOfRealData)); 
            int remaining = lastOccurenceOfRealData - firstOccurenceOfRealData;
            string textToCheckFor = inputText.Substring(firstOccurenceOfRealData, remaining); 

            foreach (char c in textToCheckFor)
            {
                withoutInvalidXmlCharsBuilder.Append((XmlConvert.IsXmlChar(c)) ? c : ' ');
            }
      withoutInvalidXmlCharsBuilder.Append(inputText.Substring(lastOccurenceOfRealData));

            return withoutInvalidXmlCharsBuilder.ToString();

        }

If I replaces the problematic character manually using notepad++, then the file opens fine in MSExcel. The above mentioned code successfully replaces the same invalid character and writes the content back to the file. However, when I try to open the excel file using MS Excel, it throws an error saying that file may have been corrupted and no content is displayed (snapshots below). Moreover, Following code

var excelPackage = new ExcelPackage(new FileInfo(filePath));

on the file that I updated via Notepad++, throws following exception

"CRC error: the file being extracted appears to be corrupted. Expected 0x7478AABE, Actual 0xE9191E00"}

My Questions:

Is my approach to modify content this way correct?
If yes, How can I write updated string to an Excel file?
If my approach is wrong then, How can I proceed to get rid of invalid XML chars?

Errors shown on opening file (without invalid XML char):

First Pop up

When I click on yes

Thanks in advance !

`it exports the report to Excel.` are you sure? Or is it a CSV/HTML file with a fake extension? Did you check the actual file contents? An `xlsx` file is a zip package that contains XML files. You can't just edit it with a text editor — Panagiotis Kanavos, May 31 '19 at 09:37
Which means, if that really is a zip package with XML files, you can open it with ZipPackage and modify the files directly. If you can't open them using XmlSerialider or XDocument you can treat them as text files — Panagiotis Kanavos, May 31 '19 at 09:39
This problematic file is a file posted back to the server in HttpPostedFileBase object input stream. The server saves this file and then using EPPlus adds another workbook sheet to the final Excel file. As per FlexMonster documentation, it says that exportTo method exports it to excel File. We use flex monster to export the grid data to an excel file — kuldeep, May 31 '19 at 09:43
That's not what I asked. What does that file *actually* contain? If you can open it with a text editor it's not an Excel file. A *lot* of sites and libraries generate fake Excel files by creating an HTML or CSV file with an `xls` or `xlsx` extension. The supported Excel file format, `xlsx`, is a zip package, which means you should be able to open it with the compression utility of your choice, eg WinRar, 7z, and even extract its contents. The old `xls` format was deprecated 12 years ago — Panagiotis Kanavos, May 31 '19 at 09:47
Assuming the file is a real `xlsx` file, it means you can just open it with the [ZipArchive](https://learn.microsoft.com/en-us/dotnet/api/system.io.compression.ziparchive?view=netframework-4.8) class and edit its contents. You can access individual files through the [ZipArchive.Entries](https://learn.microsoft.com/en-us/dotnet/api/system.io.compression.ziparchive.entries?view=netframework-4.8#System_IO_Compression_ZipArchive_Entries) property — Panagiotis Kanavos, May 31 '19 at 09:48
After that you can open a stream on the file with [ZipArchiveEntry.Open](https://learn.microsoft.com/en-us/dotnet/api/system.io.compression.ziparchiveentry.open?view=netframework-4.8#System_IO_Compression_ZipArchiveEntry_Open) — Panagiotis Kanavos, May 31 '19 at 09:51
What i notice is that it has lot of characters such as PKETXOT, NULNUL NUL, and along with them it has different xmls such as xl/worksheets/sheet1.xml, xl/sharedStrings.xml and then content. — kuldeep, May 31 '19 at 09:51

score 1 · Accepted Answer · answered Jun 01 '19 at 00:49

It does sounds like a binary (presumable XLSX) file based on your last comment. To confirm, open the file created by the FlexMonster with 7zip. If it opens properly and you see a bunch of XML files in folders, its a XLSX.

In that case, a search/replace on a binary file sounds like a very bad idea. It might work on the XML parts but might also replace legit chars in other parts. I think the better approach would be to do as @PanagiotisKanavos suggests and use ZipArchive. But you have to do rebuild it in the right order otherwise Excel complains. Similar to how it was done here https://stackoverflow.com/a/33312038/1324284, you could do something like this:

public static void ReplaceXmlString(this ZipArchive xlsxZip, FileInfo outFile, string oldString, string newstring)
{
    using (var outStream = outFile.Open(FileMode.Create, FileAccess.ReadWrite))
    using (var copiedzip = new ZipArchive(outStream, ZipArchiveMode.Update))
    {
        //Go though each file in the zip one by one and copy over to the new file - entries need to be in order
        foreach (var entry in xlsxZip.Entries)
        {
            var newentry = copiedzip.CreateEntry(entry.FullName);
            var newstream = newentry.Open();
            var orgstream = entry.Open();

            //Copy non-xml files over
            if (!entry.Name.EndsWith(".xml"))
            {
                orgstream.CopyTo(newstream);
            }
            else
            {
                //Load the xml document to manipulate
                var xdoc = new XmlDocument();
                xdoc.Load(orgstream);

                var xml = xdoc.OuterXml.Replace(oldString, newstring);
                xdoc = new XmlDocument();
                xdoc.LoadXml(xml);

                xdoc.Save(newstream);
            }

            orgstream.Close();
            newstream.Flush();
            newstream.Close();
        }
    }
}

When it is used like this:

[TestMethod]
public void ReplaceXmlTest()
{
    var datatable = new DataTable("tblData");
    datatable.Columns.AddRange(new[]
    {
        new DataColumn("Col1", typeof (int)),
        new DataColumn("Col2", typeof (int)),
        new DataColumn("Col3", typeof (string))
    });

    for (var i = 0; i < 10; i++)
    {
        var row = datatable.NewRow();
        row[0] = i;
        row[1] = i * 10;
        row[2] = i % 2 == 0 ? "ABCD" : "AXCD";
        datatable.Rows.Add(row);
    }

    using (var pck = new ExcelPackage())
    {
        var workbook = pck.Workbook;
        var worksheet = workbook.Worksheets.Add("source");

        worksheet.Cells.LoadFromDataTable(datatable, true);
        worksheet.Tables.Add(worksheet.Cells["A1:C11"], "Table1");

        //Now similulate the copy/open of the excel file into a zip archive
        using (var orginalzip = new ZipArchive(new MemoryStream(pck.GetAsByteArray()), ZipArchiveMode.Read))
        {
            var fi = new FileInfo(@"c:\temp\ReplaceXmlTest.xlsx");
            if (fi.Exists)
                fi.Delete();

            orginalzip.ReplaceXmlString(fi, "AXCD", "REPLACED!!");
        }
    }
}

Gives this:

Just keep in mind that this is completely brute force. Anything you can do to make the file filter smarter rather then simply doing ALL xml files would be a very good thing. Maybe limit it to the SharedString.xml file if that is where the problem lies or in the xml files in the worksheet folders. Hard to say without knowing more about the data.

Thank you. Based on @PanagiotisKanavos and your suggestions, I could successfully replace the invalid XML chars using ZipArchive. However, i was also able to fix it at the data access level, by parsing through the data set. Nevertheless, something new I learned as I had no idea how internals of excel works ! Thanks a lot again ! — kuldeep, Jun 03 '19 at 13:20
@kuldeep Glad to hear it. Fixing at the data level is a MUCH safer approach :) — Ernie S, Jun 03 '19 at 15:27

Replacing Invalid XML characters from an excel file and writing it back to disk causes file is corrupted error on opening in MS Excel

1 Answers1

Linked