
This is a tricky question. I suspect it will require some advanced knowledge of file systems to answer.

I have a WPF application, "App1," targeting .NET Framework 4.0. It has a Settings.settings file that generates a standard App1.exe.config file where default settings are stored. When the user modifies settings, the modifications go in AppData\Roaming\MyCompany\App1\X.X.0.0\user.config. This is all standard .NET behavior. However, on occasion, we've discovered that the user.config file on a customer's machine isn't what it's supposed to be, which causes the application to crash.

The problem looks like this: user.config is about the size it should be if it were filled with XML, but instead of XML it's just a bunch of NUL characters: character 0 repeated over and over again. We have no information about what occurred leading up to this file modification.

[Screenshot: user.config filled with NUL characters in a text editor]

We can fix the problem on a customer's machine by deleting user.config, because the Common Language Runtime will simply generate a new one. They'll lose the changes they've made to the settings, but the changes can be made again.

However, I've encountered this problem in another WPF application, "App2," with another XML file, info.xml. This time it's different because the file is generated by my own code rather than by the CLR. The common themes are that both applications are C# WPF applications, both files are XML, and in both cases we are completely unable to reproduce the problem in our testing. Could this have something to do with the way C# applications interact with XML files, or with files in general?

Not only can we not reproduce the problem in our current applications, but I can't even reproduce it by writing custom code that generates errors on purpose. I can't find a single XML serialization error or file access error that results in a file filled with nulls. So what could be going on?

App1 accesses user.config by calling Upgrade() and Save() and by getting and setting the properties. For example:

if (Settings.Default.UpgradeRequired)
{
    Settings.Default.Upgrade();
    Settings.Default.UpgradeRequired = false;
    Settings.Default.Save();
}

App2 accesses info.xml by serializing and deserializing the XML:

public Info Deserialize(string xmlFile)
{
    if (File.Exists(xmlFile) == false)
    {
        return null;
    }

    XmlSerializer xmlReadSerializer = new XmlSerializer(typeof(Info));

    Info overview = null;

    using (StreamReader file = new StreamReader(xmlFile))
    {
        overview = (Info)xmlReadSerializer.Deserialize(file);
        file.Close();
    }

    return overview;
}

public void Serialize(Info infoObject, string fileName)
{
    XmlSerializer writer = new XmlSerializer(typeof(Info));

    using (StreamWriter fileWrite = new StreamWriter(fileName))
    {
        writer.Serialize(fileWrite, infoObject);
        fileWrite.Close();
    }
}

We've encountered the problem on both Windows 7 and Windows 10. When researching the problem, I came across this post where the same XML problem was encountered in Windows 8.1: Saved files sometime only contains NUL-characters

Is there something I could change in my code to prevent this, or is the problem too deep within the behavior of .NET?

It seems to me that there are three possibilities:

  1. The CLR is writing null characters to the XML files.
  2. The file's memory address pointer gets switched to another location without moving the file contents.
  3. The file system attempts to move the file to another memory address and the file contents get moved but the pointer doesn't get updated.

I feel like 2 and 3 are more likely than 1. This is why I said it may require advanced knowledge of file systems.

I would greatly appreciate any information that might help me reproduce, fix, or work around the problem. Thank you!

Kyle Delaney
  • Maybe there is power loss (like when you forcibly shut down the computer) at the moment of writing that file? In such a case I think it's possible to have a situation like yours. – Evk Mar 13 '18 at 16:01
  • I would replace the using statements with try/catch and save the results into a log file. The using statement hides the exception, so you do not know that an exception occurred and the code continues like nothing ever went wrong. – jdweng Mar 13 '18 at 16:01
  • @jdweng While I certainly should try to gather diagnostic data with try/catch, I don't believe the `using` statement suppresses exceptions. I can generate exceptions within `using` blocks just fine. – Kyle Delaney Mar 13 '18 at 16:18
  • @Evk I suppose it is possible for power loss during file operations to corrupt files, but writing to these files is infrequent and takes a millisecond. I'd be astonished if even one case of that could occur, and we've seen 10+ cases. – Kyle Delaney Mar 13 '18 at 16:29
  • But does the user get a pop-up? Where do the results go? – jdweng Mar 13 '18 at 16:54
  • @jdweng The application does have an error logging system where error logs are kept in a local database which regularly gets uploaded to our server. I may have to put in special error handling for this case, though. – Kyle Delaney Mar 13 '18 at 19:32
  • Although it is possible there is some hideous bug in the CLR that causes this problem, the logic involved is quite simple, and the impact is sufficiently big that you'd expect such a bug to have been discovered and fixed by now (although that is of course no hard guarantee). My money is on file corruption caused by bad file system filter drivers. Ask your customer what kind of antivirus/anti-malware software is installed. Also, ask if they are using true roaming profiles, of the kind that gets uploaded to the network and transferred across machines -- that's obviously another point of failure. – Jeroen Mostert Mar 14 '18 at 13:20
  • How do you detect the condition? Because I'm wondering if something created a sparse file somehow. See https://msdn.microsoft.com/en-us/library/windows/desktop/aa365276(v=vs.85).aspx If you *are* getting sparse files, that might give you some indication of what the cause might be. Any chance it's a race condition between multiple threads or processes? – Andrew Henle Mar 14 '18 at 19:55
  • I have a similar problem, and I posted a question about it here: https://stackoverflow.com/questions/49269579/encrypt-aescryptoserviceprovider-return-zero-byte-array Hope we can find the solution. – TTGroup Mar 15 '18 at 06:25
  • @AndrewHenle Thanks for the suggestion. I ran a test and `GetFileSize` and `GetCompressedFileSize` return the same value, so that indicates that it's not a sparse file. – Kyle Delaney Mar 15 '18 at 18:05
  • So much for that... Add code to log everything you do with these files. Log the write, then read the data back and make sure it's correct. Log that the data is correct. That will at least isolate where and maybe when it's happening. I'm assuming you're already logging and/or reporting whenever you find a corrupt file. – Andrew Henle Mar 15 '18 at 19:00
  • Check whether the disk has write caching configured: https://www.tenforums.com/tutorials/21904-enable-disable-disk-write-caching-windows-10-a.html – Simon Mourier Mar 16 '18 at 06:06
  • I have a problem similar to this, not sure if it's exactly the same. We store state information in an XML file on the local hard disk. The exception we get is that .NET can't read the XML file because the first character is null. I have statistics which indicate that this is occurring for us approximately one in a million times. – Richardissimo Mar 16 '18 at 18:07
  • I have the same problem. I even write the xml as a temp file, read the file back to verify that I have the values in the xml, and then rename the file. I still get corrupted xml files now and then, and I'm leaning towards this being a VM / Windows issue. – Archlight Mar 20 '18 at 14:04
  • Just as an additional note: exactly the same behavior is happening at some of our customers where the xml file is serialized using XmlSerializer. Mostly happening/happened on 2003, seldom on 2008, and rarely on 2012... There was no system event error or disk corruption (hardware RAIDs from several vendors). Currently we are taking a closer look at an unexpected system shutdown... – dsdel Mar 20 '18 at 20:15
  • Try disabling any background service that accesses your files when they are written. I had exactly this behavior due to a backup program running. – Anders Forsgren Apr 02 '18 at 07:33
  • @AndersForsgren, that does seem consistent with the idea that this has to do with memory address pointers being mishandled by the file system. – Kyle Delaney Apr 02 '18 at 19:22
  • For the record, I have now managed to get hold of one of these files from a user's machine, and I can confirm that the entire file is completely filled with nulls. (The exception we get is `System.Xml.XmlException: '.', hexadecimal value 0x00, is an invalid character. Line 1, position 1.`, and the file this is occurring on is just an XML file, not specifically a config file.) – Richardissimo Apr 11 '18 at 14:15

5 Answers


It's well known that this can happen if there is power loss. It occurs when a cached write extends a file (either a new or an existing file) and power is lost shortly thereafter. In this scenario the file has three expected possible states when the machine comes back up:

1) The file doesn't exist at all or has its original length, as if the write never happened.

2) The file has the expected length as if the write happened, but the data is zeros.

3) The file has the expected length and the correct data that was written.

State 2 is what you are describing. It occurs because when you do the cached write, NTFS initially just extends the file size accordingly but leaves VDL (valid data length) untouched. Data beyond VDL always reads back as zeros. The data you were intending to write is sitting in memory in the file cache. It will eventually get written to disk, usually within a few seconds, and following that VDL will get advanced on disk to reflect the data written. If power loss occurs before the data is written or before VDL gets increased, you will end up in state 2.

This is fairly easy to repro, for example by copying a file (the copy engine uses cached writes), and then immediately pulling the power plug on your computer.
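
One common mitigation is to write the new content to a temporary file, flush it all the way to disk, and only then swap it in place of the old file, so that after a power loss the file is far more likely to hold either the old contents or the new contents rather than zeros. The following is only a rough sketch of that pattern, not code from the question; it reuses the question's Info type and XmlSerializer, and the SaveInfoDurably name is made up.

// Requires System.IO and System.Xml.Serialization.
public void SaveInfoDurably(Info infoObject, string fileName)
{
    var serializer = new XmlSerializer(typeof(Info));
    string tempFile = fileName + ".tmp";

    // Write to a temporary file and push the data through the OS cache to disk.
    using (var stream = new FileStream(tempFile, FileMode.Create, FileAccess.Write,
                                       FileShare.None, 4096, FileOptions.WriteThrough))
    {
        serializer.Serialize(stream, infoObject);
        stream.Flush(true); // flush any remaining buffered data to the physical disk
    }

    // Swap the new file in; the destination ends up with either the old or the new content.
    if (File.Exists(fileName))
    {
        File.Replace(tempFile, fileName, null);
    }
    else
    {
        File.Move(tempFile, fileName);
    }
}

This trades some write performance (write-through plus the extra flush) for a much smaller window in which a power loss can leave a zero-filled file behind.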

Craig Barkhouse
  • Thanks for this: I'm very interested. Please can you link to any sources for this information? (Clearly, this information corresponds with Beastwood's answer and dsdel's comment.) – Richardissimo Oct 11 '18 at 05:42
  • Excellent answer. I appreciate your knowledge of file systems. – Kyle Delaney Oct 11 '18 at 16:12
  • Sorry I don't know offhand what source I could link to that describes these interactions. I'm an NTFS developer at Microsoft, so I'm just describing how it works from first hand knowledge. – Craig Barkhouse Oct 11 '18 at 17:26
  • Could this also happen with a planned reboot, e.g. for a Windows update? See my investigation: https://superuser.com/a/1402396/13089 – BlackShift Feb 05 '19 at 20:48
  • This is probably why we're seeing a similar problem in production. And why it happens with both the freshly created backup-file and the newly created file. – Håkon K. Olafsen Jun 11 '19 at 12:57
  • I thought journalling file systems (to my knowledge, NTFS is a journalling file system) were invented to prevent exactly this from happening. Shouldn't it replay the journal on the next boot and fix this problem? – jeyk Dec 07 '22 at 14:43
  • Metadata operations are journalled (e.g. you rename a file, or set an attribute on a file, etc.). However, data writes are not journalled, on any journalling file system that I know of. The performance considerations make this completely unfeasible. – Craig Barkhouse Dec 08 '22 at 20:59

I had a similar problem and I was able to trace it to a corrupted HDD.

Description of my problem (all relevant information):

  • Disks attached to the mainboard (SATA):

    • SSD (system),

    • 3 * HDD.

      One of the HDDs had bad blocks, and there were even problems reading the disk structure (directories and file listings).

  • Operating system: Windows 7 x64

  • File system (on all disks): NTFS

When the system tried to read from or write to the corrupted disk (because of a user request, an automatic scan, or any other reason) and the attempt failed, subsequent write operations (to the other disks) went wrong. The files created on the system disk (mostly configuration files written by other applications) appeared to be written and valid when their content was checked directly (probably because the files were cached in RAM).

Unfortunately, after a restart, all the files written after the failed read/write access to the corrupted drive had the correct size, but their content was all zero bytes (exactly like in your case).

Try to rule out hardware-related problems. You can copy the file (after a change) to a different machine (upload it via web/FTP), or save specific, known content to a fixed file. If the copy on the other machine is correct, or if the fixed-content file ends up 'empty', the cause is probably on the local machine. In that case, try changing hardware components or reinstalling the system.
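
A rough sketch of that fixed-content check (purely illustrative; the WriteCheck class and its method names are made up, and the path and logging should be adapted to the application): write a small marker file after each save and verify it on the next start-up.

using System.IO;
using System.Text;

public static class WriteCheck
{
    private const string ExpectedContent = "write-check-ok";

    // Call this after each successful save of the application's own data.
    public static void WriteMarker(string path)
    {
        File.WriteAllText(path, ExpectedContent, Encoding.UTF8);
    }

    // Check this on the next start-up. If the marker comes back as NULs (or anything
    // else unexpected), the corruption happens on this machine, independently of the
    // application's own data files.
    public static bool MarkerLooksCorrupt(string path)
    {
        if (!File.Exists(path))
        {
            return false; // first run, nothing to verify yet
        }

        return File.ReadAllText(path, Encoding.UTF8) != ExpectedContent;
    }
}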

Julo
  • By "reinstall the system" do you mean the operating system? – Kyle Delaney Mar 20 '18 at 15:11
  • Problem is, they can't tell each and every client that encounters the problem to 'change HW components or reinstall the system'. If I had been a client who'd been told such a thing, I'd get mad. This is why it is better to delete the file than to tell the clients they need to fix their broken computer... – Barr J Mar 21 '18 at 05:49
  • **Kyle Delaney**: yes, a system reinstall; in my case it was necessary (when the corrupted disk was still the system disk in a different computer with Windows XP), but the problems that drive caused were different. @BarrJ: I know, but when there is a real HW problem that cannot be masked (unless some sort of external storage is used, e.g. FTP or SQL on a different computer), this remains the only option, and even then the problem is only masked for a single application. Of course, this only applies when the problem is caused by HW (that can be checked, see my post). – Julo Mar 21 '18 at 08:07
  • Still, trying to explain to the customer that they need to reinstall their entire system, or worse, buy new HW for their computer just because the file that YOUR program generates malfunctions, is not ideal. If the OP says that deleting the file helps, better to just delete it and call Upgrade, which will restore the recently updated file. – Barr J Mar 21 '18 at 08:36
  • When the problem is corrupted HW (or OS), there are only two options: mask the problem for a single application (e.g. using web/FTP/SMB), or accept the loss of data (settings) for each application that stores data locally. If the client has no problem accepting this loss of data, there is no reason to reinstall the system or buy new HW; simply mask the error using any method available. – Julo Mar 21 '18 at 08:43
  • I had a similar problem. I tried many ways to reproduce it, but it never happens on my computer, and I still cannot find the root cause. I want to know how you could reproduce this bug and how you are sure that the problem happened as you described above. Many thanks :) – TTGroup May 24 '18 at 07:00
  • I found out that it was this problem only a long time later (after the first occurrence). My old system HDD had a "log" directory with many, really many overwrites of a log file, and this file was one of the files that was often empty. I found the cause only after changing computers (XP -> 7 on SSD); after this change the error occurred only when I accessed specific files/folders on the old system HDD, and files on the SSD were zeroed. How to simulate: get an HDD with bad blocks (read/write problems) and connect it to the system, try to read files, and after a read error, write some file to another disk. – Julo May 24 '18 at 10:39
  • @Julo: Thank you very much :) – TTGroup May 25 '18 at 03:19

There is no documented reason for this behavior; it happens to users, but nobody can tell the origin of these odd conditions.

It might be a CLR problem, although that is very unlikely: the CLR doesn't just write null characters, and an XML document cannot contain null characters if there's no xsi:nil defined for the nodes.

Anyway, the only documented way to fix this is to delete the corrupted file, using code like this:

try
{
    ConfigurationManager.OpenExeConfiguration(ConfigurationUserLevel.PerUserRoamingAndLocal);
}
catch (ConfigurationErrorsException ex)
{
    string filename = ex.Filename;
    _logger.Error(ex, "Cannot open config file");

    if (File.Exists(filename) == true)
    {
        _logger.Error("Config file {0} content:\n{1}", filename, File.ReadAllText(filename));
        File.Delete(filename);
        _logger.Error("Config file deleted");
        Properties.Settings.Default.Upgrade();
        // Properties.Settings.Default.Reload();
        // you could optionally restart the app instead
    }
    else
    {
        _logger.Error("Config file {0} does not exist", filename);
    }
}

This restores user.config via Properties.Settings.Default.Upgrade(), again without the null values.

Barr J
  • I use the same method, but I save a backup of the configuration every x hours and then use that when I hit a corrupted xml file. – Archlight Mar 20 '18 at 14:07
  • I'm thinking I'll mark your answer as correct and give Julo the bounty because he seems to be able to reproduce the problem. And he also has less reputation. :) – Kyle Delaney Mar 21 '18 at 13:05
  • Then this is agreed :) – Barr J Mar 21 '18 at 13:32

I ran into a similar issue, but it was on a server. The server restarted while a program was writing to a file, which caused the file to contain all null characters and become unusable to the program writing to and reading from it.

So the file looked like this: [screenshot of the file filled with NUL characters]

The logs showed that the server restarted: [screenshot of the event log entry for the unexpected restart]

The corrupted file showed that it was last updated at the time of the restart: [screenshot of the file's last-modified timestamp]

Beastwood

I have the same problem; there is an extra NUL character at the end of the serialized xml file: [screenshot showing the trailing NUL character]

I am using XmlWriter like this:

using (var stringWriter = new Utf8StringWriter())
{
    using (var xmlWriter = XmlWriter.Create(stringWriter, new XmlWriterSettings { Indent = true, IndentChars = "\t", NewLineChars = "\r\n", NewLineHandling = NewLineHandling.Replace }))
    {
        xmlSerializer.Serialize(xmlWriter, data, nameSpaces);
        xml = stringWriter.ToString();
        var xmlDocument = new XmlDocument();
        xmlDocument.LoadXml(xml);
        if (removeEmptyNodes)
        {
            RemoveEmptyNodes(xmlDocument);
        }
        xml = xmlDocument.InnerXml;
    }
}
  • I would have thought it sensible to avoid trying to use the result of the writer until the end of the `using` block. Also it isn't clear from the snippet which one ends up in the file in the screenshot; but if that file was created from `xml = xmlDocument.InnerXml;` then I think you've missed the point of the XmlWriter. – Richardissimo Nov 11 '18 at 09:14