
When the string `"<?xml version"` is written to a file via `fwrite`, subsequent write operations become slower.

This code:

#include <cstdio>
#include <ctime>
#include <iostream>

int main()
{
    const long index(15000000);

    // Operation 1: file starts with the full XML header "<?xml version".
    clock_t start_time(clock());
    FILE* file_stream1 = fopen("test1.txt", "wb");
    fwrite("<?xml version", 1, 13, file_stream1);
    for (long i = 1; i < index; ++i)
        fwrite("only 6", 1, 6, file_stream1);
    fclose(file_stream1);

    std::cout << "\nOperation 1 took : "
        << static_cast<double>(clock() - start_time) / CLOCKS_PER_SEC
        << " seconds.";

    // Operation 2: identical, except the header's last byte is changed.
    start_time = clock();
    FILE* file_stream2 = fopen("test2.txt", "wb");
    fwrite("<?xml versioX", 1, 13, file_stream2);
    for (long i = 1; i < index; ++i)
        fwrite("only 6", 1, 6, file_stream2);
    fclose(file_stream2);

    std::cout << "\nOperation 2 took : "
        << static_cast<double>(clock() - start_time) / CLOCKS_PER_SEC
        << " seconds.";

    // Operation 3: writes the 13-byte modified header on every iteration.
    start_time = clock();
    FILE* file_stream3 = fopen("test3.txt", "w");
    const char test_str3[] = "<?xml versioX";
    for (long i = 1; i < index; ++i)
        fwrite(test_str3, 1, 13, file_stream3);
    fclose(file_stream3);

    std::cout << "\nOperation 3 took : "
        << static_cast<double>(clock() - start_time) / CLOCKS_PER_SEC
        << " seconds.\n";

    return 0;
}

gives me this result:

Operation 1 took : 3.185 seconds.
Operation 2 took : 2.025 seconds.
Operation 3 took : 2.992 seconds.

That is, when we replace the string `"<?xml version"` (operation 1) with `"<?xml versioX"` (operation 2), the result is significantly faster. The third operation is as fast as the first, even though it writes roughly twice as many characters per iteration.

Can anyone reproduce this?

Windows 7, 32bit, MSVC 2010

EDIT 1

As R.. suggested, disabling Microsoft Security Essentials restores normal behavior.

anno
    Perhaps you have anti-virus software that's hooked all file operations and which kicks in at this point... – R.. GitHub STOP HELPING ICE May 07 '11 at 23:37
  • Have you tried switching the order of the writes? I wouldn't be surprised if it was just that the first write takes longer. – David Brown May 07 '11 at 23:40
  • R.., disabling Microsoft Security Essentials restores normal behavior. Would you care to elaborate and post an answer? – anno May 07 '11 at 23:44
  • That's pretty awful... a workaround is closing the stream after " – anno May 07 '11 at 23:50
  • The workaround is invalid. The second `fopen` call truncates the file and obliterates the XML header. You'd need to use `"rb+"` rather than `"wb"`, and then the AV will probably kick right back in... You could perhaps write `XXXXXXXXX` at the beginning of the file, then seek back and fix the header once you've finished writing the rest... – R.. GitHub STOP HELPING ICE May 07 '11 at 23:57
  • Yes, typing without thinking :( – anno May 08 '11 at 00:02
  • I've tried the "XXXXXXXXX" trick, same result :( – anno May 08 '11 at 00:35
  • All the time is lost in fclose(file_stream); – anno May 08 '11 at 01:08
  • I would file a bug report with Microsoft. You might be able to tweak the options for when/which files are subject to scanning, but this won't help if you're deploying your software to users whose AV config is outside your control (except that you could mention the bug and workarounds in your release notes/documentation). The other alternative would be to drop XML... – R.. GitHub STOP HELPING ICE May 08 '11 at 01:31

1 Answer


On Windows, most (all?) anti-virus software works by hooking into file read and/or write operations, running the data being read or written against virus patterns, and classifying it as safe or malicious. I suspect your anti-virus software, once it sees an XML header, loads up its XML-malware virus patterns and from that point on constantly checks whether the XML you're writing to disk is part of a known virus.

Of course this behavior is utterly nonsensical, and it's part of what gives AV programs such a bad reputation with competent users, who see their performance plummet as soon as they turn AV on. The same goal could be accomplished in other ways that don't ruin performance. Here are some ideas they should be using:

  • Only scan files once at transitions between writing and reading, not after every write. Even if you did write a virus to disk, it doesn't become a threat until it subsequently gets read by some process.
  • Once a file is scanned, remember that it's safe and don't scan it again until it's modified.
  • Only scan files that are executable programs or that are detected as being used as script/program-like data by another program.

Unfortunately I don't know of any workaround until AV software makers wise up, other than turning your AV off... which is generally a bad idea on Windows.

R.. GitHub STOP HELPING ICE
  • Wow, nice catch. I'd have never thought of an antivirus behavior for this. Then again, I don't use an antivirus, so... – user541686 May 07 '11 at 23:59
  • Most (if not all) antivirus software only scans files on open or close; I would be surprised if anybody actually did it on individual reads or writes. The problem with not scanning files again is that there may not be a definition for it yet; today it may be unknown, but tomorrow it may be bad. As for only scanning executable files, you still have to examine the file to determine whether or not it is an executable. None of these problems are insurmountable, but as with all software there is a tradeoff; in this case it is between performance and security. – Luke May 08 '11 at 03:35
  • The cached flag that a file has already been scanned need only include the version number of the virus patterns it was scanned against. And the *transition between writing and reading* is the key point of my post. Only scanning on the first read after a write, rather than on all reads or all writes, is the key to fixing performance while retaining the full security benefits of scanning. – R.. GitHub STOP HELPING ICE May 08 '11 at 03:51
  • The problem with caching is that definition updates invalidate your cache; with most antivirus companies releasing definition updates hourly, caching isn't going to buy you a whole lot (I guess an hour is better than nothing, though). I don't see how scanning only on the first read after write helps, though. Whether it happens on the write or the first read after the write, the file's still going to be scanned and the result cached (subsequent scans would get the cached result). Seems like it would only help for files which are written often but not read often (e.g. log files). – Luke May 09 '11 at 01:33
  • I don't know how often they're published, but most systems are set to download updated "virus definitions" at most once a day or once a week. In any case, my proposed solution would help equally well for files that are read often, as long as they're not both written and read often. – R.. GitHub STOP HELPING ICE May 09 '11 at 22:44