
Question: What is the best way to parse files that are missing the newline character at the end of the file? Should I just wrap the read in a try/catch for OutOfMemoryException, or is there a better way?

Background: I am parsing log files using StreamReader's ReadLine() method to read the next line. The basic loop structure looks like this:

while ((line = sr.ReadLine()) != null)
{
      // Parse the file
}

This works well, even on large files (i.e., > 2 GB). But when the next line is not null and does not contain a newline character, StreamReader just reads blank spaces until all memory is consumed and an OutOfMemoryException is thrown. Is this the best way to handle a missing newline character at the end of the file? Or are there better ways of handling this problem?
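To protect against this, one idea I am considering (just a sketch, with arbitrary buffer size and line-length cap) is to read in fixed-size chunks with StreamReader.Read and fail fast if a single "line" grows past a cap, instead of letting ReadLine consume all memory:

```csharp
using System;
using System.IO;
using System.Text;

class BoundedLineReader
{
    // Arbitrary cap on line length; tune for your log format.
    const int MaxLineLength = 1_000_000;

    static void Main()
    {
        // Build a demo file whose last row has no trailing newline.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "row1\r\nrow2");

        using (var sr = new StreamReader(path))
        {
            var sb = new StringBuilder();
            var buffer = new char[4096];
            int read;
            while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    char c = buffer[i];
                    if (c == '\n')
                    {
                        // Emit a completed line, stripping the CR of a CRLF pair.
                        Console.WriteLine($"line: {sb.ToString().TrimEnd('\r')}");
                        sb.Clear();
                    }
                    else
                    {
                        sb.Append(c);
                        if (sb.Length > MaxLineLength)
                            throw new InvalidDataException("Line exceeds cap; file is likely corrupt.");
                    }
                }
            }
            // Final line that was not terminated by a newline.
            if (sb.Length > 0)
                Console.WriteLine($"line: {sb}");
        }
        File.Delete(path);
    }
}
```

This keeps memory bounded regardless of how the file is terminated, at the cost of reimplementing line splitting.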

Note: the file is being created from IIS Exchange Server. Without digging in with our IT group, the file appears to be cut off mid-creation, leaving the last row bad because it is missing data.

Research: I found a posting on SO (see below) that refers to using File.ReadLines. While it works on a much smaller file (i.e., < 2 GB) that is missing the newline character, it still fails on large files (i.e., > 2 GB).

https://stackoverflow.com/a/13416225

https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readlines?redirectedfrom=MSDN&view=netframework-4.7.2#System_IO_File_ReadLines_System_String_

Edit

The program fails at the while line in the code sample above. The problem is not with the code, but with the file. I cannot post our log files. But, to demonstrate, create a few rows of data in Notepad++, remove the newline character from the last row, and then run the file through the parser. StreamReader will blow up on the last row because it cannot find the end of the row.

Below is a copy of the log file with all data contents removed, except for the timestamp and the newline character at the end of each row. For the last row, I included the last data element (port number) before the data cuts off. Notice that the last row is missing the newline character.

[Screenshot of the log file: each row reduced to a timestamp ending in a newline; the last row ends at the port number with no newline]

J Weezy
  • `Note: the file is being created from IIS Exchange Server.` If that is generating incorrect files, I'd aim to fix the bug there. – mjwills Feb 11 '19 at 23:01
  • Really? That's a bit surprising. You might be able to write your own version (read the file contents in block by block and do your own line-by-line return management). I've never liked writing stuff like that, but it should be possible. Those references you refer to don't seem to address the issue you are talking about. – Flydog57 Feb 11 '19 at 23:02
  • By the way, catching an out of memory exception is pretty useless. At that point, you are out of memory - there isn't much you can do. – Flydog57 Feb 11 '19 at 23:04
  • I cannot reproduce this issue. The last line without the trailing `\r\n` is read and then the next call to `ReadLine()` returns null. Something else must be wrong. – John Wu Feb 11 '19 at 23:05
  • That's not the behavior of StreamReader.ReadLine(). Please post an actual [mcve] that demonstrates the issue. I can produce my own file missing a CRLF at the end for testing. – Ken White Feb 11 '19 at 23:06
  • For what it's worth, the SO reference you point to is an OOM situation, but it's the result of the program building a huge structure in memory from what he reads from the file (in his case, it's just a **very** large `List`, but still). Are you sure your OOM isn't the result of whatever you are building in memory? – Flydog57 Feb 11 '19 at 23:08
  • why not?.. `while (!sr.EndOfStream) { line = sr.ReadLine(); }` – Sorceri Feb 11 '19 at 23:11
  • @Sorceri a file with 4 TB of text without a single newline will OOM this code (which I believe is exactly what the OP is dealing with) – Alexei Levenkov Feb 11 '19 at 23:16
  • "StreamReader will blow up on the last row because it cannot find the end of the row"? - completely false. `StreamReader` treats end of file as end of string. – Alexei Levenkov Feb 11 '19 at 23:20
  • @AlexeiLevenkov he stated it was the last row that was cut off, so it was my understanding that was the row without the newline, but it should be the EOF – Sorceri Feb 11 '19 at 23:20
  • @Sorceri Correct - it's the last row. Basically, the file is being cut off by the extraction process. I have alerted our IT group so they can look into it. But I still need to handle these files. – J Weezy Feb 11 '19 at 23:24
  • @Sorceri with current edits it does look different. Still unclear what they face (also it looks like they have good sample that can be cleaned up to become real [MCVE] for this post so it can be answered) – Alexei Levenkov Feb 11 '19 at 23:24
  • @AlexeiLevenkov I am having to parse daily log extracts, so it is unknown which files are missing the NewLine character at the end of file (note: next line would be NULL so the loop would break). It seems like waiting for the code to throw an OutOfMemoryException is the only way to identify which files are missing the NewLine character at the end of the file. Otherwise, the entire loop fails and the file is deleted. I want to preserve parsed records. – J Weezy Feb 11 '19 at 23:27
  • @Flydog57 I know. I just posted my research so that the community can see what I have already looked at. That solution did not work. – J Weezy Feb 11 '19 at 23:31
  • @JWeezy you have a file that causes the issue - start trimming it until it can be posted as an example here - if whatever you are claiming to be the issue is true, then just 10-20 bytes would be enough (you can even show them as a hex dump). Even if you cannot, hopefully you'll at least narrow the issue down. – Alexei Levenkov Feb 11 '19 at 23:34
  • You're going in the wrong direction. `StreamReader` works fine without a newline at the end. If you have a multi-GB file without newlines you need to use [`StreamReader.Read()`](https://learn.microsoft.com/en-us/dotnet/api/system.io.streamreader.read?view=netframework-4.7.1), not `ReadLine`. – Dour High Arch Feb 12 '19 at 16:45
  • @DourHighArch Dour, can you please provide an answer with a code snippet for the While loop that implements the solution you recommend? Note: I am not having any problem with ReadLine() on multi-GB files as it works on other similarly large files. It is only failing where it cannot read the end of the line. – J Weezy Feb 12 '19 at 16:58
  • @J I really doubt the missing newline has anything to do with your problem. You need to post code that shows the problem because what you have posted does not. Code that parses a multi-GB text line is at [Extremely Large Single-Line File](https://stackoverflow.com/questions/26247952/). – Dour High Arch Feb 12 '19 at 17:05

2 Answers


This should work: check EndOfStream before trying to read the next line. I added some null checking as well.

while (!sr.EndOfStream)
{
  line = sr.ReadLine()?.Trim() ?? "";
  // Parse the line
}
Andrew
  • Thank you for your post. I tried the implementation above. Unfortunately, that did not work - I am still getting the OutOfMemoryException error. Any other thoughts or ideas? – J Weezy Feb 12 '19 at 15:46

I have confirmed with our IT group that the file was bad. What happened is that the original transfer over the network to my local machine experienced a hiccup. I re-transferred the file and it parsed successfully; there are also more rows now. What threw me off was that the file sizes on the network and on my local machine were identical, so I did not consider re-transmitting the file during my research efforts.

The file transfer process seems to first allocate the full file as empty and then start filling it with data. Good luck diagnosing this in extremely large files that cannot be opened by standard text editors (e.g., Notepad, Notepad++, Excel, etc.). I had to use UltraEdit, and then the problem became visible.

Per Hans Passant's comment on a related question (see link below), StreamReader's ReadLine() method handles large files just fine because it manages file-system caching internally. So OutOfMemoryExceptions should not be a problem. I assume his note was aimed at computers with insufficient memory rather than at bad files.
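Incidentally, the behavior the commenters described is easy to verify: ReadLine returns the final partial line and then null, even when the file has no trailing newline. A minimal check (the row contents below are made up):

```csharp
using System;
using System.IO;

class MissingNewlineRepro
{
    static void Main()
    {
        // Two rows; the second deliberately has no trailing CRLF.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "2019-02-11 00:00:01\r\n2019-02-11 00:00:02 443");

        using (var sr = new StreamReader(path))
        {
            string line;
            int count = 0;
            while ((line = sr.ReadLine()) != null)
            {
                count++;
                Console.WriteLine($"row {count}: {line}");
            }
            // ReadLine returned the unterminated last row, then null - no OOM.
            Console.WriteLine($"total rows: {count}");
        }
        File.Delete(path);
    }
}
```

This matches John Wu's and Ken White's comments above, which is consistent with the real culprit being the corrupted transfer rather than the missing newline itself.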

Thank you all for the troubleshooting and my apologies for any interruption.

Unable to read large log file with MemoryMappedViewStream

J Weezy