12

I have many large CSV files (1-10 GB each) which I'm importing into databases. For each file, I need to replace the first line so I can format the headers to match the column names. My current solution is:

using (var reader = new StreamReader(file))
{
    using (var writer = new StreamWriter(fixedFile)) // "fixed" is a reserved word in C#
    {
        var line = reader.ReadLine();
        var fixedLine = parseHeaders(line);
        writer.WriteLine(fixedLine);

        while ((line = reader.ReadLine()) != null)
            writer.WriteLine(line);
    }
}

What is a quicker way to only replace line 1 without iterating through every other line of these huge files?

Dinah
  • 52,922
  • 30
  • 133
  • 149
  • I'd probably just do this from the command line. `copy headerfile+csvfile newfile` You could make a batch file with all of the files that need to be changed. (Oh, except it looks like you don't know what the headers will be beforehand. If not, this won't help.) – Michael Todd Dec 06 '12 at 16:41
  • If the tool you're using happens to take input through `stdin`: instead of a file, you can create a stream composed of the headers and the body and feed that into your import tool when you need it. e.g. `cat headerfile bigfile | import_tool` – Clinton Pierce Dec 06 '12 at 16:44
  • This is very sensitive to the lengths of fixedLine and the first line. Can fixedLine ever be larger? Could line 2 move to the end of the file? – H H Dec 06 '12 at 16:46
  • have you looked at this? http://stackoverflow.com/questions/3016734/perl-how-do-i-remove-the-first-line-of-a-file-without-reading-and-copying-whole – Chirayu Shishodiya Dec 06 '12 at 16:47
  • Hmm, gets tricky when the bit you want to replace is a different size. Otherwise `MemoryMappedFiles` were looking like a good idea, http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx – Jodrell Dec 06 '12 at 16:48
  • 2
    seems relevant http://stackoverflow.com/questions/5560191/prepending-data-to-a-file – Jodrell Dec 06 '12 at 17:02
  • 2
    also relevant http://stackoverflow.com/questions/2563976/c-write-to-front-of-file – Jodrell Dec 06 '12 at 17:06
  • short of writing your own file system, seems the options are limited. – Jodrell Dec 06 '12 at 17:11
  • I know this is old, but for other people finding this: If you're instantly loading the file into the database via your own software, you could just use a reader to read the first line, start a transaction with your renamed column names and then continue reading line by line to fill your database. Easiest and fastest approach most likely, since you don't need to copy the whole file. – SharpShade Jun 06 '18 at 11:36

2 Answers

7

If you can guarantee that fixedLine is the same length as (or shorter than) line, you can update the files in place instead of copying them.
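The in-place idea can be sketched like this (shown in Python for brevity; the thread is C#, but the byte-level logic is identical, and `replace_header_in_place`, the ASCII header, and the space-padding choice are all illustrative assumptions):

```python
def replace_header_in_place(path, new_header):
    """Overwrite line 1 in place; the new header must fit in the old line's bytes.

    Pads with spaces (ignored by most CSV parsers) when the new header is
    shorter, and raises when it is longer, since that case forces a full
    rewrite of the file. Assumes the header is plain ASCII.
    """
    with open(path, "r+b") as f:
        old = f.readline()  # the original first line, newline included
        newline = b"\r\n" if old.endswith(b"\r\n") else b"\n"
        body = new_header.encode("ascii")
        room = len(old) - len(newline)  # bytes available for the header text
        if len(body) > room:
            raise ValueError("new header is longer than the old one")
        f.seek(0)
        # Write exactly len(old) bytes so nothing after line 1 moves.
        f.write(body + b" " * (room - len(body)) + newline)
```

Because exactly the same number of bytes is written back, the rest of the multi-gigabyte file is never touched.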

If not, you may get a modest performance improvement by accessing the .BaseStream of your StreamReader and StreamWriter and copying the remainder in large blocks (say, a 32 KB buffer). That at least eliminates the time spent scanning every character for end-of-line markers, which is what reader.ReadLine() does now.
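The block-copy variant, sketched in Python (illustrative only; `copy_with_new_header` is a made-up name, and `shutil.copyfileobj` plays the role of the raw BaseStream copy from the C# answer):

```python
import shutil

def copy_with_new_header(src, dst, new_header, bufsize=32 * 1024):
    """Copy src to dst, replacing line 1, without splitting the body into lines."""
    with open(src, "rb") as r, open(dst, "wb") as w:
        r.readline()  # skip the old header line
        w.write(new_header.encode("ascii") + b"\n")
        # Stream the remainder in large raw blocks; no per-character
        # end-of-line scanning as with line-by-line reads.
        shutil.copyfileobj(r, w, bufsize)
```

The whole file is still copied, but the body moves as opaque blocks rather than decoded, re-encoded lines.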

prprcupofcoffee
  • 2,950
  • 16
  • 20
  • If it were less, what would you do with the "free" space? Pad it with spaces? – Jodrell Dec 06 '12 at 17:08
  • 1
    Probably, but it would depend on the format. In a CSV file, I could quote the header elements, and add padding spaces after commas. – prprcupofcoffee Dec 06 '12 at 17:13
  • 1
    +1. @Jodrell, yes, spaces in most cases are ignored for CSV - so using them for padding would be OK (I tried to suggest it in my identical answer :) ). Spaces also safe for most encodings - so can be written directly into stream as bytes if needed. – Alexei Levenkov Dec 06 '12 at 17:14
6

The only thing that can significantly speed this up is genuinely replacing the first line in place. If the new first line is no longer than the old one, carefully overwrite the first line (with space padding if needed).

Otherwise, you have to create a new file and copy everything after the first line. You may be able to optimize the copy a bit by adjusting buffer sizes, copying as raw binary, or pre-allocating the output size, but that won't change the fact that you need to copy the whole file.

One more cheat if you're planning to drop the CSV data into a DB anyway: if row order does not matter, you can read some lines from the beginning, overwrite them with the new header, and append the removed lines to the end of the file.
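The reorder cheat can be sketched as follows (a Python illustration of the idea; `replace_header_reorder` is a hypothetical name, and it assumes an ASCII header and `\n` line endings for simplicity):

```python
import os

def replace_header_reorder(path, new_header):
    """Overwrite the start of the file with the new header, appending any
    displaced data rows to the end. Row order changes, which is fine for a
    bulk DB load where order does not matter."""
    header = new_header.encode("ascii")
    need = len(header) + 1  # +1 for the trailing newline
    with open(path, "r+b") as f:
        old_header = f.readline()  # dropped entirely
        taken = len(old_header)
        displaced = []  # data rows whose bytes the new header will reuse
        while taken < need:
            row = f.readline()
            if not row:
                raise ValueError("file too small to hold the new header")
            displaced.append(row)
            taken += len(row)
        f.seek(0)
        # Fill exactly the reclaimed bytes: header + space padding + newline.
        f.write(header + b" " * (taken - need) + b"\n")
        f.seek(0, os.SEEK_END)
        for row in displaced:
            f.write(row if row.endswith(b"\n") else row + b"\n")
```

Only a handful of lines near the start are read and rewritten, so the cost is constant regardless of file size.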

Side note: if this is a one-time operation, I'd simply copy the files and be done with it... Debugging code that inserts data into the middle of a text file, with potentially different encodings, may not be worth the effort.

Alexei Levenkov
  • 98,904
  • 14
  • 127
  • 179