2

I have a very large text file, over 1GB, and a list of integers that represent line numbers. I need to produce another file containing the text of those lines from the original file, in the order given by the list.

Example of original large file:

ogfile line 1
some text here
another line
blah blah

So when I get a List of "2,4,4,1" the output file should read:

some text here
blah blah
blah blah
ogfile line 1

I have tried `string lineString = File.ReadLines(filename).Skip(lineNumList[i]-1).Take(1).First();`

but this takes way too long, as the file has to be read in and skipped line by line to the line in question, then reread from the start the next time... and we are talking millions of lines in the 1GB file, while my List<int> holds thousands of line numbers.

Is there a better/faster way to read a single line, or have the reader skip to a specific line number without "skipping" line by line?

Tizz
  • Read the file once, then grab the lines you want from that array – Rufus L Oct 09 '19 at 17:13
  • I would say just read the whole thing into memory (`File.ReadAllLines()`) and just grabbing by index – maccettura Oct 09 '19 at 17:16
  • I would need lots of RAM to be able to store the whole file in memory. What if the file is 10GB? – Tizz Oct 09 '19 at 17:25
  • **Stop using text files as databases and start using databases as databases**. – Eric Lippert Oct 09 '19 at 19:03
  • But that said: do you know anything about the list? For instance, do you know that it will have many "duplicates" (your "4, 4" scenario for instance), or hardly any duplicates? Do you know roughly what fraction of the original lines will be fetched? You seem to imply that it is about 0.1%; is that accurate? And so on. Building a scalable solution often involves taking advantage of known features of the inputs. – Eric Lippert Oct 09 '19 at 19:06
  • @EricLippert It's actually an export from a database; we have to convert to a new data format (xml transformation along with other translations) then import into the new database. There are many files created to accomplish this nearly 1TB data update and transfer. – Tizz Oct 10 '19 at 06:37
  • @EricLippert There will be multiple duplicates, not necessarily in succession. The list will be smaller than the original list, but depending on the file, it could be anywhere from 0.1% to 80%. Trying to optimize the new file to remove unneeded information (and repetitions are required i.e. one old object into two new objects both needing the same field and value) – Tizz Oct 10 '19 at 06:40

4 Answers

5

The high-order bit here is: you are trying to solve a database problem using text files. Databases are designed to solve big data problems; text files, as you've discovered, are terrible at random access. Use a database, not a text file.

If you are hell-bent upon using a text file, what you have to do is take advantage of stuff you know about the likely problem parameters. For example, if you know that, as you imply, there are ~1M lines, each line is ~1KB, and the set of lines to extract is ~0.1% of the total lines, then you can come up with an efficient solution like this:

  • Make a set containing the line numbers to be read. The set must be fast to check for membership.
  • Make a dictionary that maps from line numbers to line contents. This must be fast to look up by key and fast to add new key/value pairs.
  • Read each line of the file one at a time; if the line number is in the set, add the contents to the dictionary.
  • Now iterate the list of line numbers and map the dictionary contents; now we have a sequence of strings.
  • Dump that sequence to the destination file.

We have five operations, so hopefully it is around five lines of code.

void DoIt(string pathIn, IEnumerable<int> lineNumbers, string pathOut)
{
  var lines = new HashSet<int>(lineNumbers);
  var dict = File.ReadLines(pathIn)
    // Select's index is 0-based; add 1 so keys match the question's 1-based line numbers
    .Select((lineText, index) => new KeyValuePair<int, string>(index + 1, lineText))
    .Where(p => lines.Contains(p.Key))
    .ToDictionary(p => p.Key, p => p.Value);
  File.WriteAllLines(pathOut, lineNumbers.Select(i => dict[i]));
}

OK, got it in six. Pretty good.
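
For instance, a call might look like this (the file paths here are just placeholders):

DoIt(@"c:\temp\bigfile.txt", new List<int> { 2, 4, 4, 1 }, @"c:\temp\extract.txt");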


Notice that I made use of all those assumptions; if the assumptions are violated then this stops being a good solution. In particular we assume that the dictionary is going to be small compared to the size of the input file. If that is not true, then you'll need a more sophisticated technique to get efficiencies.

Conversely, can we extract additional efficiencies? Yes, provided we know facts about likely inputs. Suppose for example we know that the same file will be iterated several times but with different line number sets, but those sets are likely to have overlap. In that case we can re-use dictionaries instead of rebuilding them. That is, suppose a previous operation has left a Dictionary<int, string> computed for lines (10, 20, 30, 40) and file X. If a request then comes in for lines (30, 20, 10) for file X, we already have the dictionary in memory.
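
One rough way to sketch that re-use (the LineCache and GetLines names are just illustrative, and it assumes the 1-based line numbers used in the question):

using System.Collections.Generic;
using System.IO;
using System.Linq;

class LineCache
{
    // One dictionary of already-extracted lines per file path.
    private readonly Dictionary<string, Dictionary<int, string>> cache =
        new Dictionary<string, Dictionary<int, string>>();

    public IEnumerable<string> GetLines(string path, IReadOnlyCollection<int> lineNumbers)
    {
        if (!cache.TryGetValue(path, out var known))
            cache[path] = known = new Dictionary<int, string>();

        // Only line numbers we have not already extracted require a scan of the file.
        var missing = new HashSet<int>(lineNumbers);
        missing.ExceptWith(known.Keys);

        if (missing.Count > 0)
        {
            int lineNumber = 0;
            foreach (var text in File.ReadLines(path))
            {
                lineNumber++;                 // 1-based, as in the question
                if (missing.Remove(lineNumber))
                    known[lineNumber] = text;
                if (missing.Count == 0)
                    break;                    // stop once every requested line is found
            }
        }

        // Preserves order and duplicates; throws if a requested line is past the end of the file.
        return lineNumbers.Select(i => known[i]);
    }
}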

The key thing I want to get across in this answer is that you must know something about the inputs in order to build an efficient solution; the more restrictions you can articulate on the inputs, the more efficient a solution you can build. Take advantage of all the knowledge you have about the problem domain.

Eric Lippert
  • If you wanted to save memory, wouldn't it make sense to use `File.ReadLines` instead of `File.ReadAllLines` as the former will iterate over the file line by line instead of reading it into memory all at once? – germi Oct 10 '19 at 08:13
  • @germi: Yes, and I intended to write it as you suggest, and do not know why my fingers did not obey! – Eric Lippert Oct 10 '19 at 13:33
3

Use a StreamReader so you don't have to read the entire file, only up to the last desired line, and store those lines in a Dictionary for fast lookup later.

Edit: Thanks to Eric Lippert, I included a HashSet for fast lookup.

List<int> lineNumbers = new List<int>{2,4,4,1};
HashSet<int> lookUp = new HashSet<int>(lineNumbers);  // de-duplicated set for fast membership checks
Dictionary<int,string> lines = new Dictionary<int,string>();

using(StreamReader sr = new StreamReader(inputFile)){
    // Only read as far as the last line we actually need
    int lastLine = lookUp.Max();
    for(int currentLine = 1; currentLine <= lastLine; currentLine++){
        if(lookUp.Contains(currentLine)){
            lines[currentLine] = sr.ReadLine();
        }
        else{
            sr.ReadLine();  // skip lines we don't need
        }
    }
}
using(StreamWriter sw = new StreamWriter(outputFile)){
    // Write in the order (and with the duplicates) given by the original list
    foreach(var line in lineNumbers){
        sw.WriteLine(lines[line]);
    }
}
Magnetron
  • As I noted in the comments to the other answer that gives this solution: the problem is that the check for containment in the list can be inefficient if the list is large. See my answer for some ways you can make this code shorter, faster, and easier to follow. – Eric Lippert Oct 09 '19 at 19:36
  • @EricLippert thanks for the tip. I edited it to use a HashSet. You've got very good points in your answer, upvoted it. – Magnetron Oct 09 '19 at 19:56
2

You may use a StreamReader and its ReadLine method to read line by line without loading the whole file into memory:

var lines = new Dictionary<int, string>();
var indexesProcessed = new HashSet<int>();
var indexesNew = new List<int> { 2, 4, 4, 1 };
int lastIndex = indexesNew.Max();

using ( var reader = new StreamReader(@"c:\file.txt") )
  // Read up to the highest requested line number, stopping early at end of file
  for ( int index = 1; index <= lastIndex; index++ )
    if ( reader.Peek() >= 0 )
    {
      string line = reader.ReadLine();
      if ( indexesNew.Contains(index) && !indexesProcessed.Contains(index) )
      {
        lines[index] = line;
        indexesProcessed.Add(index);
      }
    }

using ( var writer = new StreamWriter(@"c:\file-new.txt", false) )
  foreach ( int index in indexesNew )
    if ( indexesProcessed.Contains(index) )
      writer.WriteLine(lines[index]);

It reads the file, selects the lines at the desired indexes, and then saves them in the desired order.

We use a HashSet to store processed indexes to speed up the Contains calls, since you indicate the file can be over 1GB.

The code is written to avoid out-of-range errors in case of a mismatch between the source file and the desired indexes, but this slows down the process. If you are sure there will be no such problem, you can optimize by removing all usage of indexesProcessed, as in the simplified sketch after the output below.

Output:

some text here
blah blah
blah blah
ogfile line 1
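
If you are sure there will be no mismatch, a simplified version without indexesProcessed (same file paths and 1-based indexes as above; it assumes the file contains at least Max(indexesNew) lines) might look roughly like this:

var lines = new Dictionary<int, string>();
var indexesNew = new List<int> { 2, 4, 4, 1 };
int lastIndex = indexesNew.Max();

using ( var reader = new StreamReader(@"c:\file.txt") )
  for ( int index = 1; index <= lastIndex; index++ )
  {
    string line = reader.ReadLine();  // assumes the file has at least lastIndex lines
    if ( indexesNew.Contains(index) )
      lines[index] = line;
  }

using ( var writer = new StreamWriter(@"c:\file-new.txt", false) )
  foreach ( int index in indexesNew )
    writer.WriteLine(lines[index]);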
  • This is a good attempt, however it would be better to have *three* data structures: your dictionary, your index list, *and an index set*. The problem here is that the `indexesNew.Contains(index)` check is O(n). If you initialize a `HashSet` from `indexesNew` then checking the containment in the hash set is O(1). – Eric Lippert Oct 09 '19 at 19:09
  • I'm not used to using HashSet, thanks. Answer updated. Found that: https://stackoverflow.com/questions/150750/hashset-vs-list-performance. –  Oct 09 '19 at 19:15
  • See my answer for some thoughts on how you can make this code much shorter and easier to follow. – Eric Lippert Oct 09 '19 at 19:31
  • But `indexesNew` can't be a `HashSet` because of duplicates, isn't it? –  Oct 09 '19 at 19:44
  • Right, you need all three: the original list, out of order and containing duplicates, the de-duplicated fast-search set built from that list, and the dictionary. – Eric Lippert Oct 09 '19 at 19:46
  • Replacing `Contains` calls on the `List` with a `HashSet` is a great idea. The difference is enormous for big collections, as `HashSet` stays at the same speed while the `List` lookups take longer and longer as it grows! – Oct 09 '19 at 19:47
-2

One way to do this would be to simply read the input file once (and store the result in a variable), and then grab the lines you need and write them to the output file.

Since the line number is 1-based and arrays are 0-based (i.e. line number 1 is array index 0), we subtract 1 from the line number when specifying the array index:

static void Main(string[] args)
{
    var inputFile = @"f:\private\temp\temp.txt";
    var outputFile = @"f:\private\temp\temp2.txt";

    var fileLines = File.ReadAllLines(inputFile);
    var linesToDisplay = new[] {2, 4, 4, 1};

    // Write each specified line in linesToDisplay from fileLines to the outputFile
    File.WriteAllLines(outputFile, 
        linesToDisplay.Select(lineNumber => fileLines[lineNumber - 1]));

    GetKeyFromUser("\n\nDone! Press any key to exit..."); // GetKeyFromUser is the author's own helper, not shown here
}

Another way to do this that should be more efficient is to only read the file up to the maximum line number (using the ReadLines method), rather than reading the whole file (using the ReadAllLines method), and save just the lines we care about in a dictionary that maps the line number to the line text:

static void Main(string[] args)
{
    var inputFile = @"f:\private\temp\temp.txt";
    var outputFile = @"f:\private\temp\temp2.txt";

    var linesToDisplay = new[] {2, 4, 4, 1};
    var maxLineNumber = linesToDisplay.Max();
    var fileLines = new Dictionary<int, string>(linesToDisplay.Distinct().Count());

    // Start lineNumber at 1 instead of 0
    int lineNumber = 1;

    // Just read up to the largest line number we need 
    // and save the lines we care about in our dictionary
    foreach (var line in File.ReadLines(inputFile))
    {
        if (linesToDisplay.Contains(lineNumber))
        {
            fileLines[lineNumber] = line;
        }

        // Increment our lineNumber and break if we're done
        if (++lineNumber > maxLineNumber) break;
    }

    // Write the output to our file
    File.WriteAllLines(outputFile, linesToDisplay.Select(line => fileLines[line]));

    GetKeyFromUser("\n\nDone! Press any key to exit..."); // GetKeyFromUser is the author's own helper, not shown here
}
Rufus L