
I have a file with 5000+ lines. I want the most efficient way to choose one of those lines each time I run my program. I had originally intended to pick one at random (that was before I knew there were 5000 lines). That seemed as if it might be inefficient, so I considered reading the first line, deleting it from the top, and appending it to the bottom. But it seems I'd have to read the whole file and create a new file in order to delete from the top.

What is the most efficient way: the random method or the new file method?

The program will be run every 5 minutes, and I'm using C# on .NET 4.5

Loofer
Angela Baines
    Seek to a random offset in the file, then scan forward for a newline character. Read data until the next newline. Take your precautions with the end of the file. The probability won't be uniform if the lines have big length differences though. Oh, and 5000 isn't *that* much ;-) – Lucas Trzesniewski Sep 01 '15 at 09:57
  • Break it into 50 files with 100 lines each: a random number 0-49 for the file, a random number 0-99 for the line. Having said that, reading 5000 lines every 5 minutes is still not a big issue ... not efficient, but not a real issue. If this is your only problem with the app, you're good :) – Noctis Sep 01 '15 at 10:01
  • how large is the file in total? – olydis Sep 01 '15 at 10:29
  • You should seek to a random offset in the file, as @LucasTrzesniewski wrote. But keep in mind that this solution does not work if your file uses a variable-length character encoding, such as UTF-8. – Mark Shevchenko Sep 01 '15 at 10:47
  • have you check this one? http://stackoverflow.com/questions/15786612/read-text-file-at-specific-line – ken lacoste Sep 01 '15 at 10:59
  • Efficiency is a measure of performance versus cost, but you have not said either what aspect of performance you are interested in or what aspect of cost you are interested in. If you are unable to characterize performance and cost then no one can answer this question for you; if you are able to measure both then *measure them both ways and you will know the answer*. – Eric Lippert Sep 01 '15 at 13:09
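Eric Lippert's suggestion to "measure them both ways" can be done with a `Stopwatch`. Here is a minimal timing sketch for one of the candidates (the file path is a placeholder; the second candidate would be timed the same way):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Benchmark
{
    static void Main()
    {
        var file = @"C:\text.txt"; // placeholder path

        var sw = Stopwatch.StartNew();

        // candidate: pick a random line by skipping into a lazy read
        int count = File.ReadLines(file).Count();
        string line = File.ReadLines(file).Skip(new Random().Next(count)).First();

        sw.Stop();
        Console.WriteLine("random-line method: {0} ms ({1})", sw.ElapsedMilliseconds, line);
    }
}
```

Running each approach in a loop and averaging would give more stable numbers than a single measurement.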

4 Answers


In .NET 4.*, it is possible to read a single line of a file lazily, without loading the whole file into memory. For example, to get line X:

string line = File.ReadLines(FileName).Skip(X).First();

Full example:

var fileName = @"C:\text.txt";
var file = File.ReadLines(fileName).ToList(); // read the file into memory once
int count = file.Count;
Random rnd = new Random();
int skip = rnd.Next(0, count);
string line = file[skip]; // the list supports direct indexing
Console.WriteLine(line);
randoms
  • `File.ReadLines(FileName).Skip(X).Take(1).First()` can be simplified to `File.ReadLines(FileName).Skip(X).First()` – olydis Sep 01 '15 at 11:02
  • Your full example is reading the entire file into memory twice. – theB Sep 01 '15 at 11:17
  • That is true, I've edited the code. Execution time on my system went from ~8ms to ~6ms by reading the file only once. Thanks – randoms Sep 01 '15 at 12:21
  • you've also got `var file` declared twice too, but looks like it should be `var fileName` for the first declaration – Simon Price Sep 01 '15 at 13:08
  • Your new code is reading the entire file into memory. Depending on Angela's (or a future visitor's) definition of "Large Text File", this may cause page faulting. – Brian Sep 01 '15 at 20:49

Let's assume the file is so large that you cannot afford to fit it into RAM. Then you would want to use reservoir sampling, an algorithm designed to pick uniformly at random from a list of unknown, arbitrary length that might not fit into memory:

Random r = new Random();
int currentLine = 1;
string pick = null;
foreach (string line in File.ReadLines(filename)) 
{
    if (r.Next(currentLine) == 0) { // line N replaces the pick with probability 1/N
        pick = line;
    }
    ++currentLine;
}   
return pick;

This algorithm is slightly unintuitive. At a high level, it works by giving line N a 1/N chance of replacing the currently selected line. Thus line 1 is always selected at first, but has a 50% chance of being replaced by line 2, and so on.

I've found understanding this algorithm to be easiest in the form of a proof of correctness. So, a simple proof by induction:

1) Base case: by inspection, the algorithm works if there is 1 line.
2) Inductive step: if the algorithm works for N-1 lines, it also works for N lines, because:
3) after processing N-1 iterations of an N-line file, each of the first N-1 lines is equally likely (probability 1/(N-1));
4) the final iteration ensures that line N has probability 1/N (because that is what the algorithm explicitly assigns it, and it is the final iteration), reducing the probability of each previous line to:

1/(N-1) * (1-(1/N))  
1/(N-1) * (N/N-(1/N))  
1/(N-1) * (N-1)/N  
(1*(N-1)) / (N*(N-1))  
1/N

If you know how many lines are in the file in advance, this algorithm is more expensive than necessary, as it always reads the entire file.
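If the proof doesn't convince you, a quick simulation will. The sketch below wraps the loop from this answer in a helper and runs it 100,000 times over an in-memory list of four strings standing in for file lines; each line should be picked close to 25% of the time:

```csharp
using System;
using System.Collections.Generic;

class ReservoirDemo
{
    // Pick one item uniformly from a sequence using reservoir sampling,
    // exactly as in the loop above.
    static string Pick(IEnumerable<string> lines, Random r)
    {
        int currentLine = 1;
        string pick = null;
        foreach (string line in lines)
        {
            if (r.Next(currentLine) == 0)
                pick = line;
            ++currentLine;
        }
        return pick;
    }

    static void Main()
    {
        string[] lines = { "a", "b", "c", "d" };
        var counts = new Dictionary<string, int>();
        var r = new Random();
        for (int i = 0; i < 100000; i++)
        {
            string p = Pick(lines, r);
            int n;
            counts.TryGetValue(p, out n);
            counts[p] = n + 1;
        }
        // Each of the 4 lines should come out near 25000.
        foreach (var kv in counts)
            Console.WriteLine(kv.Key + ": " + kv.Value);
    }
}
```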

Brian

I assume that the goal is to randomly choose one line from a file of 5000+ lines.

Try this:

  1. Get the line count using File.ReadLines(file).Count().
  2. Generate a random number, using the line count as an upper limit.
  3. Do a lazy read of the file with File.ReadLines(file).
  4. Choose a line from this sequence using the random number (e.g. with `Skip`).

EDIT: as pointed out, doing `File.ReadLines(file).ToArray()` is pretty inefficient.
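Put together, the steps above look roughly like this (a sketch; note that steps 1 and 3 each enumerate the file, so it is read twice):

```csharp
using System;
using System.IO;
using System.Linq;

class RandomLine
{
    static void Main()
    {
        var file = @"C:\text.txt"; // path assumed for illustration
        // 1. Get the line count (first pass over the file).
        int count = File.ReadLines(file).Count();
        // 2. Generate a random number below the line count.
        int index = new Random().Next(count);
        // 3-4. Lazily read the file again and take the chosen line.
        string line = File.ReadLines(file).Skip(index).First();
        Console.WriteLine(line);
    }
}
```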

John Go-Soco
  • From all the things suggested in the comments so far, this would be the most inefficient solution. Both of your calls to `File.ReadLines` will read the entire file (calling `ToArray` makes step 3 **anything but lazy**) - other than that: yes you assume right, since this is exactly what he asked – olydis Sep 01 '15 at 10:27
  • Oh, absolutely right. I'll remove the ToArray() method call. But you are also right in that this is not exactly the most efficient method anyway. – John Go-Soco Sep 01 '15 at 10:36
  • Still you read the file more than once: `File.ReadAllLines` would be faster, as in http://stackoverflow.com/questions/3745934/read-random-line-from-a-file-c-sharp – olydis Sep 01 '15 at 10:38
  • I wouldn't do `File.ReadAllLines`, as that returns a giant array (which is what `File.ReadLines().ToArray()` does as well). The point of using `File.ReadLines` is that it returns an IEnumerable. As in: http://stackoverflow.com/questions/21969851/what-is-diffrence-between-file-readlines-and-file-readalllines – John Go-Soco Sep 01 '15 at 10:41
  • I am well aware of that... depending on the hardware, for <10MB files I bet one bulk read will be faster than 2 semi-lazy runs. I'm picky, your approach is fine of course. ;) – olydis Sep 01 '15 at 11:01

Here's a quick implementation of the method @LucasTrzesniewski proposed in the comments on the question:

// open the file
using(FileStream stream = File.OpenRead("yourfile.dat"))
{
    // 1. index all offsets that are the beginning of a line
    List<long> lineOffsets = new List<long>();
    lineOffsets.Add(stream.Position); // the very first offset is the beginning of a line!
    int ch;
    while((ch = stream.ReadByte()) != -1) // "-1" denotes the end of the file
    {
        if(ch == '\n')
            lineOffsets.Add(stream.Position);
    }
    // if the file ends with a newline, the last offset points at EOF - drop it
    if (lineOffsets.Count > 1 && lineOffsets[lineOffsets.Count - 1] == stream.Length)
        lineOffsets.RemoveAt(lineOffsets.Count - 1);

    // 2. read a random line
    // set the position of the stream to one of the previously saved offsets
    stream.Position = lineOffsets[new Random().Next(lineOffsets.Count)];
    // read the whole line from the specified offset
    using(StreamReader reader = new StreamReader(stream))
    {
        Console.WriteLine(reader.ReadLine());
    }
}

I don't have any VS near me at the moment, so this is untested.

KeyNone
  • This could potentially blow up if you have a file with multi-byte characters like UTF-8 (which can take 1-6 bytes per character) and the offset you randomly pick happens to be in the middle of one of those characters. – Scott Chamberlain Sep 01 '15 at 12:56