How can I return a random line from a file? Interview Question

Question

I am preparing for a phone interview. I came upon these questions on the internet. Can anyone tell me some good answers for these?

Suppose I give you a text file and ask you a to write a program that will return a random line from the file (all lines must have equal probability to be returned)
Same as part 1, except this time the entire text file cannot fit into main memory
Same as part 2, except now you have a stream instead of a file.

Please help.

Ok...@Everyone, I really had a few ideas in my mintod before asking this...Seeing the relentless attack by my fellow SOers, I am posting my answers. Please feel free to attack them too...

1: Count the number of '\n' in the file. Generate a random number between 1 and the number and return the line after the number-1 '\n'.

2: Bring the file into main memory part by part and follow the above procedure.

3: I dont have much idea about this and would appreciate any inputs.

Its wonderful that you guys really give an inspiration to push forward.....

@Adam: wait, what's wrong with asking programming related questions on SO? — Konrad Rudolph, Jan 06 '10 at 21:10
Are you planning to let Stack Overflow do your work when you get hired? — Jimmy Shelter, Jan 06 '10 at 21:11
Why not post what answers you have here and then we can suggest things based on that? — John, Jan 06 '10 at 21:11
The question could be interesting (although imo this one isn't that interesting), but the motives to ask it is very wrong. — Jimmy Shelter, Jan 06 '10 at 21:12
If he can't answer these questions already, then what is the point of the interview? — dreamlax, Jan 06 '10 at 21:12
Doesn't seem very ethical to me to help someone pass a tech screen that they can't otherwise pass, especially when it covers very basic programming questions. — Adam Crossland, Jan 06 '10 at 21:12
At least improve the name of your question tp target the particular questions you are asking. I don't want to see a hundred "interview Question" "questions" on SO. — reinierpost, Jan 06 '10 at 21:12
Have you guys read _I am preparing for a phone interview_? What's wrong about preparing to answer questions? For all we know those question's won't even come up in the interview. — Georg Schölly, Jan 06 '10 at 21:16
If he is preparing for a phone interview, why doesn't he post his solutions to the problems for critique? He might learn something from that and thus be better prepared. — Adam Crossland, Jan 06 '10 at 21:20
Duplicate: http://stackoverflow.com/questions/232237/whats-the-best-way-to-return-a-random-line-in-a-text-file-using-c — Mark Ransom, Jan 06 '10 at 21:22
@Adam: I was reading your initial response as being a little inflammatory, but your bigger point does make sense -- these questions are much easier and more relevant if the original poster does a little bit of homework first. — Jim Dagg, Jan 06 '10 at 21:22
Jim -- you're right about my initial comment. It did nothing to illuminate the point that I wanted to make, so I have deleted it. My reaction was somewhat emotional; as a person who does a great deal of phone screening of potential programmers, it irritates me to deal with people who think that they can fake their way into a programming job. It galls me even more to have someone ask me to do all of the work for them. — Adam Crossland, Jan 06 '10 at 21:25
I have voted to close as 'not a real question' simply for the following reason: when asked **as an interview question**, this question is not about finding 'the correct' (or even 'a correct') answer. For it to be posted here as a question (or indeed for answers to be given) suggests that asker (or answerer) believes such an answer exists. Which it doesn't. — AakashM, Jan 06 '10 at 23:16

score 23 · Answer 1 · edited Jan 06 '10 at 23:32

23

Read all lines into an array, return a random line in the range of 1 and the amount of lines.
Most simple: Count the lines, choose a line number at random, go through the file a second time and return it.

You just have to remember one line. Each new line has a probability of 1/N (N being lines read).

Pseudocode:

i = 1
chosen_line = ""
for line in lines:
    if random() < 1/i:    # random returns a uniform random number in [0,1)
        chosen_line = line
    i += 1
return chosen_line

Algorithm number 3 could be used for 1 and 2 too.

edited Jan 06 '10 at 23:32

Alok Singhal

93,253
21
125
158

answered Jan 06 '10 at 21:15

Georg Schölly

124,188
49
220
267

2

Your solution #3 is correct but perhaps a bit confusing... to clarify, at each line you read, the chance that you should choose the new line will be 1/N where N is the number of lines you've read. Saying to "choose" (1,2,3) for example is unnecessary and (IMO) confusing. Just keep track of which line you chose last, and update percentages as you go. +1. – Michael Bray Jan 06 '10 at 21:22
@dreamlax: that's the point of my last comment... you ONLY have to keep track of the one line that you've chosen, and each new line that you read will have a 1/N chance of replacing that line. N is the number of lines read "so far", not the total number of lines in the file. – Michael Bray Jan 06 '10 at 21:23
1: No real sense reading in the whole file into an array if you only want one line. 2: Or better yet (though less random, probably): fstat() to get the file size, pick a random point, and read forward/backward from that point until you have a full line of text. – KingRadical Jan 06 '10 at 21:23
1

Solution 3 works and is the same that I came up with. It's apparently a well known algorithm too. It does *not* require the entire file in memory, just the most recently selected line. See http://stackoverflow.com/questions/232237/whats-the-best-way-to-return-a-random-line-in-a-text-file-using-c – Mark Ransom Jan 06 '10 at 21:24
@KingRadical: that depends... on how big the file is, how fast your system is, etc etc... reading the entire file at once probably maximizes the efficiency of disk thruput, if that turns out to be your bottleneck. – Michael Bray Jan 06 '10 at 21:25
@Michael Bray: Sorry, I deleted my comment, I just realised after I re-read it. – dreamlax Jan 06 '10 at 21:34
I don't think you can argue much about question 1 and 3, they are quite clear. The real question is if there's a better approach to 2. – Georg Schölly Jan 06 '10 at 21:36
1

gs: small comment error in your code... random() should NOT return 1 - it must be 0 (inclusive) to 1 (non-inclusive) otherwise there is a chance (however small) that a one-line file wouldn't select any line. either that or change < to <= – Michael Bray Jan 06 '10 at 21:43
@gs: on whether there is a better approach to #2... I think within the context of the question, the fundamental difference between #2 and #3 is that #3 is implying a forward-only approach, meaning you can't re-read the data, which is why the algorithm provided is necessary. To me, this implies that you should utilize that extra capability in #2. Thus, I think your answer for #2 is about as good as there is. – Michael Bray Jan 06 '10 at 21:45
The real problem I see with #2, which is a fairly common optimization problem, is that then you are iterating over a data set twice when you really do not need to. The actual time difference between that and an algorithm that only iterates once, or does not iterate over the data set at all may be negligible for small files, but it is inherently less efficient. – KingRadical Jan 06 '10 at 21:49
To me it looks like those are iterative questions. The first one is super-simple, then the interviewee has to think about the second one, the first thing he'll think of is looping twice over it. The third is deliberately not solvable with the approach used in the second. My point is, the second one might be only there to lead to interviewee into the wrong corner. – Georg Schölly Jan 06 '10 at 21:55
2

@KingRadical: picking a random point and getting that line won't give every line equal probability to be picked up. Say, file with two lines, the first line is 10 characters, the second one is 30 characters. The second line has 75% chance of being picked, instead of 50%. – sbk Jan 06 '10 at 21:55
Use your heads people - if you have a solution to problem number 3 (read from a stream where you don't know the length ahead of time), then you ALREADY have an excellent solution to the first two cases! – Stephen C. Steel Jan 06 '10 at 21:57
In solution 3, how can you choose a random line when you don't know the total number of lines? Otherwise, two passes must be made (at least). – Thomas Matthews Jan 06 '10 at 21:58
@Stephen: in line with @gs comments, of course #3 will work.. that's not the point of the three questions... as he said, they are iterative in nature. – Michael Bray Jan 06 '10 at 22:01
@Thomas: no you don't need two passes. Basically, at each line you read, you determine the chance that the current line WOULD HAVE been picked if it was the last line in the file. The chances of choosing this line decrease with each new line that you read. – Michael Bray Jan 06 '10 at 22:02
@gs: I hope you don't mind, but I changed your comment about "random". It should return a uniform random integer in [0,1), not [0,1]. Otherwise, the first line has a slightly smaller probability of being selected. As Michael said, for 1-line file, there is an infinitesimally small probability of not selecting any line. – Alok Singhal Jan 06 '10 at 23:34
@Thomas: if the stream only has one line, then that line has a 1 in 1 chance of being picked. If the stream has two lines, then each line has a 1 in 2 chance of being picked. With each new line read, the probability that it will be picked is 1/N, where N is the number of lines read so far. – John Bode Jan 06 '10 at 23:46
@Alok: Thanks, I'm always glad when someone fixes something in my answers. I'm still trying to figure out if [0,1) or [0,1] with <= is correct. Does it make difference? – Georg Schölly Jan 07 '10 at 08:42
1

@jslap: in practice, it works quite well (code up a prototype to play with and see). After reading N lines, the chance that the selected line is still the first line read is 1 in N-1. – John Bode Jan 07 '10 at 18:03
The quality starts to degrade after some time because of inaccuracies of doubles. (Unless you create your own random generator that works with fractions of course.) – Georg Schölly Jan 07 '10 at 21:47
@jslap: it is not so. I have included a proof in my answer. – Alok Singhal Jan 12 '10 at 07:46
@gs: Mathematically, I think [0,1) with < or [0,1] with <= are equivalent because of infinite real numbers in [0,1]. In floating-point terms, it *might* make a difference, but I don't think so. – Alok Singhal Jan 12 '10 at 07:47
Sorry. I agree it is a good solution. – jslap Jan 13 '10 at 14:31

Alok Singhal · Answer 2 · 2010-01-07T00:24:50.567

You can do this without having to read all the lines in memory, thus working well for huge files. Pseudocode:

linenum := 0
ret := ''
while more lines to read:
    line := readline()
    linenum := linenum + 1
    r := uniform_random(0, linenum)
    if r < 1:
        ret := line

return ret

Proof: We begin by noting that we always save the first line in ret. If the file has one line, you are going to choose it, and you're done.

For two-line file, ret will save the first line 100% of the times, and the second line will be saved in ret 50% of the time during the second iteration of the loop. Thus, each line has a probability of 0.5 of being selected.

Now, let's assume that this method works for files of ≤N lines. To prove that this means it works for N+1, in the (N+1)th iteration of the loop, there is a probability of 1/(N+1) that the last line will be selected (random(0, N+1) < 1 has that probability). Thus, the last line has 1/(N+1) probability of being selected. The probability of all other lines being selected is still going to be equal to each other, let's call this x. Then, N*x + 1/(N+1) == 1, which means that x = 1/(N+1).

Proof by induction is complete.

Edit: Oops, didn't see the first answer's third method before replying. Still, I will keep this post here if only for the proof, and the opportunity for other people to correct it if there are any errors in it.

@jslap: yes. I did it by induction because it was an interesting exercise for me. :-) — Alok Singhal, Jan 13 '10 at 22:54

score 1 · Answer 3 · answered Jan 06 '10 at 21:14

Re 1: Use solution to 2

Re 2: You would want to scan the whole file using a RandomAccessFile access to count the number of lines and (possibly) cache the file pointers for each start of line. Then you could choose one at random (I'm assuming this question is not about how to generate random numbers) and move back to that start point, read the line and return it. If you want it fast then make sure you are buffering the reads (raf is v slow otherwise).

Re 3: If the stream doesn't fit in memory (i.e. you cannot cache the whole thing) and you don't know how many lines are in the stream without reading the whole stream (assuming you only get to read it once) then I cannot see a solution. I too wait for answers...

you can do it without knowing the number of lines and without reading all the lines in memory. See my answer for details. — Alok Singhal, Jan 06 '10 at 23:08

score -1 · Answer 4 · answered Jan 06 '10 at 21:27

-1

#3: write the stream to a file on disk and use solution 2. Not the most efficient solution, of course, but very simple.

answered Jan 06 '10 at 21:27

Doc Brown

19,739
7
52
88

#4: the stream doesn't fit on disk, (if you prefer: the device doesn't have a writeable filesystem). At least, that's the next thing I'd say in an interview, assuming I was setting this problem in the first place. – Steve Jessop Jan 06 '10 at 23:02
Yep, but that was not the OPs question ;-) – Doc Brown Jan 07 '10 at 06:44

How can I return a random line from a file? Interview Question

4 Answers4