
Is there a better (pre-existing, optionally Java 1.6) solution than creating a streaming file reader class that meets the following criteria?

  • Given an ASCII file of arbitrarily large size where each line is terminated by a \n
  • For each invocation of some method readLine(), read a random line from the file
  • And for the life of the file handle, no call to readLine() should return the same line twice

Update:

  • All lines must eventually be read

Context: the file's contents are created from Unix shell commands that produce a directory listing of all paths contained within a given directory; there are between millions and a billion files (which yields millions to a billion lines in the target file). If there is some way to randomly distribute the paths into the file at creation time, that is an acceptable solution as well.

cfeduke
    Can you pad the lines so each is the same length? – McDowell Jan 15 '13 at 13:31
  • Yes I could pad the lines. – cfeduke Jan 15 '13 at 13:35
  • If every record were the same length you can calculate the number of lines from the file length and seek to a line start by multiplying the record number by the length. – McDowell Jan 15 '13 at 13:38
  • As soon as you asked that question I realized that padding would help reduce the complexity of the solution greatly. – cfeduke Jan 15 '13 at 13:48
  • To eliminate duplicates, move the last record in the file to the position you've last read and truncate the last record from the file - assuming error handling for randomly getting to the end and the ability to destroy the file as part of processing. – McDowell Jan 15 '13 at 13:51
  • Can there be duplicate lines in the file? – fge Jan 15 '13 at 13:59
  • There could potentially be duplicates in a general solution so I won't rule them out, but it's not possible in my particular case assuming the Unix shell tools don't have any bugs. – cfeduke Jan 15 '13 at 14:40
  • Another question: do you need to read the file fully, ie each and every line, or only some lines from it? – fge Jan 15 '13 at 14:51
  • By duplicates, I meant hitting the same line more than once by chance – McDowell Jan 15 '13 at 15:03
  • @McDowell I was replying to fge's question about duplicate lines – cfeduke Jan 15 '13 at 15:08
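
A minimal sketch of the padded, fixed-length-record approach McDowell describes in the comments above (RECORD_LENGTH and the file path are assumptions, and IOException handling is omitted); note that it consumes the file destructively, as suggested:

final int RECORD_LENGTH = 512; // assumed padded line length, including the trailing '\n'
RandomAccessFile file = new RandomAccessFile("path-to-file", "rw");
Random rnd = new Random();
long recordCount = file.length() / RECORD_LENGTH;
byte[] record = new byte[RECORD_LENGTH];
byte[] last = new byte[RECORD_LENGTH];
while (recordCount > 0)
{
    // seek straight to a random record: record number * record length
    long chosen = (long) (rnd.nextDouble() * recordCount);
    file.seek(chosen * RECORD_LENGTH);
    file.readFully(record);
    String line = new String(record, "US-ASCII").trim();
    // process(line) ...

    // move the last record over the one just read, then truncate the file,
    // so the same line can never be returned twice
    file.seek((recordCount - 1) * RECORD_LENGTH);
    file.readFully(last);
    file.seek(chosen * RECORD_LENGTH);
    file.write(last);
    file.setLength((recordCount - 1) * RECORD_LENGTH);
    recordCount--;
}
file.close();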

4 Answers


In order to avoid reading in the whole file, which may not be possible in your case, you may want to use a RandomAccessFile instead of a standard Java FileInputStream. With RandomAccessFile, you can use the seek(long position) method to skip to an arbitrary place in the file and start reading there. The code would look something like this.

RandomAccessFile raf = new RandomAccessFile("path-to-file","rw");
HashMap<Integer,String> sampledLines = new HashMap<Integer,String>();
for(int i = 0; i < numberOfRandomSamples; i++)
{
    //seek to a random point in the file
    raf.seek((long)(Math.random()*raf.length()));

    //skip from the random location to the beginning of the next line
    int nextByte = raf.read();
    while(nextByte != -1 && ((char)nextByte) != '\n')
        nextByte = raf.read();
    //wrap around to the beginning of the file if you reach the end
    if(nextByte == -1)
        raf.seek(0);

    //read the line into a buffer
    StringBuilder lineBuffer = new StringBuilder();
    nextByte = raf.read();
    while(nextByte != -1 && (((char)nextByte) != '\n'))
    {
        lineBuffer.append((char)nextByte);
        nextByte = raf.read();
    }

    //ensure uniqueness; if this line was already sampled, draw again
    String line = lineBuffer.toString();
    if(sampledLines.containsKey(line.hashCode()))
        i--;
    else
        sampledLines.put(line.hashCode(), line);
}

Here, sampledLines should hold your randomly selected lines at the end. You may need to check that you haven't randomly skipped to the end of the file as well to avoid an error in that case.

EDIT: I made it wrap to the beginning of the file in case you reach the end. It was a pretty simple check.

EDIT 2: I made it verify uniqueness of lines by using a HashMap.

Paul Bellora
    The primary issue I see with this is that as `i` gets large it will take longer and longer to find an un-read line--potentially a *very* long time. I like the idea, though. – Dave Newton Jan 16 '13 at 15:07
  • I like this solution for randomly sampling a subset of lines, as long as the desired subset size is not very large. – cfeduke Jan 16 '13 at 15:18

Pre-process the input file and remember the offset of each new line. Use a BitSet to keep track of used lines. If you want to save some memory, then remember the offset of every 16th line; it is still easy to jump into the file and do a sequential lookup within a block of 16 lines.
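
A minimal sketch of what this could look like, assuming the full offset index fits in memory (the names and the path are illustrative, and IOException handling is omitted):

// Pre-scan the file once to record the byte offset of every line,
// then hand out random, not-yet-used lines, tracked in a BitSet.
RandomAccessFile raf = new RandomAccessFile("path-to-file", "r");
List<Long> offsets = new ArrayList<Long>();
offsets.add(0L);
long length = raf.length();
long pos = 0;
int b;
while ((b = raf.read()) != -1)
{
    pos++;
    if (b == '\n' && pos < length)
        offsets.add(pos); // the next line starts right after this newline
}
// (to save memory, store only every 16th offset and scan forward inside the block)

BitSet used = new BitSet(offsets.size());
Random rnd = new Random();
for (int remaining = offsets.size(); remaining > 0; remaining--)
{
    // pick a random line, then walk forward to the nearest unused one
    // (guaranteed to terminate, though not perfectly uniform)
    int idx = used.nextClearBit(rnd.nextInt(offsets.size()));
    if (idx >= offsets.size())
        idx = used.nextClearBit(0);
    used.set(idx);
    raf.seek(offsets.get(idx));
    String line = raf.readLine();
    // process(line) ...
}
raf.close();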

Marko Topolnik

Since you can pad the lines, I would do something along those lines; you should also note that even then, there is a limit to how many elements a List can actually hold (it is indexed by int, so at most Integer.MAX_VALUE entries).

Using a random number each time you want to read a line and adding it to a Set would also do; the approach below, however, ensures that the file is completely read:

import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class VeryLargeFileReading
    implements Iterator<String>, Closeable
{
    private static final Random RND = new Random();
    // Shuffled list of the byte offsets of all lines
    final List<Long> indices = new ArrayList<Long>();
    final RandomAccessFile fd;

    public VeryLargeFileReading(String fileName, long lineSize)
        throws IOException
    {
        fd = new RandomAccessFile(fileName, "r");
        long nrLines = fd.length() / lineSize;
        for (long i = 0; i < nrLines; i++)
            indices.add(i * lineSize);
        Collections.shuffle(indices, RND);
    }

    // Iterator methods
    @Override
    public boolean hasNext()
    {
        return !indices.isEmpty();
    }

    @Override
    public void remove()
    {
        // Nope
        throw new UnsupportedOperationException();
    }

    @Override
    public String next()
    {
        // Removing from the end of an ArrayList is O(1); since the list is
        // shuffled, the order is still random
        final long offset = indices.remove(indices.size() - 1);
        try {
            fd.seek(offset);
            return fd.readLine().trim();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() throws IOException
    {
        fd.close();
    }
}
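
For illustration, a usage sketch of the class above (the 512-byte line size is an assumption and must match whatever the lines were padded to):

VeryLargeFileReading reader = new VeryLargeFileReading("path-to-file", 512);
try {
    while (reader.hasNext())
    {
        String line = reader.next(); // each line is returned exactly once
        // process(line) ...
    }
} finally {
    reader.close();
}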
fge
  • one problem with this is that he still has to read the whole file, and since he said there may be billions, let's say even 1 billion, and longs are each 8 bytes, he would need at least 8GB of RAM just to hold the list –  Jan 15 '13 at 13:41
  • Yes, that is true. But retaining the state will require as much eventually, so... – fge Jan 15 '13 at 13:42
  • what do you mean by retaining the state? do you mean tracking the lines that have already been used? because a hashmap could track that without storing all of the indices in the file. –  Jan 15 '13 at 13:47
  • I mean retaining the offsets. No need for a map. Also, since the OP says lines can be padded, there is another solution. – fge Jan 15 '13 at 13:53
  • why do you need to retain the offsets of every line? He only wants a random subset of the lines. Plus, padding the lines seems like a bit of overkill for this. –  Jan 15 '13 at 13:59
  • I don't know whether only a subset of the lines need to be read, this is not mentioned, maybe you are right... – fge Jan 15 '13 at 14:03
  • he mentioned that he wanted to do a random distribution, which is a type of sampling technique. At least that was my impression –  Jan 15 '13 at 14:05
  • All lines must eventually be read, I realize that was not in the original question so I'll update accordingly. – cfeduke Jan 16 '13 at 15:15
  • OK then, my solution can do this but it requires storing all indices, and there may be billions of them, which is not nice. A PRNG algorithm would need to be found which cycles, so that the initial value can serve as a sentinel to detect when to stop. Maybe such a PRNG can be computed from the number of lines... – fge Jan 16 '13 at 15:40
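
A minimal sketch of the cycling-PRNG idea from the last comment, assuming padded fixed-length lines so a line number maps directly to a byte offset (raf, fileLength and lineSize are assumed to be set up elsewhere). A linear congruential generator with a power-of-two modulus, an odd increment, and a multiplier congruent to 1 mod 4 has full period, so it visits every value in [0, m) exactly once per cycle; values at or beyond the real line count are simply skipped:

// Visit every line number in [0, nrLines) exactly once, in pseudo-random
// order, with O(1) state instead of a shuffled list of offsets.
long nrLines = fileLength / lineSize;
long m = 4;
while (m < nrLines) m <<= 1;                   // smallest power of two >= nrLines
final long a = 6364136223846793005L;           // multiplier, a % 4 == 1
final long c = 1442695040888963407L;           // odd increment
long seed = new Random().nextLong() & (m - 1);
long x = seed;
do {
    x = (a * x + c) & (m - 1);                 // next value in the full cycle
    if (x < nrLines)
    {
        raf.seek(x * lineSize);
        String line = raf.readLine().trim();
        // process(line) ...
    }
} while (x != seed);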

If the number of files is truly arbitrary, it seems like there could be an associated issue with tracking processed files in terms of memory usage (or I/O time if tracking in files instead of a list or set). Solutions that keep a growing list of selected lines also run into timing-related issues.

I'd consider something along the lines of the following:

  1. Create n "bucket" files. n could be determined based on something that takes into account the number of files and system memory. (If n is large, you could generate a subset of n to keep open file handles down.)
  2. Each file's name is hashed, and goes into an appropriate bucket file, "sharding" the directory based on arbitrary criteria.
  3. Read in the bucket file contents (just filenames) and process as-is (randomness provided by the hashing mechanism), or pick rnd(n) and remove as you go, providing a bit more randomness (see the sketch after this list).
  4. Alternatively, you could pad and use the random access idea, removing indices/offsets from a list as they're picked.
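
A minimal sketch of steps 1–3, assuming the large listing file already exists and that the bucket files are acceptable scratch space (bucketCount and the file names are illustrative; IOException handling is omitted):

// Shard the listing into bucketCount files by hashing each path; each
// bucket can then be loaded, shuffled and processed independently.
int bucketCount = 256; // assumed; tune to file count and available memory
BufferedReader in = new BufferedReader(new FileReader("path-to-file"));
BufferedWriter[] buckets = new BufferedWriter[bucketCount];
for (int i = 0; i < bucketCount; i++)
    buckets[i] = new BufferedWriter(new FileWriter("bucket-" + i + ".txt"));

String path;
while ((path = in.readLine()) != null)
{
    int bucket = (path.hashCode() & 0x7fffffff) % bucketCount;
    buckets[bucket].write(path);
    buckets[bucket].newLine();
}
in.close();
for (BufferedWriter w : buckets)
    w.close();

// Each bucket is now small enough to shuffle in memory, e.g.:
// List<String> lines = ...read one bucket...;
// Collections.shuffle(lines);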
Dave Newton
  • Discussed this with Dave off of SO; when he's referring to files, that's going back to the root of the problem, where I am creating the arbitrarily large file and each line represents a relative path to a filename - though his solution would work for any string. (Replace `lines` with `files` in his answer and it applies.) – cfeduke Jan 16 '13 at 15:11
  • Err... `files` with `lines`... and then it applies generically. – cfeduke Jan 16 '13 at 15:17