How does one read and split/chunk a file by the number of lines?

I would like to partition a file into separate buffers, while ensuring that a line is not split up between two or more buffers. I plan on passing these buffers into their own pthreads so they can perform some type of simultaneous/asynchronous processing.

I've read the answer to reading and writing in chunks on linux using c, but I don't think it exactly answers the question of making sure that a line is not split across two or more buffers.

Ken
  • These kinds of schemes just never work out as intended. You've got a processor with multiple cores, a nice way to make multiple threads productive. But you still only have one disk; the threads are just waiting for a turn to read from it. – Hans Passant Nov 21 '12 at 00:45
  • @HansPassant: If the OP knows the task is CPU bound, that may not end up being the case. But yes, you are most likely correct. Though `pbzip` and `xz` both use this technique on a block level to great effect. – Omnifarious Nov 21 '12 at 01:27
  • I actually didn't think about that. I've since thought of a cleaner way of solving my macro problem that doesn't involve using the question I posed. I'm still curious as to what the answer is, though! – Ken Nov 21 '12 at 08:02
  • Even if one has a single disk (as opposed to having a raid-0 array, for example), the IO of one thread can be interleaved with the CPU processing of another thread. These kinds of schemes have been working well for the past 50 or so years, even on a single CPU. – chill Nov 21 '12 at 11:40
  • Wanted to follow up on this question. I'm re-exploring the possibility of chunking my file and passing it to N threads for asynchronous processing. Would it be performant if I were to read the file line by line and pass each line to a thread? I suspect this would also be file IO bound. – Ken Nov 28 '12 at 20:02

3 Answers


How is the file encoded? If each byte represents a character, I would do the following:

  1. Memory map the file using mmap().
  2. Tell the jobs their approximate start and end by computing it based on an appropriate chunk size.
  3. Have each job find its actual start and end by finding the next '\n'.
  4. Process the respective chunks concurrently.
  5. Note that the first chunk needs special treatment because its start isn't approximate but exact (see the sketch below).
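
A minimal sketch of that scheme, with error handling omitted; the names process_file and process_chunk are made up, and the worker just counts lines as a stand-in for real per-line work:

#include <algorithm>      // std::find, std::count
#include <fcntl.h>        // open
#include <sys/mman.h>     // mmap, munmap
#include <sys/stat.h>     // fstat
#include <thread>
#include <unistd.h>       // close
#include <vector>

// Each worker snaps its approximate [begin, end) range to actual line
// boundaries, then works on whole lines only (here: just counts them).
static void process_chunk(const char *approx_begin, const char *approx_end,
                          const char *data, const char *eof)
{
   // Actual start: just past the next '\n' at or after approx_begin,
   // except that the very first chunk starts exactly at the file start.
   const char *begin = approx_begin;
   if (approx_begin != data) {
      const char *first_nl = std::find(approx_begin, eof, '\n');
      begin = (first_nl == eof) ? eof : first_nl + 1;
   }
   // Actual end: just past the next '\n' at or after approx_end (or EOF).
   const char *last_nl = std::find(approx_end, eof, '\n');
   const char *end = (last_nl == eof) ? eof : last_nl + 1;

   (void)std::count(begin, end, '\n');   // stand-in for real per-line work
}

// Map the file and hand each thread an approximate range based on an even
// chunk size; the workers adjust their own boundaries.
void process_file(const char *path, unsigned nthreads)
{
   const int fd = ::open(path, O_RDONLY);
   struct stat st;
   ::fstat(fd, &st);
   const char *data = static_cast<const char *>(
       ::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
   const char *const eof = data + st.st_size;

   const off_t chunk = st.st_size / nthreads;
   std::vector<std::thread> workers;
   for (unsigned i = 0; i < nthreads; ++i) {
      const char *approx_begin = data + static_cast<off_t>(i) * chunk;
      const char *approx_end = (i + 1 == nthreads)
                                   ? eof
                                   : data + static_cast<off_t>(i + 1) * chunk;
      workers.emplace_back(process_chunk, approx_begin, approx_end, data, eof);
   }
   for (auto &t : workers) {
      t.join();
   }
   ::munmap(const_cast<char *>(data), st.st_size);
   ::close(fd);
}

Because consecutive workers snap to the same newline, the adjusted ranges neither overlap nor leave gaps.
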
Dietmar Kühl

I would choose a chunk size in bytes. Then I would seek to the appropriate location in the file and read some smallish number of bytes at a time until I got a newline.

The first chunk's last character is the newline. The second chunk's first character is the character after the newline.

Always seek to a page-size boundary (see getpagesize() or sysconf(_SC_PAGESIZE)) and read a page-size worth of bytes at a time when searching for your newline. This tends to ensure that you pull only the minimum necessary from disk to find your boundaries. You could read a smaller amount, say 128 bytes at a time, but then you risk making more system calls.
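
For instance, rounding the seek offset down to a page boundary is just modular arithmetic on that value (a minimal sketch; page_align is a made-up helper name):

#include <sys/types.h>    // off_t
#include <unistd.h>       // sysconf

// Round a desired file offset down to the nearest page boundary so the reads
// used to hunt for the newline start page-aligned (sketch only).
off_t page_align(off_t offset)
{
   const off_t pagesize = static_cast<off_t>(::sysconf(_SC_PAGESIZE));
   return offset - (offset % pagesize);
}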

I wrote an example program that does this for letter frequency counting. This, of course, is largely pointless to split across threads, as it's almost certainly IO bound. It also doesn't matter where the newlines are, because the task isn't line oriented. But it's just an example. Also, it relies heavily on your having a reasonably complete C++11 implementation.

The key function is this:

// Headers needed for this excerpt.
#include <algorithm>      // ::std::find
#include <cerrno>         // errno
#include <cstddef>        // ::std::size_t
#include <system_error>   // ::std::system_error
#include <unistd.h>       // ::pread, ::ssize_t, off_t

// Find the offset of the next newline given a particular desired offset.
off_t next_linestart(int fd, off_t start)
{
   using ::std::size_t;
   using ::ssize_t;
   using ::pread;

   const size_t bufsize = 4096;
   char buf[bufsize];

   for (bool found = false; !found;) {
      const ssize_t result = pread(fd, buf, bufsize, start);
      if (result < 0) {
         throw ::std::system_error(errno, ::std::system_category(),
                                   "Read failure trying to find newline.");
      } else if (result == 0) {
         // End of file
         found = true;
      } else {
         const char * const nl_loc = ::std::find(buf, buf + result, '\n');
         if (nl_loc != (buf + result)) {
            // Found a newline; return the offset just past it.
            start += ((nl_loc - buf) + 1);
            found = true;
         } else {
            // No newline in this block; keep scanning from where it ended.
            start += result;
         }
      }
   }
   return start;
}

Also notice that I use pread. This is absolutely essential when you have multiple threads reading from different parts of the file.

The file descriptor is a shared resource between your threads. When one thread reads from the file using ordinary functions it alters a detail about this shared resource, the file pointer. The file pointer is the position in the file at which the next read will occur.

Simply using lseek before you read each time does not help this because it introduces a race condition between the lseek and the read.

The pread function allows you to read a given number of bytes from a specific offset within the file without altering the file pointer at all. Apart from leaving the file pointer untouched, it's like combining an lseek and a read in a single call.
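
For example, the per-thread chunk boundaries might be computed with next_linestart() like this (a sketch only; split_into_threads and process_range are hypothetical names, and the worker just counts newlines as a stand-in for real line processing):

#include <algorithm>      // ::std::count, ::std::min
#include <cstddef>        // ::std::size_t
#include <sys/stat.h>     // fstat
#include <thread>
#include <unistd.h>       // ::pread
#include <vector>

// Hypothetical worker: reads its [start, end) range with pread() and, as a
// stand-in for real work, just counts the newlines it sees.
void process_range(int fd, off_t start, off_t end)
{
   char buf[4096];
   long lines = 0;
   while (start < end) {
      const ::std::size_t want = ::std::min<off_t>(end - start, sizeof(buf));
      const ssize_t got = ::pread(fd, buf, want, start);
      if (got <= 0) break;            // error or unexpected end of file
      lines += ::std::count(buf, buf + got, '\n');
      start += got;
   }
   (void)lines;                       // result unused in this sketch
}

// Split the file into roughly equal chunks that end on line boundaries and
// hand each chunk to its own thread.
void split_into_threads(int fd, unsigned nthreads)
{
   struct stat st;
   ::fstat(fd, &st);

   ::std::vector< ::std::thread> workers;
   off_t start = 0;                   // the first chunk starts exactly at 0
   for (unsigned i = 1; i <= nthreads; ++i) {
      const off_t approx_end = (st.st_size / nthreads) * i;
      // Snap the boundary forward to the first byte after a newline.
      const off_t end = (i == nthreads) ? st.st_size
                                        : next_linestart(fd, approx_end);
      workers.emplace_back(process_range, fd, start, end);
      start = end;
   }
   for (auto &t : workers) {
      t.join();
   }
}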

Omnifarious
  • How do you know when you hit a newline? – Ken Nov 21 '12 at 08:00
  • @Ken: You scan the buffer you read for a newline character. Keep a note of the file offset of the beginning of the buffer (where you lseek'ed to) and the offset within the buffer of the newline. Add the two and you get the file offset for a newline that's close to your chunk size. – Omnifarious Nov 21 '12 at 08:06
  • @Ken - I wrote you a sample program using my technique, mostly. I don't go to the trouble of trying to make sure file accesses are page aligned. If you really wanted to make things as fast as possible, you would make sure that all file access was page aligned where possible, and that the buffers themselves were page aligned too. This allows the OS to optimize reads by simply mapping the buffer into your process' address space rather than copying. – Omnifarious Nov 22 '12 at 20:06

Define a class for the buffers. Give each one a large buffer space that is some multiple of the page size, a start/end index, a method that reads into the buffer space from a passed-in stream, and a 'lineParse' method that takes another buffer* instance as a parameter.

Make some buffer* instances and store them on a producer-consumer pool queue. Open the file, get a buffer from the pool and read into its buffer space from start to end (returning a boolean for error/EOF). Get another buffer* from the pool and pass it into the lineParse() of the earlier one. In there, search backwards from the end of the data, looking for a newline. When one is found, adjust the end index and memcpy the fragment of the last line, if there is one (you might occasionally be lucky :), into the new, passed buffer and set its start index. The first buffer now holds only whole lines and can be queued off to the thread/s that will process the lines. The second buffer holds the fragment of a line copied from the first, and more data can be read from disk into its buffer space at its start index.

The line-processing thread/s can recycle the 'used' buffers back to the pool.

Keep going until EOF (or error :).

If you can, add a method to the buffer class that does the processing of the buffer.
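
A rough sketch of what such a class might look like (the names, the buffer size, and the surrounding pool/queue machinery are placeholders, not code from this answer):

#include <cstddef>
#include <cstring>        // std::memcpy
#include <istream>

// Sketch of the buffer class described above. The producer-consumer pool and
// the queue to the processing thread/s are assumed to exist elsewhere.
class buffer {
public:
   static const std::size_t space = 64 * 4096;   // some multiple of page size

   // Read from 'in' into the free region after 'end'; returns false when
   // nothing was read (EOF or error), true if there is something to parse.
   bool load(std::istream &in)
   {
      in.read(data_ + end, static_cast<std::streamsize>(space - end));
      end += static_cast<std::size_t>(in.gcount());
      return in.gcount() > 0;
   }

   // Search backwards from 'end' for the last newline. Move the trailing
   // partial line (if any) into 'next', which becomes the buffer the next
   // disk read appends to; this buffer is left holding whole lines only.
   void lineParse(buffer *next)
   {
      std::size_t nl = end;
      while (nl > start && data_[nl - 1] != '\n') {
         --nl;
      }
      const std::size_t fragment = end - nl;
      std::memcpy(next->data_, data_ + nl, fragment);
      next->start = 0;
      next->end = fragment;
      end = nl;
   }

   std::size_t start = 0;   // index of the first valid byte
   std::size_t end = 0;     // one past the last valid byte

private:
   char data_[space];
};

The disk-read thread would then call load() on the passed buffer to append fresh data after the copied fragment, queue the first buffer off for processing, and repeat.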

Using large buffer classes and parsing back from the end will be more efficient than continually reading small bits, looking for newlines from the start. Inter-thread comms is slow, and the larger the buffers you can pass, the better.

Using a pool of buffers eliminates continual new/delete and provides flow-control - if the disk read thread is faster than the processing, the pool will empty and the disk read thread will block on it until some used buffers are recycled. This prevents memory runaway.
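
A pool with exactly that blocking behaviour can be very small (a sketch using the buffer class above; the name buffer_pool is made up, and any bounded blocking queue would do):

#include <condition_variable>
#include <deque>
#include <mutex>

// Minimal blocking pool of buffer*. get() blocks while the pool is empty,
// which is what throttles the disk-read thread when it outruns the
// processing thread/s.
class buffer_pool {
public:
   void put(buffer *b)
   {
      {
         std::lock_guard<std::mutex> lock(mutex_);
         pool_.push_back(b);
      }
      not_empty_.notify_one();
   }

   buffer *get()
   {
      std::unique_lock<std::mutex> lock(mutex_);
      not_empty_.wait(lock, [this] { return !pool_.empty(); });
      buffer *b = pool_.front();
      pool_.pop_front();
      return b;
   }

private:
   std::mutex mutex_;
   std::condition_variable not_empty_;
   std::deque<buffer *> pool_;
};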

Note that if you use more than one processing thread, the buffers may get processed 'out-of-order' - this may, or may not, matter.

You can only gain in this scenario by ensuring that the advantage of lines being processed in parallel with disk-read latencies is greater than the overhead of inter-thread comms - communicating small buffers between threads is very likely to be counter-productive.

The biggest speedup would be experienced with networked disks that are fast overall, but have large latencies.

Martin James