
I am looking to pull certain groups of lines from large (~870,000,000-line / ~4 GB) text files. As a small example, in a 50-line file I might want lines 3-6, 18-27, and 39-45. Starting from SO and writing some programs to benchmark against my data, Fortran 90 has given me the best results so far (as compared with Python, shell commands (bash), etc.).

My current scheme is simply to open the file and use a series of loops to advance the read position to where I need it, writing the wanted lines to an output file.

With the above small example this would look like:

    open(unit=1,file=fileName)    ! FILE= was missing from both OPEN statements
    open(unit=2,file=outFile)

    do i=1,2
      read(1,*)                   ! read and discard unwanted records
    end do
    do i=3,6
      read(1,'(A)') line          ! (A) keeps the whole record; list-directed (*)
      write(2,'(A)') trim(line)   ! would split the line at blanks/commas
    end do
    do i=7,17
      read(1,*)
    end do
    do i=18,27
      read(1,'(A)') line
      write(2,'(A)') trim(line)
    end do
    do i=28,38
      read(1,*)
    end do
    do i=39,45
      read(1,'(A)') line
      write(2,'(A)') trim(line)
    end do

*It should be noted that I am assuming buffered I/O when compiling, although this seems to speed things up only minimally.

I am curious whether this is the most efficient way to accomplish my task. If the above is in fact the best way to do this in Fortran 90, is there another language better suited to the task?

*Update: Made sure I was using buffered I/O, manually finding the most efficient blocksize/blockcount. That increased speed by about 7%. I should note that the files I am working with do not have a fixed record length.
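For reference, this tuning is done through compiler-specific extensions rather than standard Fortran; below is a minimal sketch, assuming Intel-style BUFFERED=, BLOCKSIZE= and BUFFERCOUNT= specifiers on OPEN. The file name and the values are only placeholders, and other compilers expose buffering through flags or environment variables instead.

    ! Sketch of compiler-level buffering hints (Intel-style OPEN extensions,
    ! not standard Fortran). The values shown are illustrative only.
    program buffered_read_sketch
      implicit none
      integer :: in_unit, ios
      character(len=512) :: line

      open(newunit=in_unit, file='in.txt', action='read', status='old', &
           buffered='yes', blocksize=1048576, buffercount=8)

      do
        read(in_unit, '(A)', iostat=ios) line
        if (ios /= 0) exit          ! stop at end of file (or on a read error)
        ! ... decide whether to keep this record ...
      end do

      close(in_unit)
    end program buffered_read_sketch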

ben
  • The fastest easy way to read a large text file? Put it on a really fast disk system, and don't shoot yourself in the foot by writing slow code to read it. There are more complex solutions, but they're very OS-dependent, and the actual IO patterns are very hardware-dependent because the *fastest* ways usually completely bypass the page cache. For what it's worth, `mmap()` is usually a very bad choice when you just want to stream a file from beginning to end without ever rereading any part of it. `mmap()` works best when you have to do numerous re-reads from random locations throughout the file. – Andrew Henle Jan 05 '17 at 00:11
  • Maybe using a `STREAM` access would be faster than skipping specific lines in a sequential way. – Coriolis Jan 05 '17 at 09:31
  • Thank you, I'll have a look at how this might work using stream i/o. – ben Jan 06 '17 at 02:59
  • with stream access you would need to parse the file looking for end of line markers, I don't see how that helps.. ( Unless maybe if all lines are exactly the same length. ) – agentp Jan 09 '17 at 22:43
  • Yeah, I came to the same conclusion that stream access wouldn't help. Unfortunately the entries are not of equal length so that cuts out some easy solutions. Ended up sticking with a scheme similar to my original above. There was a solution in C++: http://stackoverflow.com/questions/26736742/efficiently-reading-a-very-large-text-file-in-c that looked somewhat promising, but I'd rather just give my code extra time than learn C++ for this one task. – ben Jan 16 '17 at 23:22

2 Answers


You can also try the sed utility.

    sed '3,6!d' yourfile.txt
    sed '18,27!d' yourfile.txt

Unix utilities tend to be heavily optimized and solve simple tasks like this very quickly.

  • Thanks for the response! I did attempt a sed solution (i.e. `sed -n '3,6p;18,27p;39,45p;45q' fileName`) and it seems to take roughly double the time of my Fortran 90 solution no matter how it is structured (awk is similar). It would matter less if the files weren't so large but I'm looking for the fastest possible solution, even if it is a bit more complex. – ben Jan 04 '17 at 23:10
  • Yeah, I get what you're trying to do, I've also had some similar task. Anyway, these utils are always worth a try. Also, you might try to implement this thing in C. Read exactly the block size of your fs, scan manually for linefeed, carefully (!) count them, and stitch buffers together. And take care that processing a 10G file with no linefeed in it won't crash your machine. – Andrey Stolbovsky Jan 04 '17 at 23:38
  • Just in case you didn't think of that - if you have access to the code that is producing those large files, you can dig around there, and maybe prepare what you need there. – Andrey Stolbovsky Jan 04 '17 at 23:42
  • Also, if by chance your lines are all of the same length, then you're super lucky and you can just seek to the lines you need manually; this should definitely be faster than anything else. But I guess they're of different lengths... – Andrey Stolbovsky Jan 04 '17 at 23:43
  • Anyway, looking closer at it, I think your current solution could definitely be improved if you read larger blocks and process the lines yourself. – Andrey Stolbovsky Jan 04 '17 at 23:45
  • Units 1 and 2 should be avoided... I would make unit 1 become unit 11 and unit 2 become unit 12. Some CLOSE, PROGRAM, and END statements can help make this an actual program, but the idea is correct. I would probably use an array of LOGICALs for the lines you want to move and an end line to exit on, and then you have one loop. – Holmz Jan 05 '17 at 17:23
  • Thank you both for the input. @AndreyStolbovsky I'm not too experienced with C, but I could certainly take a stab at that method if need be. Is there a way to read in larger blocks as you suggested, but in Fortran? (A rough sketch of this is included after these comments.) – ben Jan 06 '17 at 02:57
  • The only Fortran I know is Professor Fortran from the book of my childhood, so, sorry, no :) But I think there should be the way, though. – Andrey Stolbovsky Jan 06 '17 at 06:23
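For reference, here is a rough, untested sketch of the block-reading idea from the comments: open the file with `ACCESS='STREAM'`, read large fixed-size chunks into a character buffer, count newline characters yourself to track the current line number, and copy out only the byte ranges belonging to wanted lines. The file names, chunk size, and line ranges below are placeholders taken from the 50-line example in the question; whether this actually beats the plain record-by-record loop would have to be benchmarked.

    ! Sketch: extract selected line ranges by reading large chunks with
    ! ACCESS='STREAM' and scanning for newlines ourselves.
    ! Assumes Unix line endings; names, sizes and ranges are placeholders.
    program stream_extract_sketch
      implicit none
      integer, parameter :: ik = selected_int_kind(18)  ! 64-bit byte counts
      integer, parameter :: chunk = 1048576             ! 1 MiB per read
      integer, parameter :: last_wanted = 45            ! last line we need
      logical :: want(last_wanted)
      character(len=chunk) :: buf
      integer :: in_unit, out_unit, nread, i, lineno, seg_start
      integer(ik) :: fsize, pos

      want = .false.                ! mark the line ranges to keep
      want(3:6)   = .true.
      want(18:27) = .true.
      want(39:45) = .true.

      inquire(file='in.txt', size=fsize)                ! file size in bytes

      open(newunit=in_unit,  file='in.txt',  access='stream', &
           form='unformatted', action='read',  status='old')
      open(newunit=out_unit, file='out.txt', access='stream', &
           form='unformatted', action='write', status='replace')

      lineno = 1
      pos = 0
      do while (pos < fsize .and. lineno <= last_wanted)
        nread = int(min(int(chunk, ik), fsize - pos))
        read(in_unit) buf(1:nread)                      ! next chunk of bytes
        pos = pos + nread

        seg_start = 0               ! start of a pending run of wanted bytes
        do i = 1, nread
          if (want(lineno) .and. seg_start == 0) seg_start = i
          if (buf(i:i) == new_line('a')) then
            if (seg_start > 0) then
              write(out_unit) buf(seg_start:i)          ! copy run incl. newline
              seg_start = 0
            end if
            lineno = lineno + 1
            if (lineno > last_wanted) exit              ! nothing more to copy
          end if
        end do
        ! a wanted line that continues into the next chunk: flush what we have
        if (seg_start > 0) write(out_unit) buf(seg_start:nread)
      end do

      close(in_unit)
      close(out_unit)
    end program stream_extract_sketch

If the records did happen to have a fixed length, the same stream unit would let you compute the byte offset of each wanted range directly (with `POS=` on the READ) and skip the scanning entirely.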
-2

One should be able to do this in most any language, so, sticking with the theme, here is something that should be close to working if you fix up the typos. (If I had a Fortran compiler on an iPad, that would make it more useful.)

    PROGRAM AA
    IMPLICIT NONE
    INTEGER :: In_Unit, Out_Unit, I, ios
    LOGICAL, DIMENSION(1000) :: DoIt
    CHARACTER(LEN=20) :: FileName = 'in.txt'
    CHARACTER(LEN=20) :: OutFile  = 'out.txt'
    CHARACTER(LEN=80) :: Line

    OPEN(NEWUNIT=In_Unit,  FILE=FileName, ACTION='READ',  STATUS='OLD')
    OPEN(NEWUNIT=Out_Unit, FILE=OutFile,  ACTION='WRITE', STATUS='REPLACE')

    DoIt        = .FALSE.
    DoIt(3:6)   = .TRUE.
    DoIt(18:27) = .TRUE.
    DoIt(39:45) = .TRUE.

    DO I = 1, 1000
      READ(In_Unit, '(A)', IOSTAT=ios) Line      ! (A) reads the whole record
      IF (ios /= 0) EXIT                         ! stop at end of file
      IF (DoIt(I)) WRITE(Out_Unit, '(A)') TRIM(Line)
    END DO

    CLOSE(In_Unit)
    CLOSE(Out_Unit)

    END PROGRAM AA
Holmz
  • This is definitely not faster than what OP has proposed, and that's specifically what he asked about. First, you're embedding a conditional inside a loop, and you're also reading *and parsing* each line, whereas OP's code only reads and parses the lines he's interested in. Downvote from me. – Ross Jan 05 '17 at 21:17
  • @Ross I suppose you have a point. In the case of the example, with a few lines at the front of a 4GB file, the sequential reading should not make a difference. On my work machine with some RAID on it, a 4GB file gets buffered in about 15 seconds. But that is using a read into an array of line(), and then trivial conditionals are performed in a loop. One would really need to look at the read performance separately from the conditionals. So I think you may be getting at (or hinting at) non-sequential, non-streamed input as "THE FASTEST"? But the post also seems to ask about basic IO. – Holmz Jan 07 '17 at 19:22