Fortran : check data format while reading data with 'read'

Question

I have specific dataformat, say 'n' (arbitrary) row and '4' columns. If 'n' is '10', the example data would go like this.

 1.01e+00 -2.01e-02 -3.01e-01    4.01e+02
 1.02e+00 -2.02e-02 -3.02e-01    4.02e+02
 1.03e+00 -2.03e-02 -3.03e-01    4.03e+02
 1.04e+00 -2.04e-02 -3.04e-01    4.04e+02
 1.05e+00 -2.05e-02 -3.05e-01    4.05e+02
 1.06e+00 -2.06e-02 -3.06e-01    4.06e+02
 1.07e+00 -2.07e-02 -3.07e-01    4.07e+02
 1.08e+00 -2.08e-02 -3.08e-01    4.07e+02
 1.09e+00 -2.09e-02 -3.09e-01    4.09e+02
 1.10e+00 -2.10e-02 -3.10e-01    4.10e+02

Constraints in building this input would be

data should have '4' columns.
data separated by white spaces.

I want to implement a feature to check whether the input file has '4' columns in every row, and built my own based on the 'M.S.B's answer in the post Reading data file in Fortran with known number of lines but unknown number of entries in each line.

program readtest

  use :: iso_fortran_env

  implicit none

  character(len=512)     :: buffer

  integer                :: i, i_line, n, io, pos, pos_tmp, n_space
  integer,parameter      :: max_len = 512
  character(len=max_len) :: filename

  filename = 'data_wrong.dat'

  open(42, file=trim(filename), status='old', action='read')

  print *, '+++++++++++++++++++++++++++++++++++'
  print *, '+ Count lines                     +'
  print *, '+++++++++++++++++++++++++++++++++++'
  n       = 0
  i_line  = 0
  do
    pos     = 1
    pos_tmp = 1

    i_line = i_line+1
    read(42, '(a)', iostat=io) buffer

(*1)! Count blank spaces.
    n_space = 0
    do
      pos = index(buffer(pos+1:), " ") + pos
      if (pos /= 0) then
        if (pos > pos_tmp+1) then
          n_space = n_space+1
          pos_tmp = pos
        else
          pos_tmp = pos
        end if
      endif
      if (pos == max_len) then
        exit
      end if
    end do
    pos_tmp = pos

    if (io /= 0) then
      exit
    end if

    print *, '> line : ', i_line, ' n_space : ', n_space

    n = n+1
  end do

  print *, ' >> number of line = ', n

end program

If I run the above program with a input file with some wrong rows like follows,

1.01e+00 -2.01e-02 -3.01e-01  4.01e+02
1.02e+00 -2.02e-02 -3.02e-01  4.02e+02
1.03e+00 -2.03e-02 -3.03e-01  4.03e+02
1.04e+00 -2.04e-02 -3.04e-01  4.04e+02
1.05e+00 -2.05e-02 -3.05e-01  4.05e+02
1.06e+00 -2.06e-02 -3.06e-01  4.06e+02
1.07e+00 -2.07e-02 -3.07e-01  4.07e+02
1.0      2.0       3.0
1.08e+00 -2.08e-02 -3.08e-01  4.07e+02  1.00
1.09e+00 -2.09e-02 -3.09e-01  4.09e+02
1.10e+00 -2.10e-02 -3.10e-01  4.10e+02

The output is like this,

 +++++++++++++++++++++++++++++++++++
 + Count lines                     +
 +++++++++++++++++++++++++++++++++++
 > line :            1  n_space :            4
 > line :            2  n_space :            4
 > line :            3  n_space :            4
 > line :            4  n_space :            4
 > line :            5  n_space :            4
 > line :            6  n_space :            4
 > line :            7  n_space :            4
 > line :            8  n_space :            3   (*2)
 > line :            9  n_space :            5   (*3)
 > line :           10  n_space :            4
 > line :           11  n_space :            4
  >> number of line =           11

And you can see that the wrong rows are properly detected as I intended (see (*2) and (*3)), and I can write 'if' statements to make some error messages.

But I think my code is 'extremely' ugly since I had to do something like (*1) in the code to count consecutive white spaces as one space. I think there would be much more elegant way to ensure the rows contain only '4' column each, say,

read(*,'4(X, A)') line

(which didn't work)

And also my program would fail if the length of 'buffer' exceeds 'max_len' which is set to '512' in this case. Indeed '512' should be enough for most practical purposes, I also want my checking subroutine to be robust in this way.

So, I want to improve my subroutine in at least these aspects

Want it to be more elegant (not as (*1))
Be more general (especially in regards to 'max_len')

Does anyone has some experience in building this kind of input-checking subroutine ??

Any comments would be highly appreciated.

Thank you for reading the question.

luco00 · Accepted Answer · 2021-07-08T18:49:21.890

Without knowledge of the exact data format, I think it would be rather difficult to achieve what you want (or at least, I wouldn't know how to do it).

In the most general case, I think your space counting idea is the most robust and correct. It can be adapted to avoid the maximum string length problem you describe.

In the following code, I go through the data as an unformatted, stream access file. Basically you read every character and take note of new_lines and spaces. As you did, you use spaces to count to columns (skipping double spaces) and new_line characters to count the rows. However, here we are not reading the entire line as a string and going through it to find spaces; we read char by char, avoiding the fixed string length problem and we also end up with a single loop. Hope it helps.

EDIT: now handles white spaces at beginning at end of line and empty lines

program readtest

  use :: iso_fortran_env

  implicit none

  character              :: old_char, new_char
  integer                :: line, io, cols
  logical                :: beg_line
  integer,parameter      :: max_len = 512
  character(len=max_len) :: filename

  filename = 'data_wrong.txt'

  ! Output format to be used later
  100 format (a, 3x, i0, a, 3x , i0)

  open(42, file=trim(filename), status='old', action='read', &
      form="unformatted", access="stream")

  ! set utils
  old_char = " "
  line = 0
  beg_line = .true.
  cols = 0

  ! Start scannig char by char
  do
     read(42, iostat = io) new_char

     ! Exit if EOF
     if (io < 0) then
         exit
     end if

     ! Deal with empty lines
     if  (beg_line .and. new_char==new_line(new_char)) then
         line = line + 1
         write(*, 100, advance="no") "Line number:", line, &
             "; Columns: Number", cols
         write(*,'(6x, a5)') "EMPTYLINE"

     ! Deal with beginning of line for white spaces
     elseif  (beg_line) then
         beg_line = .false.

     ! this indicates new columns
     elseif (new_char==" " .and. old_char/=" ") then
         cols = cols + 1

     ! End of line: time to print
     elseif (new_char==new_line(new_char)) then
         if (old_char/=" ") then
             cols = cols+1
         endif
         line = line + 1

         ! Printing out results
         write(*, 100, advance="no") "Line number:", line, &
             "; Columns: Number", cols
         if (cols == 4) then
             write(*,'(6x, a5)') "OK"
         else
             write(*,'(6x, a5)') "ERROR"
         end if

         ! Restart with a new line (reset counters)
         cols = 0
         beg_line = .true.
     end if
     old_char = new_char
  end do
end program

This is the output of this program:

Line number:   1; Columns number:   4         OK
Line number:   2; Columns number:   4         OK
Line number:   3; Columns number:   4         OK
Line number:   4; Columns number:   4         OK
Line number:   5; Columns number:   4         OK
Line number:   6; Columns number:   4         OK
Line number:   7; Columns number:   4         OK
Line number:   8; Columns number:   3      ERROR
Line number:   9; Columns number:   5      ERROR
Line number:   10; Columns number:   4         OK
Line number:   11; Columns number:   4         OK

If you knew your data format, you could read your lines in a vector of dimension 4 and use iostat variable to print out an error on each line where iostat is an integer greater than 0.

Instead of space counting, after finding the first space, one can use `buf = adjustl(buf(pos+1:))` to eliminate any additional space in the substring. — steve, Jul 08 '21 at 17:18
Your code is really interesting! However, it does require to use formatted string and a sufficiently large buffer. You also need to count spaces, basically trimming extra spaces. In terms of efficiency I am not sure which approach is better. — luco00, Jul 08 '21 at 17:54
Well, OP wants to determine that there are 4 columns, so one needs to at least count 3 spaces in each line. Yes, one needs to specify a sufficient buffer length. Fortran currently does not have an allocate/reallocate mechanism for IO; although I believe J3 may be considering such a feature. — steve, Jul 08 '21 at 18:06
This was exactly my point, if you go char by char you do not need to specify a buffer length. Also, based on your comments I added a check for white spaces at beginning and end of line. Now should be more robust. — luco00, Jul 08 '21 at 18:26
@luco00, Thank you. While it needs more time than mine, possibly since your code run into 'if' statement with every characters, it shouldn't be that concern if I'm not reading tones of input files. Thank you for helping me :) — Sangjun Lee, Jul 09 '21 at 05:40
Glad it helped! I think we can speed up a bit if you assume something about the data (e.g. no trimming required or known data format) — luco00, Jul 09 '21 at 07:31

score 1 · Answer 2 · answered Jul 08 '21 at 17:37

Instead of counting whitespace you can use manipulation of substrings to get what you want. A simple example follows:

program foo

  implicit none

  character(len=512) str    ! Assume str is sufficiently long buffer
  integer fd, cnt, m, n

  open(newunit=fd, file='test.dat', status='old')

  do
     cnt = 0
     read(fd,'(A)',end=10) str
     str = adjustl(str)      ! Eliminate possible leading whitespace
     do 
        n = index(str, ' ')  ! Find first space
        if (n /= 0) then
           write(*, '(A)', advance='no') str(1:n)
           str = adjustl(str(n+1:))
        end if
        if (len_trim(str) == 0) exit    ! Trailing whitespace
        cnt = cnt + 1
     end do
     if (cnt /= 3) then
        write(*,'(A)') '   Error'
     else
        write(*,*)
     end if
  end do

10 close(fd) 

end program foo

urbanjost · Answer 3 · 2021-08-06T03:07:14.797

this should read any line of reasonable length (up to the line limit your compiler defaults to, which is generally 2GB now-adays). You could change it to stream I/O to have no limit but most Fortran compilers have trouble reading stream I/O from stdin, which this example reads from. So if the line looks anything like a list of numbers it should read them, tell you how many it read, and let you know if it had an error reading any value as a number (character strings, strings bigger than the size of a REAL value, ....). All the parts here are explained on the Fortran Wiki, but to keep it short this is a stripped down version that just puts the pieces together. The oddest behavior it would have is that if you entered something like this with a slash in it

10 20,,30,40e4    50 / this is a list of numbers

it would treat everything after the slash as a comment and not generate a non-zero status return while returning five values. For a more detailed explanation of the code I think the annotated pieces on the Wiki explain how it works. In the search, look for "getvals" and "readline".

So with this program you can read a line and if the return status is zero and the number of values read is four you should be good except for a few dusty corners where the lines would definitely not look like a list of numbers.

module M_getvals
private
public getvals, readline
implicit none
contains
subroutine getvals(line,values,icount,ierr)
character(len=*),intent(in)     :: line
real                            :: values(:)
integer,intent(out)             :: icount, ierr
character(len=:),allocatable    :: buffer
character(len=len(line))        :: words(size(values))
integer                         :: ios, i
   ierr=0
   words=' '                            
   buffer=trim(line)//"/"               
   read(buffer,*,iostat=ios) words      
   icount=0
   do i=1,size(values)                 
      if(words(i).eq.'') cycle
      read(words(i),*,iostat=ios)values(icount+1)
      if(ios.eq.0)then
         icount=icount+1
      else
         ierr=ios
         write(*,*)'*getvals* WARNING:['//trim(words(i))//'] is not a number'
      endif
   enddo
end subroutine getvals
subroutine readline(line,ier)
character(len=:),allocatable,intent(out) :: line
integer,intent(out)                      :: ier
integer,parameter                        :: buflen=1024
character(len=buflen)                    :: buffer
integer                                  :: last, isize
   line=''
   ier=0
   INFINITE: do
      read(*,iostat=ier,fmt='(a)',advance='no',size=isize) buffer
      if(isize.gt.0)line=line//buffer(:isize)
      if(is_iostat_eor(ier))then
         last=len(line)
         if(last.ne.0)then
            if(line(last:last).eq.'\\')then
               line=line(:last-1)
               cycle INFINITE
            endif
         endif
         ier=0
         exit INFINITE
     elseif(ier.ne.0)then
        exit INFINITE
     endif
   enddo INFINITE
   line=trim(line)
end subroutine readline
end module M_getvals
program tryit
use M_getvals, only: getvals, readline
implicit none
character(len=:),allocatable :: line
real,allocatable             :: values(:)
integer                      :: icount, ier, ierr
   INFINITE: do
      call readline(line,ier)
      if(allocated(values))deallocate(values)
      allocate(values(len(line)/2+1))
      if(ier.ne.0)exit INFINITE
      call getvals(line,values,icount,ierr)
      write(*,'(*(g0,1x))')'VALUES=',values(:icount),'NUMBER OF VALUES=',icount,'STATUS=',ierr
   enddo INFINITE
end program tryit

Honesty, it should work reasonably with just about any line you throw at it.

PS: If you are always reading four values, using list-directed I/O and checking the iostat= value on READ and checking if you hit EOR would be very simple (just a few lines) but since you said you wanted to read lines of arbitrary length I am assuming four values on a line was just an example and you wanted something very generic.

Fortran : check data format while reading data with 'read'

3 Answers3

Linked

Related