A user reported an error to me where the line

read(unit_chk) ((kpt_latt(i,nkp),i=1,3),nkp=1,num_kpts)

failed with the following error (similar to the one in "Why do I get a C malloc assertion failure?"):

malloc.c:2365: sysmalloc: Assertion `(old_top == (((mbinptr)
(((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct
malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >=
(unsigned long)((((__builtin_offsetof (struct malloc_chunk,
fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) -
1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask)
== 0)' failed.
Abort

As far as I know, the error occurs only for a specific set of inputs. Also, when the I/O list of the read() is changed to the equivalent

((kpt_latt(i,nkp),i=1,3),nkp=1,(num_kpts-1)),  &
           kpt_latt(1,num_kpts),kpt_latt(2,num_kpts),kpt_latt(3,num_kpts)

the error disappears. Even compiling with a different compiler version (IntelStudio 2013 SP1 composer_xe_2013_sp1.2.144 instead of IntelStudio 2015 composer_xe_2015.6.233) made the error disappear. (This is all from the user's reports -- I have not yet reproduced the error.)

When the program is run through valgrind, it reports

valgrind: m_mallocfree.c:268 (mk_plain_bszB): Assertion 'bszB != 0' failed.
valgrind: This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.

Before that, there are a couple of messages: "Conditional jump or move depends on uninitialised value(s)", "Use of uninitialised value of size 8" and "Invalid read of size 8"; and one "Invalid write of size 1" on the statement cited above.

The array that is being read into is allocated to the proper size just one line before:

 allocate(kpt_latt(3,num_kpts))
 read(unit_chk) ((kpt_latt(i,nkp),i=1,3),nkp=1,num_kpts)

EDIT: The user has reported back with a possible solution. The array kpt_latt that is being read was declared with the wrong data type, namely as integer, while the data in the file was written as real. This is an error, of course; but is it realistic that this caused the failed malloc() assertion?

Fine print: We are talking about a default-kind integer (4 bytes) and a double precision real (8 bytes) here. The resulting bogus values in kpt_latt were not noticed because the program does not actually use them. I still have not reproduced the error myself, so I have to rely on what the user tells me.
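For reference, here is a minimal sketch of a matching write/read pair, where the array is declared with the same type and kind that was written to the file. It does not reproduce the reported crash. The file name kpt_demo.chk, the array kpt_out and the kind parameter dp are assumptions made up for this illustration; only kpt_latt, num_kpts, unit_chk and the I/O list come from the code above.

```fortran
program kpt_roundtrip
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: dp = real64   ! assumed kind of the data in the file
  integer :: num_kpts, i, nkp, unit_chk, ios
  real(dp), allocatable :: kpt_out(:,:), kpt_latt(:,:)

  ! Hypothetical test data: 3 coordinates for each of 4 k-points.
  num_kpts = 4
  allocate(kpt_out(3,num_kpts))
  do nkp = 1, num_kpts
    do i = 1, 3
      kpt_out(i,nkp) = real(i, dp) + 0.25_dp*real(nkp, dp)
    end do
  end do

  ! Write one unformatted record of double precision reals.
  open(newunit=unit_chk, file='kpt_demo.chk', form='unformatted', &
       status='replace', action='write')
  write(unit_chk) ((kpt_out(i,nkp),i=1,3),nkp=1,num_kpts)
  close(unit_chk)

  ! Read it back into an array declared as real(dp), not integer.
  allocate(kpt_latt(3,num_kpts))
  open(newunit=unit_chk, file='kpt_demo.chk', form='unformatted', &
       status='old', action='read')
  read(unit_chk, iostat=ios) ((kpt_latt(i,nkp),i=1,3),nkp=1,num_kpts)
  close(unit_chk, status='delete')

  if (ios /= 0) error stop 'read of kpt_latt failed'
  print *, 'bits per element of kpt_latt:', storage_size(kpt_latt)
  print *, 'round trip matches:', all(kpt_latt == kpt_out)
end program kpt_roundtrip
```

Printing storage_size() of the destination array is a cheap way to confirm that its element size is what the writer of the file used. The iostat= check only catches errors the runtime library detects; it is not a guarantee against a type mismatch.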

xebtl
  • *"Any pointers on how to debug this?"* The error may be somewhere else. Make sure you use all the debugging flags your compiler offers. You do not report your flags, but use `ifort -g -traceback -warn -check`. Try to isolate a small case that still causes the error. The problem may be in the way you call the subroutine that contains the above lines, for example, or in some out-of-bounds array access that overwrites malloc's private data. – Vladimir F Героям слава Jan 29 '16 at 17:39
  • @VladimirF Thank you for your suggestions. Unfortunately, I could not reproduce the bug, though the user sent me his input files. The “check” flags do not report anything on my machine, either. I think I will just have to let this go, at least as long as there is only this one report of the bug under very specific circumstances. – xebtl Feb 02 '16 at 08:33
  • I am voting to close this question as off-topic as the problem cannot be reproduced. – Vladimir F Героям слава Feb 02 '16 at 08:41
  • @VladimirF This is a special situation, which I did try to describe in the question without making it too long (I wrote “This is all from the user's reports -- I have not yet reproduced the error”). Normally, I would only post a question after reproducing the bug, but in this case (1) the user seemed to have put quite a lot of effort into characterizing it, which increased my trust that the report was accurate, and (2) it seemed clear from the outset that it was difficult to reproduce. Ok to close from my side, given the circumstances – xebtl Feb 02 '16 at 08:46
  • Yes, if there was a mismatch in the size of the data type, it can very well have caused this type of error. These kinds of errors can be avoided by using modules. Basically, what happens is that you write to some part of memory that does not belong to you but to malloc. Malloc then finds out that its data are messed up. Valgrind and address sanitizers should both be able to diagnose this error. – Vladimir F Героям слава Feb 04 '16 at 10:05
  • @VladimirF Thanks, I guess I can rest easy then ;-). If you want to post something to this effect as an answer, I'd be happy to mark it accepted. – xebtl Feb 04 '16 at 10:15
  • Hard to write anything without the full code. Just note the valgrind messages, notably the invalid read of size 8 and the invalid write. The invalid write is what can cause the heap corruption, but invalid reads are equally bad. You can't let your program have these; even if it runs, you cannot trust the results. – Vladimir F Героям слава Feb 04 '16 at 10:26
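
As a rough illustration of the remark about modules in the comments: when a procedure lives in a module, the compiler sees its explicit interface and can reject an actual argument of the wrong type at compile time. The module kpt_io and the subroutine fill_kpts below are hypothetical names invented for this sketch, not part of the program discussed above. Note that this check covers procedure arguments only; it cannot compare the contents of an unformatted file with the declaration of the variable being read.

```fortran
module kpt_io
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: dp = real64
contains
  ! Hypothetical helper: fills a 3 x n array of k-point coordinates.
  subroutine fill_kpts(kpts)
    real(dp), intent(out) :: kpts(:,:)
    integer :: i, nkp
    do nkp = 1, size(kpts, 2)
      do i = 1, size(kpts, 1)
        kpts(i,nkp) = real(i, dp)/real(nkp, dp)
      end do
    end do
  end subroutine fill_kpts
end module kpt_io

program use_module
  use kpt_io, only: dp, fill_kpts
  implicit none
  real(dp), allocatable :: kpt_latt(:,:)

  allocate(kpt_latt(3,2))
  ! Declaring kpt_latt as integer here would be a compile-time error,
  ! because the module gives the compiler the explicit interface.
  call fill_kpts(kpt_latt)
  print *, kpt_latt
end program use_module
```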
