
I am working in C++.

I have a large file of data in numpy format. The size of the file is exactly 8,964,000,128 bytes, and it consists of 128 bytes of header data (which I don't care about), followed by a 3000 x 3000 x 249 array of binary floats. (Note that 128 + 3000*3000*249*sizeof(float) = 8,964,000,128, so the file size checks out.)

I want to load the 3000 x 3000 x 249 = 2,241,000,000 float values into a vector. When I try to do that with ifstream::read, it only fills the first 93,516,352 entries of the vector with data, leaving the remaining 2,147,483,648 entries unchanged. I know they are not changed because they remain equal to their initial value of 0, and I happen to know for sure that there are no 0s in the file data. What am I doing wrong? I note that the number of unchanged entries is INT_MAX + 1, which can't be a coincidence. So perhaps there is some kind of overflow occurring somewhere?

Here is a minimal working example that generates this problem:

#include <iostream>
#include <fstream>
#include <vector>
#include <climits>
#include <algorithm>

int main(){
    std::ifstream ifs("fine_FA_S16_I000_recon.npy", std::ifstream::binary);
    ifs.seekg(128); // Skip header

    size_t nx = 3000, ny = 3000, nz = 249;
    std::vector<float> data(nx*ny*nz, 0);
    ifs.read(reinterpret_cast<char*>(&data[0]), sizeof(float)*nx*ny*nz);

    size_t zeroes = std::count(data.begin(), data.end(), 0);
    std::cout << "Size of vector = " << data.size() << '\n';
    std::cout << "Number of zeroes = " << zeroes << '\n';
    std::cout << "INT_MAX = " << INT_MAX << '\n';
    return 0;
}

The result I was expecting is:

Size of vector = 2241000000
Number of zeroes = 0
INT_MAX = 2147483647

The result I get is:

Size of vector = 2241000000
Number of zeroes = 2147483648
INT_MAX = 2147483647

I'm sure I'm doing something silly, but I can't see it.

John Barber
  • @tadman It's floats, not integers, which are stored in the file. I know how the data is stored because 1) I know it's in numpy format, and 2) I checked. What do you propose instead of writing to the internals of std::vector? – John Barber Jan 10 '23 at 20:41
  • I mean more generically "numbers", as in the `float` form varies from one ISA to another in significant ways. Unless you're completely sure this is written on the same machine, in the form your ISA expects, it could be as much as junk data. Maybe what you need is a numpy data format reader, or use some other neutral form like [Parquet](https://github.com/apache/parquet-cpp). – tadman Jan 10 '23 at 20:42
  • Recommendation: Test the stream state to see if it knows it died early and confirm the amount read with `gcount`. – user4581301 Jan 10 '23 at 20:42
  • What's `numpy` format and is it the same as IEEE 754? – Richard Critten Jan 10 '23 at 20:42
  • @tadman But we wouldn't expect the junk data to consist of all 0s in that case, since the original data that got scrambled was not 0s. Or is that wrong? – John Barber Jan 10 '23 at 20:43
  • You shouldn't assume it's fine to dump in as-is. Maybe you're reading the file wrong. In short, **use a numpy reader**, like [perhaps this one](https://github.com/llohse/libnpy). You should also experiment with populating the array with something like `NaN` first to be sure you're not actually reading on zeroes. If you are going to do this rawdog style, maybe read in a small buffer, then use `push_back` to add to the array only elements you've *confirmed you've read* via the return of `read`. – tadman Jan 10 '23 at 20:44
  • A 32 bit float should look like this: [https://www.h-schmidt.net/FloatConverter/IEEE754.html](https://www.h-schmidt.net/FloatConverter/IEEE754.html) – drescherjm Jan 10 '23 at 20:45
  • @user4581301 I tested the stream state with gcount, and the result is 374065408 bytes read, which is not enough. Also, the eofbit gets set. So it thinks it's hitting the end of the file. I don't know why, since the file contains much more data than that. – John Barber Jan 10 '23 at 20:45
  • @RichardCritten numpy is the standard format for storing python arrays in a file. There's a link to the numpy specs in the text of my post. – John Barber Jan 10 '23 at 20:46
  • @tadman I tried populating the array with (arbitrarily) 7s instead of 0s. The same thing happens, with the 7s unchanged. – John Barber Jan 10 '23 at 20:47
  • @tadman Even if I'm reading the file wrong, I don't think that explains why it's hitting eof after reading a number of bytes much less than the actual file size. – John Barber Jan 10 '23 at 20:50
  • Clang and GCC say the `count` parameter to `ifs.read` is only a `long` (the spec says implementation-defined). Live: https://godbolt.org/z/q5T9d89cn . MSVC uses `__int64`. – Richard Critten Jan 10 '23 at 20:53
  • File was opened `binary`, so it shouldn't trip over an EOF character. There's no magical significance in 374065408, but it does happen to be 3000*3000*249 *4 in 32 bit signed wrap-around. That'll be the point @RichardCritten just made, I believe. – user4581301 Jan 10 '23 at 20:56
  • @user4581301 Does that mean the second argument to read is being cast to an int somehow? – John Barber Jan 10 '23 at 20:58
  • 8964000000 ->0x2164BC900 -> 0x164BC900 -> 374065408. Richard has it nailed. If the parameter is a `long`, you're getting truncated to signed 32 bit. – user4581301 Jan 10 '23 at 21:00
  • @SamVarshavchik Doing that gives me 8. Does this mean there's no way on my system to do what I'm trying to do? – John Barber Jan 10 '23 at 21:03
  • I note that if I change everything from size_t to long it doesn't solve the problem. – John Barber Jan 10 '23 at 21:05
  • @RichardCritten I'm not sure I follow. If the second argument to read is a long, that should still be big enough, shouldn't it? – John Barber Jan 10 '23 at 21:06
  • It won't. If you dig into the link Richard gave, you'll see `call std::basic_istream<char, std::char_traits<char> >::read(char*, long)`. Your number will be truncated when you make the call to read. You'll have to slurp up the file in blocks with multiple `read` calls. – user4581301 Jan 10 '23 at 21:07
  • @JohnBarber I am not sure, it was just unexpected. I was expecting `std::size_t` or some other unsigned 64-bit type. It will depend on your system. – Richard Critten Jan 10 '23 at 21:07
  • Depends on how big `long` is on your system. Print out `sizeof (long)`. Typically it's 4. You need 5 or more. I'm just as surprised that it's not `size_t`. – user4581301 Jan 10 '23 at 21:08
  • @user4581301 I get sizeof(long) = sizeof(size_t) = 8 – John Barber Jan 10 '23 at 21:09
  • In that case..., WTF... You're definitely getting truncated, but by who? – user4581301 Jan 10 '23 at 21:10
  • @user4581301 Hence my initial confusion. – John Barber Jan 10 '23 at 21:11
  • I recommend adding compiler, compiler version and your build target to the question. I could be overlooking something, but a deeper dive into the tools and the library that comes with those tools might be necessary. – user4581301 Jan 10 '23 at 21:19
  • @user4581301 I'm compiling with gcc version 11.2.0. I'm afraid I don't know what a "build target" is. – John Barber Jan 10 '23 at 21:22
  • You should check size of `std::streamsize`. If it's 4, then you can read at most `2^31-1` bytes per operation. – ALX23z Jan 10 '23 at 21:32
  • Target: are you compiling for Windows, Linux, Linux on an ARM, some itty-bitty little microcontroller... And yeah, `streamsize` is the right guy to look at. On your system `streamsize` might not resolve to `long` or `size_t` at all and we'll get another fun surprise. – user4581301 Jan 10 '23 at 21:40
  • @ALX23z I get sizeof(std::streamsize) = 8 – John Barber Jan 10 '23 at 21:42
  • @user4581301 I'm running on Windows under Cygwin. – John Barber Jan 10 '23 at 21:42
  • Then it seems like some kind of bug in the implementation. Try calling read in a loop, see if a similar issue arises. – ALX23z Jan 10 '23 at 21:50
  • Cygwin can be a bit of an odd duck. If you don't need the POSIX compatibility layer it offers you're usually better off with compiling natively for Windows with the tools from [MSYS2.](https://stackoverflow.com/a/30071634/4581301) – user4581301 Jan 10 '23 at 22:03
  • I just tried the same code on a Linux system, and the problem does NOT occur. So I guess Cygwin is the culprit. – John Barber Jan 10 '23 at 22:11
  • Cygwin must be calling through to WinAPI [`ReadFile`](https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-readfile), which only takes a [`DWORD`](https://learn.microsoft.com/en-us/windows/win32/winprog/windows-data-types) for `nNumberOfBytesToRead`, which is _"A 32-bit unsigned integer."_ So unless Cygwin is reading the file in multiple chunks at a layer above the OS this will be a problem. – Richard Critten Jan 10 '23 at 22:14
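
The comments above point to two things: the requested byte count of 8,964,000,000 keeps only 374,065,408 once it is reduced to 32 bits, and a single large `read` can be replaced by several smaller ones. Below is a minimal C++ sketch of both. It assumes nothing about Cygwin's internals, and `low32`, `read_chunked`, `self_test`, and the 1 GiB default chunk size are hypothetical names and choices, not something taken from the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: the value that survives if a 64-bit byte count is
// reduced to its low 32 bits somewhere along the call chain.
std::uint32_t low32(std::uint64_t n) {
    return static_cast<std::uint32_t>(n);  // well-defined modulo 2^32
}

// Hypothetical helper: read `total` bytes into `dst` using several smaller
// read() calls of at most `chunk` bytes each. Returns the number of bytes
// actually read, which is less than `total` on EOF or error.
std::size_t read_chunked(std::istream& in, char* dst, std::size_t total,
                         std::size_t chunk = std::size_t{1} << 30) {
    if (chunk == 0) return 0;  // avoid looping forever on a zero chunk size
    std::size_t done = 0;
    while (done < total) {
        const std::size_t want = std::min(chunk, total - done);
        in.read(dst + done, static_cast<std::streamsize>(want));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        done += got;
        if (got < want) break;  // short read: stop and report what arrived
    }
    return done;
}

// Small self-check on an in-memory stream (no file needed).
bool self_test() {
    std::istringstream in{std::string(1000, 'x')};
    std::vector<char> buf(1000, 0);
    const std::size_t n = read_chunked(in, buf.data(), buf.size(), 64);
    return n == 1000 && std::count(buf.begin(), buf.end(), 'x') == 1000;
}
```

With the question's variables, the single `ifs.read(...)` call would become something like `read_chunked(ifs, reinterpret_cast<char*>(data.data()), sizeof(float)*nx*ny*nz)`. Comparing the return value with the requested byte count shows whether the whole array arrived. Whether this sidesteps the Cygwin behaviour was not confirmed in the thread.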
