MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with xen

Question

I am using TensorFlow version:

0.12.1

The CUDA toolkit version is 8:

lrwxrwxrwx  1 root root   19 May 28 17:27 cuda -> /usr/local/cuda-8.0

As documented here, I have downloaded and installed cuDNN. But while executing the following line from my Python script, I get the error messages mentioned in the title:

  model.fit_generator(train_generator,
   steps_per_epoch= len(train_samples),
   validation_data=validation_generator, 
   validation_steps=len(validation_samples),
   epochs=9)

Detailed error message is as follows:

Using TensorFlow backend. 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally 
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
  File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File " lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
  File "model_new.py", line 82, in <module>
    model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

Any suggestion to resolve this error is appreciated.

EDIT: Issue is fatal.

uname -a
Linux ip-172-31-76-109 4.4.0-78-generic #99-Ubuntu SMP
Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

sudo lshw -short
[sudo] password for carnd:
H/W path    Device  Class      Description
==========================================
                    system     HVM domU
/0                  bus        Motherboard
/0/0                memory     96KiB BIOS
/0/401              processor  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
/0/402              processor  CPU
/0/403              processor  CPU
/0/404              processor  CPU
/0/405              processor  CPU
/0/406              processor  CPU
/0/407              processor  CPU
/0/408              processor  CPU
/0/1000             memory     15GiB System Memory
/0/1000/0           memory     15GiB DIMM RAM
/0/100              bridge     440FX - 82441FX PMC [Natoma]
/0/100/1            bridge     82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1          storage    82371SB PIIX3 IDE [Natoma/Triton II]
/0/100/1.3          bridge     82371AB/EB/MB PIIX4 ACPI
/0/100/2            display    GD 5446
/0/100/3            display    GK104GL [GRID K520]
/0/100/1f           generic    Xen Platform Device
/1          eth0    network    Ethernet interface

EDIT 2:

This is an EC2 instance in the Amazon cloud, and all the numa_node files hold the value -1.

:/sys$ find . -name numa_node -exec cat '{}' \;
find: ‘./fs/fuse/connections/39’: Permission denied
-1
-1
-1
-1
-1
-1
-1
find: ‘./kernel/debug’: Permission denied

EDIT 3: After updating the numa_node files, the NUMA-related message disappeared. But all the other errors listed above remain, and I again got a fatal error.

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
  File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File " lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
  File "model_new.py", line 85, in <module>
    model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

What is your OS (kernel and distribution version), what is the host machine (is it multisocket Xeon - check your /proc/cpuinfo or `lscpu` or `lshw -short`); what is `lspci`? Check also https://github.com/tensorflow/tensorflow/issues/2264. Is this "`NUMA node read from SysFS had negative value (-1)`" message the only error you see, is it critical (fatal) error or just warning? Does your program work? — osgx, May 28 '17 at 23:32
@osgx Thank you for your follow up questions. I have also noticed the github link you referred but it was concluded by saying Kernel has no NUMA support. I am not sure how can check or validate the NUMA support. I have edited the question by adding answers for all your additional questions. Thanks a lot — Steephen, May 28 '17 at 23:46
Sorry, first version of answer was incorrect. Just updated it. Please, do `grep . /sys/bus/pci/devices/*/numa_node` to show us real values of this special file on your platform. Linux Kernel docs says it is error in Xen Platform emulation if Linux kernel can't fill the special file correctly. — osgx, May 29 '17 at 00:47
@osgx thanks a lot for helping me in this issue. I have edited my question by listing the current values from the EC2 instance — Steephen, May 29 '17 at 01:00
Miniscript of `for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done` should help you. Report the error (Linux sets incorrect `numa_node`) to amazon (and/or google at https://github.com/tensorflow/tensorflow/issues) and/or Ubuntu to stop the rich stream of "SysFS had negative value" reports from everyone who tries TensorFlow on AWS. — osgx, May 29 '17 at 01:08
Steephen, the "negative numa_node" is not fatal error, it is warning. The error is MemoryError, it is in your python code (don't know about PC RAM or GPU RAM). It is not reproducible without full code or [minimal reduced sample](https://stackoverflow.com/help/mcve). Search for words MemoryError and model.fit_generator in google.... Ask https://github.com/fchollet/keras author... — osgx, May 29 '17 at 02:29

osgx · Accepted Answer · 2019-05-12T19:35:07.010

The code below is what prints the message "successful NUMA node read from SysFS had negative value (-1)". It is not a fatal error, just a warning. The real error is the MemoryError raised from File "model_new.py", line 85, in <module>. We need more of your source to diagnose it. Try making your model smaller, or run on a server with more RAM.
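Two details in the posted traceback may be worth checking, although the asker's generator code is not shown, so what follows is only a guess. The StopIteration in Thread-1 means the generator passed to fit_generator ran out of data, whereas Keras expects it to loop indefinitely. In Keras 2, steps_per_epoch also counts batches, not samples. The MemoryError itself is raised in np.asarray while a fed batch is converted, so smaller batches may help. A minimal sketch in plain Python, where `batch_generator`, `steps_per_epoch`, the sample list and the batch size are all hypothetical stand-ins:

```python
import math
from itertools import islice


def batch_generator(samples, batch_size):
    """Yield consecutive batches and wrap around forever."""
    while True:  # Keras keeps pulling batches, so never exhaust the generator
        for start in range(0, len(samples), batch_size):
            yield samples[start:start + batch_size]


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover the data once."""
    return math.ceil(num_samples / batch_size)


# Hypothetical stand-ins for the asker's data and batch size.
train_samples = list(range(10))
batch_size = 4

steps = steps_per_epoch(len(train_samples), batch_size)
one_epoch = list(islice(batch_generator(train_samples, batch_size), steps))
print(steps, one_epoch)
```

With a generator of this shape, the call would pass `steps_per_epoch=steps` rather than `len(train_samples)`. Whether that removes the MemoryError depends on the real batch contents, which the question does not show.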

About NUMA node warning:

https://github.com/tensorflow/tensorflow/blob/e4296aefff97e6edd3d7cee9a09b9dd77da4c034/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc#L855

// Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
// of SysFS. Returns -1 if it cannot...
static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) 
{...
  string filename =
      port::Printf("/sys/bus/pci/devices/%s/numa_node", pci_bus_id.c_str());
  FILE *file = fopen(filename.c_str(), "r");
  if (file == nullptr) {
    LOG(ERROR) << "could not open file to read NUMA node: " << filename
               << "\nYour kernel may have been built without NUMA support.";
    return kUnknownNumaNode;
  } ...
  if (port::safe_strto32(content, &value)) {
    if (value < 0) {  // See http://b/18228951 for details on this path.
      LOG(INFO) << "successful NUMA node read from SysFS had negative value ("
                << value << "), but there must be at least one NUMA node"
                            ", so returning NUMA node zero";
      fclose(file);
      return 0;
    }

TensorFlow was able to open the /sys/bus/pci/devices/%s/numa_node file, where %s is the PCI bus ID of the GPU (string pci_bus_id = CUDADriver::GetPCIBusID(device_)). Your machine is not multi-socket: there is only a single CPU socket with an 8-core Xeon E5-2670 installed, so this value should be 0 (a single NUMA node is numbered 0 in Linux), but the message says that the file contained -1!
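To make the logic of the C++ excerpt easier to follow, here is a rough Python paraphrase of it. It is a sketch, not TensorFlow code: the function name `read_numa_node` and the `sysfs_root` parameter are made up for illustration, and the demo runs against a throwaway directory instead of the real /sys so that it needs no GPU or root access.

```python
import os
import tempfile


def read_numa_node(pci_bus_id, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node for a PCI device, mimicking the excerpt above.

    -1 means the file could not be read or parsed; a negative value in
    the file is mapped to node 0, as the TensorFlow log message describes.
    """
    path = os.path.join(sysfs_root, pci_bus_id, "numa_node")
    try:
        with open(path) as f:
            value = int(f.read().strip())
    except (OSError, ValueError):
        return -1  # corresponds to kUnknownNumaNode in the excerpt
    return 0 if value < 0 else value


# Demo on a fake sysfs tree (hypothetical layout, same shape as /sys).
with tempfile.TemporaryDirectory() as root:
    dev_dir = os.path.join(root, "0000:00:03.0")
    os.makedirs(dev_dir)
    with open(os.path.join(dev_dir, "numa_node"), "w") as f:
        f.write("-1\n")
    node = read_numa_node("0000:00:03.0", sysfs_root=root)
    missing = read_numa_node("0000:00:04.0", sysfs_root=root)
    print(node, missing)
```

The point of the sketch is that a -1 in the file is handled gracefully: the caller simply gets node 0 back, which is why the message is informational.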

So we know that sysfs is mounted at /sys, the numa_node special file exists, and CONFIG_NUMA is enabled in your Linux kernel config (check with zgrep NUMA /boot/config* /proc/config*). It is indeed enabled (CONFIG_NUMA=y) in the deb package of your x86_64 4.4.0-78-generic kernel.

The special file numa_node is documented in https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci (is the ACPI of your machine wrong?):

What:       /sys/bus/pci/devices/.../numa_node
Date:       Oct 2014
Contact:    Prarit Bhargava <prarit@redhat.com>
Description:
        This file contains the NUMA node to which the PCI device is
        attached, or -1 if the node is unknown.  The initial value
        comes from an ACPI _PXM method or a similar firmware
        source.  If that is missing or incorrect, this file can be
        written to override the node.  In that case, please report
        a firmware bug to the system vendor.  Writing to this file
        taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
        reduces the supportability of your system.

There is a quick (kludge) workaround for this message: find the numa_node file of your GPU and, as root, run the following command after every boot, where NNNNN is the PCI ID of your card (look it up in the lspci output and in the /sys/bus/pci/devices/ directory):

echo 0 | sudo tee -a /sys/bus/pci/devices/NNNNN/numa_node

Or just echo it into every such file; this should be rather safe:

for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done
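A slightly more conservative variant of the loop above would overwrite only the files that currently read -1 and leave any valid node number alone. The sketch below is one possible way to do that in Python; the function name `fix_unknown_numa_nodes` and its parameters are hypothetical, and on a real system it would have to run as root against /sys/bus/pci/devices. The demo uses a temporary directory with the same layout.

```python
import os
import tempfile


def fix_unknown_numa_nodes(sysfs_root="/sys/bus/pci/devices", node=0):
    """Write `node` into every numa_node file that currently holds -1.

    Returns the list of device names that were changed.
    """
    changed = []
    for name in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, name, "numa_node")
        try:
            with open(path) as f:
                current = f.read().strip()
        except OSError:
            continue  # device without a readable numa_node file
        if current == "-1":
            with open(path, "w") as f:
                f.write("{}\n".format(node))
            changed.append(name)
    return changed


# Demo on a fake tree: one device reports -1, the other a valid node.
with tempfile.TemporaryDirectory() as root:
    for name, value in (("0000:00:03.0", "-1"), ("0000:00:02.0", "1")):
        os.makedirs(os.path.join(root, name))
        with open(os.path.join(root, name, "numa_node"), "w") as f:
            f.write(value + "\n")
    changed = fix_unknown_numa_nodes(sysfs_root=root)
    with open(os.path.join(root, "0000:00:03.0", "numa_node")) as f:
        fixed_value = f.read().strip()
    with open(os.path.join(root, "0000:00:02.0", "numa_node")) as f:
        untouched_value = f.read().strip()
    print(changed, fixed_value, untouched_value)
```

As with the shell version, the change does not survive a reboot, and according to the kernel documentation quoted above, writing to the real file taints the kernel.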

Also, your lshw output shows that this is not a physical machine but a Xen virtual guest. Something is wrong between the Xen platform (ACPI) emulation and the Linux PCI bus NUMA-support code.

Is there a proper way to set valid numa node permanently? The workaround above works well for me, but it is very uncomfortable to set it each time manually. — 18augst, Apr 12 '20 at 17:46
A simpler way to identify the path to the `numa_node` file (that can be scripted) without having to set the entire system is to use `echo 0 | tee /sys/module/nvidia/drivers/pci:nvidia/*/numa_node`. Note, you also don't need a `for` loop. That's what `tee` is for. — phemmer, May 05 '22 at 12:41

normanius · Answer 2 · 2022-01-12T16:58:24.090

This amends the accepted answer:

Annoyingly, the numa_node setting is reset to -1 every time the system is rebooted. To fix this more persistently, you can create a crontab entry (as root).

The following steps worked for me:

# 1) Identify the PCI-ID (with domain) of your GPU
#    For example: PCI_ID="0000:81:00.0"
lspci -D | grep NVIDIA
# 2) Add a crontab for root
sudo crontab -e
#    Add the following line
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")

This guarantees that the NUMA affinity is set to 0 for the GPU device on every reboot.

Again, keep in mind that this is only a "shallow" fix as the Nvidia driver is unaware of it:

nvidia-smi topo -m
#       GPU0  CPU Affinity  NUMA Affinity
# GPU0     X  0-127         N/A
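Step 1 of the crontab recipe (finding the PCI ID of the GPU) can also be scripted without parsing lspci output, because every device directory in sysfs has a `vendor` file, and NVIDIA's PCI vendor ID is 0x10de. The following is only a sketch: `nvidia_pci_ids` and `sysfs_root` are illustrative names, and the demo runs on a fake directory tree. Unlike a grep for "VGA compatible controller", it matches every NVIDIA function, including a GPU's audio function if present.

```python
import os
import tempfile

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def nvidia_pci_ids(sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses whose sysfs vendor file names NVIDIA."""
    ids = []
    for name in sorted(os.listdir(sysfs_root)):
        try:
            with open(os.path.join(sysfs_root, name, "vendor")) as f:
                vendor = f.read().strip().lower()
        except OSError:
            continue  # skip entries without a readable vendor file
        if vendor == NVIDIA_VENDOR_ID:
            ids.append(name)
    return ids


# Demo on a fake tree with one NVIDIA and one Intel (0x8086) device.
with tempfile.TemporaryDirectory() as root:
    for name, vendor in (("0000:81:00.0", "0x10de"), ("0000:00:01.0", "0x8086")):
        os.makedirs(os.path.join(root, name))
        with open(os.path.join(root, name, "vendor"), "w") as f:
            f.write(vendor + "\n")
    found = nvidia_pci_ids(sysfs_root=root)
    print(found)
```

Each returned address can be substituted for <PCI_ID> in the crontab line.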

edited Jan 12 '22 at 16:58

answered Dec 04 '21 at 11:42

normanius

Super, this worked for me too! – Ferenc Lippai Mar 01 '22 at 07:01

@normanius, "Nvidia driver is unaware of it": what are the practical consequences of that, and is it important to find a solution to that? – Corrado Mar 03 '22 at 07:53

Griff · Answer 3 · 2022-03-21T12:33:58.497

WOW!!! Thanks very much for this information @normanius. This is the only solution that worked for me on my system (was getting 'read only file system error' with other solutions). Here is the script I use (not as a cron job but as a '/etc/local.d/numa_node.start' bash script for use on a sane non-Systemd linux operating system with OpenRC).

#!/bin/bash
for pcidev in $(lspci -D|grep 'VGA compatible controller: NVIDIA'|sed -e 's/[[:space:]].*//'); do echo 0 > /sys/bus/pci/devices/${pcidev}/numa_node; done

No need for a numa_node.stop script because... well it resets after reboot.

The manual/references for 'sed' or 'lspci' or 'bash' can be found either by running something like 'man sed' from any good Linux bash prompt or by reading an online manpage resource such as: https://linux.die.net/man/1/sed
