58

I am trying to compile TensorFlow from source (using a clone of their git repo from 2019-01-31). I installed Bazel from their shell script (https://github.com/bazelbuild/bazel/releases/download/0.21.0/bazel-0.21.0-installer-linux-x86_64.sh).

I ran ./configure in the TensorFlow source tree and accepted the default settings, except for adding my machine-specific -m options (-mavx2 -mfma) and pointing Python to the correct python3 location (/usr/bin/py3). I then ran the following command, as per the TensorFlow instructions:

bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package //tensorflow:libtensorflow_framework.so //tensorflow:libtensorflow.so

That build keeps running and running; I haven't seen it complete yet (though I can only let it run for a maximum of about 10 hours). It produces a ton of INFO: messages and warnings regarding signed and unsigned integers and control reaching the end of non-void functions. None of these appear fatal. Compilation continues to tick along, with the two numbers continuing to grow ('[N,NNN / X,XXX] 4 actions running') and files ticking through 'Compiling'.

The machine is an EC2 instance with ~16GiB of RAM. The CPU is an 'Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz' with, I believe, 4 cores, and there is plenty of HDD space (although compilation seems to eat QUITE a bit, > 1GiB).

Any ideas on what's going on here?

Innat
Zonyl
  • 11
    it finally did finish. `INFO: Elapsed time: 13093.267s, Critical Path: 223.69s` `INFO: 11991 processes: 11991 local.` `INFO: Build completed successfully, 12816 total actions` – Zonyl Feb 05 '19 at 21:21
  • 1
    INFO: Elapsed time: 24005.461s, Critical Path: 245.38s INFO: 18569 processes: 18569 local. INFO: Build completed successfully, 19582 total actions Intel 2 core i7-7500U CPU @ 2.70GHz - 16GiB Debian 10 – Corrado Jul 10 '19 at 12:33
  • INFO: Elapsed time: 7866.310s, Critical Path: 328.71s INFO: 17138 processes: 17138 local. INFO: Build completed successfully, 23724 total actions Intel 6 core i9-8950HK (tf 1.14, bazel 0.24.1, cuda 10.1/cudnn 7.6 enabled, other ./configure options at default) – ambientlight Jul 17 '19 at 12:28
  • INFO: Elapsed time: 11134.316s, Critical Path: 229.09s INFO: 17804 processes: 17804 local. INFO: Build completed successfully, 18538 total actions. Intel Core i7 @2.8Ghz, 16GB RAM, Ubuntu 18.04 64bit – greco.roamin Oct 17 '19 at 12:50
  • TF 1.15 on MacBook Pro (8 cores): Elapsed time: 14524s (4h,2m), Critical Path: 534s, 18267 processes. 19,308 files to compile. – Jared Nielsen Oct 19 '19 at 00:57
  • TF 2.0 INFO: Elapsed time: 6276.519s, Critical Path: 227.95s, 18346 processes, INFO: Build completed successfully, 26984 total actions, Intel 9750H 16GB ram, TensorRT 6 enabled Cuda10.1, cudnn 7.6. – rahduro Nov 04 '19 at 07:24
  • TF 2.0 on Windows: INFO: Elapsed time: 27456.549s, Critical Path: 24614.21s INFO: 9729 processes: 9729 local. INFO: Build completed successfully, 14818 total actions All optimizations enabled, cuda enabled (CUDA 10.1), AVX2 Intel Core i9 9900K 64Gb RAM 1 TB SSD Samsung 970 Pro – Alexander Egorov Nov 08 '19 at 05:54
  • TF 1.15 on Linux: INFO: Elapsed time: 19688.908s, Critical Path: 294.49s INFO: 18745 processes: 18745 local. INFO: Build completed successfully, 25664 total actions. Phenom II 965 CPU (compiling so avx instructions this old CPU doesn't support are not used) – carthurs Mar 24 '20 at 21:35
  • TF 2.1.0 on Linux (4G RAM): `INFO: Elapsed time: 259105.554s, Critical Path: 724.75s INFO: 15927 processes: 15927 local. INFO: Build completed successfully, 16902 total actions` (compiling in order to get rid of AVX instructions) – Andy May 25 '20 at 16:39
  • I'm compiling to add AVX2 to see if that improves performance. I'm at 88849s so far on a 6 core MBP with 16GB. It seems to be fully consuming 3 cores and limited by memory and paging to disk. – Dan May 28 '20 at 17:37
  • It seems that you experienced what I wanted to try. So this means adding this --config=opt and AVX2 flag will actually increase the pace of building too? – Shilan Jun 11 '20 at 11:34
  • @Dan Did it help? A question, how did you add avx2 flags when running bazel command? I have only added them to configure.py when it asked for it. But should I also add them to bazel build command? – Shilan Jun 11 '20 at 11:40
  • 4
    @Shilan, it ran for a week straight and finally ran out of memory and crashed. I built a new desktop, and it finished in less than an hour. I'm doing everything in Docker and only adding the option at the command line. My laptop has 16GB and my desktop has 32GB. My laptop is a seventh gen 6-core Intel and my desktop is an AMD 3900X with 12-cores. I'm guessing the laptop was paging to disk too much. There's a flag for low resource systems, but I didn't notice it until today. – Dan Jun 13 '20 at 01:51
  • (built 2.2.0 on a 2014 Macbook Pro) INFO: Elapsed time: 25957.062s, Critical Path: 1430.37s INFO: 16460 processes: 16460 local. INFO: Build completed successfully, 17301 total actions – Matt Groth Jun 23 '20 at 22:13
  • TF2 EC2 p3.16xlarge (ludicrous vCPUs and RAM) INFO: Elapsed time: 2684.078s, Critical Path: 471.02s INFO: 25964 processes: 25964 local. INFO: Build completed successfully, 36493 total actions – Rich Andrews Aug 08 '20 at 01:00
  • 2
    You could try using Google Colab. I use it for all my TensorFlow deep learning programs, because it is free and has TensorFlow and everything else pre-installed. – Jacob Ward Dec 03 '20 at 01:18
  • 1
    6 and half hours on colab and still running – Aditya Kane Apr 20 '21 at 09:53
  • I suspect the slowness is caused by the Bazel build system, because compiling even one small C file (only 500 lines of code) takes more than 20 seconds, which would be unheard of with any other build system. – Clock ZHONG Feb 23 '22 at 13:40
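
To compare the timings reported in the comments above, it can help to convert Bazel's "Elapsed time" values from seconds into hours and minutes. A minimal sketch; the helper name fmt_elapsed is hypothetical and not part of Bazel:

```shell
# Convert a Bazel "Elapsed time" value (seconds, optionally fractional)
# into whole hours and minutes.
fmt_elapsed() {
  secs=${1%.*}   # drop the fractional part, e.g. 13093.267 -> 13093
  printf '%dh %02dm\n' $((secs / 3600)) $(((secs % 3600) / 60))
}

fmt_elapsed 13093.267   # the original poster's build
fmt_elapsed 259105.554  # the 4G RAM build
```

On those two values this gives roughly three and a half hours versus almost three days, which makes the effect of low memory much easier to see than the raw second counts.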

2 Answers

2

Unfortunately, some programs can take a long time to compile. A couple of hours of compilation is not unusual for TensorFlow on your setup.

There are reports of it taking 50 minutes on a considerably faster machine.

One solution to this problem is to use the pre-compiled binaries that are available through pip. Instructions can be found here: https://www.tensorflow.org/install/pip.html

Basically you can do this:

pip install tensorflow

If you require a specific older version, like 1.15, you can do this:

pip install tensorflow==1.15

For GPU support, add -gpu to the package name, like this:

pip install tensorflow-gpu

And:

pip install tensorflow-gpu==1.15
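
After installing, it may be worth confirming that the package imports and which version pip actually picked. A minimal sketch, assuming python3 is the interpreter that pip installed into; check_tf is a hypothetical helper name:

```shell
# Print the installed TensorFlow version, or a short notice if the import fails.
# stderr is discarded to hide TensorFlow's startup log noise; remove the
# 2>/dev/null to see the real error when the import fails.
check_tf() {
  python3 -c 'import tensorflow as tf; print(tf.__version__)' 2>/dev/null \
    || echo "tensorflow is not importable from python3"
}

check_tf
```

If the notice is printed instead of a version number, pip most likely installed into a different Python than the one python3 resolves to.
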
0

Here's what I observed when I ran the command below.

A few of the available config options:

--config=dbg —Build with debug info.
--config=mkl —Support for the Intel® MKL-DNN.
--config=monolithic —Configuration for a mostly static, monolithic build.

bazel build --config=monolithic //tensorflow/tools/pip_package:build_pip_package

First, it compiles a large number of C++ files with clang, which consumes most of the build time. Part of the time may also depend on the CDN from which the build downloads source code (I'm not sure whether there's an option to select a specific CDN).

Second, I have observed that some compiled steps produce generated code which in turn has to be compiled.

For example, the completed / total counter initially read something like [1000 / 5000]. The total creeps up as the completed count increases, so later it might read [2500 / 7500]. The build is effectively playing catch-up, and ultimately the overall time depends on the system configuration (RAM, CPU).

A third option is to experiment with the different config options listed at the top.

Building TensorFlow from source can use a lot of RAM. If your system is memory-constrained, limit Bazel's RAM usage with: --local_ram_resources=2048.
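
As a sketch of how that flag fits into a full command line, the helper below only prints the command so it can be inspected before running it. tf_build_cmd is a hypothetical name and the target is the pip-package target from the question. Pairing the RAM limit with --jobs (Bazel's cap on concurrent actions) is an assumption about what helps on a small machine, not something measured here. --local_ram_resources takes a value in MB and needs a Bazel release that recognizes it; older releases may not.

```shell
# Print (do not run) a TensorFlow build command capped to a RAM budget in MB
# and a maximum number of parallel Bazel jobs.
tf_build_cmd() {
  ram_mb=$1
  jobs=$2
  printf '%s\n' "bazel build --config=opt --local_ram_resources=${ram_mb} --jobs=${jobs} //tensorflow/tools/pip_package:build_pip_package"
}

tf_build_cmd 2048 2
```

Fewer jobs means fewer compiler processes holding memory at once, at the cost of a longer build, so the two numbers are worth tuning together.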

Additionally, adding the option -march=native may help; it tells the compiler to generate code for the CPU of the build machine. Ref: https://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/i386-and-x86_002d64-Options.html
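
One way to pass such compiler flags is on the Bazel command line with --copt, which forwards an option to the C/C++ compiler. The other route is the one used in the question, entering them when ./configure asks for optimization flags. The sketch below only prints the resulting command; copts is a hypothetical helper name.

```shell
# Turn a list of compiler flags into repeated Bazel --copt arguments.
copts() {
  out=""
  for flag in "$@"; do
    out="${out:+$out }--copt=$flag"
  done
  printf '%s\n' "$out"
}

# Print (do not run) a build command using -march=native.
printf '%s\n' "bazel build --config=opt $(copts -march=native) //tensorflow/tools/pip_package:build_pip_package"
```

The same helper works for the flags from the question: copts -mavx2 -mfma yields --copt=-mavx2 --copt=-mfma.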

NOTE: these are just my observations from running on a Mac Pro (32 GB memory, Intel i9 8-core, macOS Ventura 13.2).