
Inspired by this question, I tried to measure the FLOPs that TensorFlow reports for a matrix-matrix multiplication.

For two matrices A and B with sizes (m x p) and (p x n), respectively, the resulting matrix C=AB with size (m x n) has mn entries. For each entry, p multiplications and (p-1) summations are required. Hence, the total number of operations is mn(2p-1).
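For reference, here is a minimal counting sketch of my own (not part of the original question) that reproduces mn(2p-1) by walking a naive triple loop:

def naive_matmul_op_count(m, p, n):
    # Count the scalar multiplications and additions of a naive matrix product.
    mults = adds = 0
    for i in range(m):
        for j in range(n):
            mults += 1          # first product of the dot product, no addition yet
            for k in range(1, p):
                mults += 1      # one multiplication ...
                adds += 1       # ... and one addition per remaining term
    return mults + adds

print(naive_matmul_op_count(13, 9, 7))  # 1547 == 13*7*(2*9-1)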

With the code from the linked question/answer, TensorFlow reports m*n*2p instead; see the code below.

Why is this approximation returned and not the theoretical value? In the worst case, p=1, this approximation is a factor of 2 larger than the correct value.

import numpy as np
import tensorflow as tf
g = tf.Graph()
run_meta = tf.RunMetadata()
with g.as_default():
    A=tf.convert_to_tensor(np.random.rand(13,9))
    B=tf.convert_to_tensor(np.random.rand(9,7))
    C = tf.matmul(A,B) # shape=[13,7]

    opts = tf.profiler.ProfileOptionBuilder.float_operation()    
    flops = tf.profiler.profile(g, run_meta=run_meta, cmd='op', options=opts)
    if flops is not None:
        print('Flops should be ', 13*7*(2*9-1))
        print('Approximation 2*13*7*9=',2*13*7*9) 
        print('TF stats gives',flops.total_float_ops)

#Output: 
#Flops should be  1547
#Approximation 2*13*7*9= 1638
#TF stats gives 1638
2 Answers


I think this is because in practice, summations are often coded like this (pseudo-code below):

total = 0
for i in 0 .. p-1
  total += x[i] * y[i]

that is, the first product x[0] * y[0] is added to total (which is still 0 at that point), so there are p summations rather than p-1.

You could try to be smart and avoid this extra summation:

total = x[0] * y[0]
for i in 1 .. p-1
  total += x[i] * y[i]

... but then what happens if p == 0? Ouch, we need to add an extra comparison:

if p > 0
  total = x[0] * y[0]
  for i in 1 .. p-1
    total += x[i] * y[i]
else
  total = 0

The thing is, this comparison is not a flop and will not appear in your flop count -- yet in practice it is as costly, if not more costly, than a simple add.
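As a quick illustration (my own sketch, not from the original answer), counting the additions in both accumulation strategies for a single dot product:

def dot_zero_init(x, y):
    # Accumulator starts at 0: p additions and p multiplications, i.e. 2p ops.
    total, adds = 0.0, 0
    for i in range(len(x)):
        total += x[i] * y[i]
        adds += 1
    return total, adds

def dot_first_product_init(x, y):
    # Accumulator starts at x[0]*y[0]: p-1 additions, but needs a branch for p == 0.
    if not x:
        return 0.0, 0
    total, adds = x[0] * y[0], 0
    for i in range(1, len(x)):
        total += x[i] * y[i]
        adds += 1
    return total, adds

x = y = [1.0] * 9                       # p = 9
print(dot_zero_init(x, y)[1])           # 9 additions  -> 2p ops per output entry
print(dot_first_product_init(x, y)[1])  # 8 additions  -> 2p-1 ops per output entry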

Bottom line:

  • The flop calculation is probably correct if the implementation does not "optimize away" the initial sum
  • This "optimization" may actually not speed up your code
  • Take FLOP measures with a grain of salt, and don't worry too much about lower-order terms.

I'm not sure why, but this is how the value is actually computed in the TensorFlow source:

...

@ops.RegisterStatistics("MatMul", "flops")
def _calc_mat_mul_flops(graph, node):
  """Calculates the compute resources needed for MatMul."""
  transpose_a = node.attr["transpose_a"].b
  a_shape = graph_util.tensor_shape_from_node_def_name(graph, node.input[0])
  a_shape.assert_is_fully_defined()
  if transpose_a:
    k = int(a_shape[0])
  else:
    k = int(a_shape[1])
  output_shape = graph_util.tensor_shape_from_node_def_name(graph, node.name)
  output_shape.assert_is_fully_defined()
  output_count = np.prod(output_shape.as_list())
  return ops.OpStats("flops", (k * output_count * 2))

...
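Plugging the shapes from the question into this formula (my own check, not part of the answer): k is the inner dimension and output_count the number of entries of C, so

k, output_count = 9, 13 * 7       # inner dimension, number of output entries
print(k * output_count * 2)       # 1638, matching flops.total_float_ops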