
I am testing a small Bigtable cluster (the minimum of 3 nodes). I can see in the Google Cloud console that as the write QPS approaches 10K, CPU utilization approaches the recommended maximum of ~80%.

From what I understand, the QPS metric is for the whole instance, not per node. If that's the case, why is the CPU threshold being reached when the instance's QPS load is only about a third of the 30K guidance maximum (10K writes/sec per node × 3 nodes)? I'm trying to work out whether something is off with my data-upload program (which runs via Dataflow).

I'm also curious why I never observe anything close to 30K writes/sec, but I suspect this is due to limitations on the Dataflow side, since I'm still restricted to the 8-CPU quota while on trial...

VS_FF

1 Answer


The CPU graph is the definitive indicator that Bigtable is overloaded. Unfortunately, QPS isn't the ideal metric for determining the root cause of the overload since we added the bulk write API. Bigtable/Dataflow loading uses the Cloud Bigtable bulk APIs, which send multiple requests in a single batch, so one query can now carry a few dozen update requests. "Rows updated per second" would be a better metric, but alas it does not yet exist on the Cloud Bigtable side. There is an equivalent metric in the Dataflow UI at the Cloud Bigtable step, and you can use that number to judge Cloud Bigtable performance.
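
To make the batching concrete, here is a minimal sketch of a bulk write through the HBase-compatible Cloud Bigtable client's `BufferedMutator`; the project, instance, table, and column-family names are placeholders, not from the question:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import com.google.cloud.bigtable.hbase.BigtableConfiguration;

public class BulkWriteExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical project/instance/table names for illustration.
    try (Connection connection =
            BigtableConfiguration.connect("my-project", "my-instance");
         BufferedMutator mutator =
            connection.getBufferedMutator(TableName.valueOf("my-table"))) {
      for (int i = 0; i < 10_000; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"),
                      Bytes.toBytes("value-" + i));
        // Each mutation is buffered client-side and flushed to Bigtable in
        // large batches, so many row updates travel in a single request --
        // which is why the QPS chart undercounts rows written.
        mutator.mutate(put);
      }
    } // close() flushes any remaining buffered mutations.
  }
}
```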

The rule of thumb I use is ~3 Dataflow worker CPUs per Cloud Bigtable node when doing writes. With 8 CPUs and 3 Bigtable nodes, your job is very likely configured correctly. Given your description, I think your system is working about as efficiently as it can.
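
As a sketch of how you might apply that rule of thumb when launching a Beam pipeline on the Dataflow runner (the machine type and node count here are assumptions, not values from the question):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerSizing {
  public static void main(String[] args) {
    // Rule of thumb from above: ~3 Dataflow worker CPUs per Bigtable node.
    int bigtableNodes = 3;   // size of the cluster under test (assumed)
    int cpusPerWorker = 4;   // e.g. n1-standard-4 has 4 vCPUs (assumed)
    int targetCpus = bigtableNodes * 3;
    int numWorkers = (int) Math.ceil((double) targetCpus / cpusPerWorker);

    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setWorkerMachineType("n1-standard-4");
    options.setNumWorkers(numWorkers); // 9 CPUs needed -> 3 workers (12 CPUs)
    // ... build and run the write pipeline with these options ...
  }
}
```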

Solomon Duskis
  • Thanks for the above, helpful as always. Two additional observations, for others' sake: (1) I pre-split the table into several regions using a partial row key, following suggestions from [this post](http://stackoverflow.com/questions/39169327/populating-data-in-google-cloud-bigtable-is-taking-long-time) (a sketch of this is shown after these comments). This seems to cut the row-upload time in half. – VS_FF Jan 04 '17 at 14:19
  • (2) Most curiously, Bigtable throughput seems to be much higher than the console chart suggests. I wrote 100M rows in ~12 mins using 8 Dataflow workers (so 100M / 12 / 60 / 3 ≈ 45K writes/sec per node?) even though the chart never showed over 7K writes/sec. Maybe that's why CPU utilization was close to max at that stage. Also, using a plain (non-Dataflow) n1-standard-8 machine I performed 9M scan/write/delete operations on those rows in ~2 mins, so 9M / 2 / 60 / 3 = ~25K operations/sec per node. Again, the console chart never showed >7K operations/sec. – VS_FF Jan 04 '17 at 14:29
  • I physically traversed each row after both tests -- they all seem to be there. So the actual performance seems much better than those console charts suggest... – VS_FF Jan 04 '17 at 14:30
  • You are correct, sir. A single Scan returns many rows and is counted as 1 operation. Multiple writes/deletes via the BufferedMutator are combined into a single operation. Cloud Bigtable does quite well in terms of performance and needs to do a better job of presenting that performance. I'm not surprised by the ~45K writes/sec per node across the 3 nodes. A word of warning: 3 nodes turn out to be more efficient than 30 nodes in most cases for bulk writes; while you can still expect well above 10K writes/sec per node, you'd need a larger sample of Bigtable nodes in order to predict linear scalability. – Solomon Duskis Jan 04 '17 at 16:58
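
Regarding the pre-splitting mentioned in comment (1), here is a minimal sketch of creating a pre-split table through the HBase-compatible admin API; the table name, column family, and split points are hypothetical, chosen for row keys that begin with a hex digit:

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.util.Bytes;
import com.google.cloud.bigtable.hbase.BigtableConfiguration;

public class PreSplitExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical project/instance/table names for illustration.
    try (Connection connection =
            BigtableConfiguration.connect("my-project", "my-instance");
         Admin admin = connection.getAdmin()) {
      HTableDescriptor descriptor =
          new HTableDescriptor(TableName.valueOf("my-table"));
      descriptor.addFamily(new HColumnDescriptor("cf"));
      // Split points on a partial row-key prefix: assuming keys start with a
      // hex digit, these three splits spread the initial write load across
      // four ranges instead of funneling everything into one.
      byte[][] splits = {
          Bytes.toBytes("4"), Bytes.toBytes("8"), Bytes.toBytes("c")
      };
      admin.createTable(descriptor, splits);
    }
  }
}
```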