According to Agner Fog's instruction tables (http://www.agner.org/optimize/instruction_tables.pdf), the POPCNT instruction (which returns the number of set bits in a 32-bit or 64-bit register) has a throughput of 1 instruction per clock cycle on modern Intel and AMD processors. This is much faster than any software implementation, which needs multiple instructions (see "How to count the number of set bits in a 32-bit integer?").
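For comparison, here is a minimal sketch of one common multi-instruction software approach, the parallel-sum (SWAR) bit count. The function name `popcount32_swar` is made up for this illustration, and this is only one of several software techniques, not necessarily the one any particular compiler or library uses.

```c
#include <stdint.h>

/* Illustrative name; counts set bits with parallel partial sums (SWAR). */
uint32_t popcount32_swar(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* sums of each 2-bit pair */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* sums of each 4-bit group */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* sums of each byte */
    return (x * 0x01010101u) >> 24;                   /* add the four byte sums */
}
```

Even in this compact form the count takes roughly a dozen shift, mask, add and multiply operations, which is the gap the question is asking about.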

How is POPCNT implemented so efficiently in hardware?

Siqi Lin
  • This operation is also known as the *Hamming weight*. That may help you in your research. For example, see [Digital Hamming Weight and Distance Analyzers for Binary Vectors and Matrices (Sklyarov 2012)](http://www.ijicic.org/ijicic-12-12021.pdf). – Jonathon Reinhart Mar 02 '15 at 04:34
  • If something is implemented in hardware (not microcode), it should be fast. Anyway, you can easily achieve a high-speed pop count in software with a lookup table, if you have enough memory and cache. – phuclv Mar 02 '15 at 08:09
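The lookup-table approach mentioned in the comments could look roughly like the sketch below. It assumes a 256-entry table indexed by byte; the names `bits_in_byte` and `popcount32_lut` and the helper macros are invented for this example. The table size is a design choice: a larger table means fewer lookups but more cache pressure.

```c
#include <stdint.h>

/* Helper macros (illustrative) that generate the set-bit count for 0..255. */
#define B2(n) n, n + 1, n + 1, n + 2
#define B4(n) B2(n), B2(n + 1), B2(n + 1), B2(n + 2)
#define B6(n) B4(n), B4(n + 1), B4(n + 1), B4(n + 2)

/* bits_in_byte[i] holds the number of set bits in the byte value i. */
static const uint8_t bits_in_byte[256] = { B6(0), B6(1), B6(1), B6(2) };

/* Illustrative name; sums the table entries for the four bytes of x. */
uint32_t popcount32_lut(uint32_t x)
{
    return (uint32_t)bits_in_byte[x & 0xFFu]
         + bits_in_byte[(x >> 8) & 0xFFu]
         + bits_in_byte[(x >> 16) & 0xFFu]
         + bits_in_byte[x >> 24];
}
```

This still costs four loads plus shifts, masks and adds per 32-bit word, and its speed depends on the table staying in cache.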

1 Answer

There's a patent for combined popcnt, bit scan forward / reverse:

US8214414 B2 - Combined set bit count and detector logic

Abstract

A merged datapath for PopCount and BitScan is described. A hardware circuit includes a compressor tree utilized for a PopCount function, which is reused by a BitScan function (e.g., bit scan forward (BSF) or bit scan reverse (BSR)). Selector logic enables the compressor tree to operate on an input word for the PopCount or BitScan operation, based on a microprocessor instruction. The input word is encoded if a BitScan operation is selected. The compressor tree receives the input word, operates on the bits as though all bits have same level of significance (e.g., for an N-bit input word, the input word is treated as N one-bit inputs). The result of the compressor tree circuit is a binary value representing a number related to the operation performed (the number of set bits for PopCount, or the bit position of the first set bit encountered by scanning the input word).
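To make the compressor-tree idea concrete, here is a small software model. It is not the circuit from the patent; it only illustrates the general principle of treating an N-bit word as N one-bit inputs of equal weight and reducing them with 3:2 compressors (full adders). All names (`compress_3_2`, `popcount32_tree_model`, `WORD_BITS`, `COLUMNS`) are assumptions made for this sketch.

```c
#include <stdint.h>

#define WORD_BITS 32
#define COLUMNS   6   /* weights 2^0 .. 2^5; the maximum count 32 needs 6 bits */

/* 3:2 compressor (full adder): three bits of equal weight go in,
   one sum bit of the same weight and one carry bit of double weight come out. */
static void compress_3_2(unsigned a, unsigned b, unsigned c,
                         unsigned *sum, unsigned *carry)
{
    *sum   = a ^ b ^ c;
    *carry = (a & b) | (a & c) | (b & c);
}

/* Software model only: reduces each weight column until one bit remains. */
uint32_t popcount32_tree_model(uint32_t x)
{
    unsigned col[COLUMNS + 1][WORD_BITS]; /* bits waiting in each weight column */
    int n[COLUMNS + 1] = {0};             /* number of bits in each column */

    /* Every input bit starts in column 0: all bits have the same significance. */
    for (int i = 0; i < WORD_BITS; i++)
        col[0][n[0]++] = (x >> i) & 1u;

    for (int w = 0; w < COLUMNS; w++) {
        while (n[w] > 1) {
            unsigned a = col[w][--n[w]];
            unsigned b = col[w][--n[w]];
            unsigned c = (n[w] > 0) ? col[w][--n[w]] : 0u; /* pad with 0 if only two bits left */
            unsigned s, cy;
            compress_3_2(a, b, c, &s, &cy);
            col[w][n[w]++] = s;            /* sum stays at this weight */
            col[w + 1][n[w + 1]++] = cy;   /* carry moves to the next weight */
        }
    }

    /* The single remaining bit in each column is one bit of the binary count. */
    uint32_t result = 0;
    for (int w = 0; w < COLUMNS; w++)
        if (n[w] == 1)
            result |= (uint32_t)col[w][0] << w;
    return result;
}
```

The model runs the compressors one after another, whereas a hardware tree evaluates many of them side by side in a few logic levels. That parallelism is a plausible reason the instruction can sustain one result per cycle, though the exact circuit depth and layout are not something this sketch captures.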

Peter Cordes
rcgldr
  • While I'm not a fan of link-only answers, that's a pretty cool link. – Mysticial Mar 02 '15 at 06:44
  • Well the 9 images of schematics would have been difficult to post as an answer. Since it's a patent, everything is explained. – rcgldr Mar 02 '15 at 06:51
  • Wow, that explains why `popcnt` has a false dependency on the output register on Intel SnB-family. I figured it was just in the same class of uops, not that it really ran on the same path of the same execution unit as `bsr`/`bsf` (which need the destination as an input so they can leave it unmodified for the src=0 case.) Fun fact: Intel fixed the false dep for `tzcnt`/`lzcnt` in Skylake, but not for `popcnt`. – Peter Cordes Sep 22 '17 at 06:29
  • related: [Why does breaking the "output dependency" of LZCNT matter?](https://stackoverflow.com/a/47234390) – Peter Cordes Apr 01 '21 at 17:08
  • That is amazing that something so trivial can even be patented... –  Aug 13 '22 at 16:09