Your posted program is slow because of two things: formatting each value from its internal representation into human-readable text, and then outputting that text.
One optimization not mentioned is to buffer your formatted output, then output it. For example, write the formatted text to a buffer, then every 100 or so counts, print out the buffer using a block write. The objective is to reduce the number of output transactions and to make each transaction carry a larger amount of data. Basically, one output of 1024 characters will be faster than 1024 outputs of 1 character.
The output speed depends on the OS and other factors that are beyond your program's control. Your program sends the data to the OS for output and waits for the OS to complete the request. The completion time depends on task priorities and resource availability (at least). So if your program can count in milliseconds but the I/O takes seconds, you're out of luck, as no program optimization will help.