
What's the easiest efficient way to read from stdin and output every nth byte? I'd like a command-line utility that works on OS X, and would prefer to avoid compiled languages.

This Python script is fairly slow (25s for a 3GB file when n=100000000):

#!/usr/bin/env python
import sys
n = int(sys.argv[1])
while True:
    chunk = sys.stdin.read(n)   # read the next n bytes
    if not chunk:
        break
    sys.stdout.write(chunk[0])  # keep only the first byte of each chunk

Unfortunately we can't use sys.stdin.seek to avoid reading the entire file.

Edit: I'd like to optimize for the case when n is a significant fraction of the file size. For example, I often use this utility to sample 500 bytes at equally-spaced locations from a large file.

tba
  • How long does it take to just read the 3GB file on your system? (Making sure it's not in the disk cache when you time this.) – NPE Nov 08 '14 at 23:03
  • Reading the entire file is slow, but I'm interested in the case where n is large. For example, I'd like to sample 500 bytes from a binary file. – tba Nov 08 '14 at 23:12
  • That doesn't necessarily add much. For example, reading every 500th byte of a file on a magnetic disk would quite likely be just as slow as reading the entire file. – NPE Nov 08 '14 at 23:19
  • I'm on an SSD, so hopefully I can do better. Note that I only want to read 500 bytes at equally-spaced locations -- not every 500th byte. – tba Nov 08 '14 at 23:37
  • Does the file have to be fed through stdin? Can you access it directly instead? – NPE Nov 08 '14 at 23:53
  • stdin is preferred, but I'd consider a file-based approach if necessary. – tba Nov 09 '14 at 00:05
  • In which case, use a file and `seek()`. – NPE Nov 09 '14 at 00:06
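Following the suggestion in the comments, a seek-based sampler might look like the sketch below. The script name and interface (`path` plus a sample count `k`) are assumptions, not anything from the thread; the point is that only `k` one-byte reads are performed, regardless of file size.

```python
#!/usr/bin/env python
# Sketch: sample k bytes at equally-spaced offsets by seeking,
# instead of reading the whole file through stdin.
import sys

def sample_bytes(path, k):
    """Return up to k bytes taken at equally-spaced offsets of the file."""
    out = []
    with open(path, 'rb') as f:
        f.seek(0, 2)               # seek to the end to learn the file size
        size = f.tell()
        step = max(size // k, 1)
        for i in range(k):
            offset = i * step
            if offset >= size:
                break
            f.seek(offset)         # jump directly to the sample point
            b = f.read(1)
            if not b:
                break
            out.append(b)
    return b''.join(out)

if __name__ == '__main__' and len(sys.argv) > 2:
    # hypothetical usage: ./sample.py bigfile.bin 500
    sys.stdout.buffer.write(sample_bytes(sys.argv[1], int(sys.argv[2])))
```

On an SSD this should be far cheaper than streaming 3 GB, since the kernel only has to fault in the pages actually touched.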

1 Answer


NOTE: The OP changed the example n from 100 to 100000000, which effectively renders my code slower than his. Normally I would just delete my answer since it is no longer better than the original example, but it got a vote, so I will leave it as it is.

The only way I can think of to make it faster is to read everything at once and slice:

#!/usr/bin/env python
import sys
n = int(sys.argv[1])
data = sys.stdin.buffer.read()      # read the whole stream as bytes
sys.stdout.buffer.write(data[::n])  # slicing keeps every nth byte

Although, trying to fit a 3 GB file into RAM might be a very bad idea.
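If the data can come from a file rather than a pipe, one way around that caveat (not part of the answer above, just a sketch) is to memory-map the file and slice the map, so only the pages containing every nth byte are actually faulted in:

```python
#!/usr/bin/env python
# Sketch: mmap the file and use the same extended slice as above;
# the OS pages in only what the slice touches.
import mmap
import sys

def every_nth(path, n):
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[::n]   # extended slicing of an mmap returns bytes

if __name__ == '__main__' and len(sys.argv) > 2:
    # hypothetical usage: ./nth.py 100000000 bigfile.bin
    sys.stdout.buffer.write(every_nth(sys.argv[2], int(sys.argv[1])))
```

Note that this requires a real, non-empty file; it cannot be fed through stdin from a pipe.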

freeforall tousez