
I have a binary file that has some fields encoded as BCD (Binary Coded Decimal). Example as below.

14 75 26 58 87 7F (Raw bytes in hex format).

I am using (np.void, 6) to read and convert from binary file and below is the output I am getting.

b'\x14\x75\x26\x58\x87\x7F'

But I would like to get the output as '14752658877', without the fill character 'F' using numpy.

Below is the code:

    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if chunk:
                dt = np.dtype([('a','b'), ('b', '>i4'), ('c', 'S15'),('d', np.str, 7),
                               ('e', 'S7'), ('f', np.void, 6)])
                x = np.frombuffer(chunk, dtype=dt)
                print(x)
            else:
                break

Also, the input file contains many fixed-length binary records. What is an efficient way to convert them and store them as an ASCII file using numpy?

Raj KB
  • Show the example code. – Ricardo Branco Nov 21 '18 at 10:42
  • below is the code:with open (filename, "rb") as f: while True: chunk = f.read(chunksize) if (chunk): dt = np.dtype([('a','b'), ('b', '>i4'), ('c', 'S15'),('d', np.str, 7), ('e', 'S7'), ('f', np.void, 6)]) x = np.frombuffer (chunk, dtype=dt) print (x) else: break – Raj KB Nov 21 '18 at 16:24
  • Please edit your original question adding the above code keeping all the formatting and indentation. – Ricardo Branco Nov 21 '18 at 18:10
  • Hi Ricardo, I have edited the original question to add the code. – Raj KB Nov 21 '18 at 21:06
  • F is not a fill character. It is part of the hex value. – Matt Messersmith Nov 21 '18 at 21:10
  • @MattMessersmith: then you are stating that this is not a BCD value, but a simple hexadecimal value. According to the OP it is BCD, though. Using a non-BCD nibble as terminator makes perfect sense to me. – Jongware Nov 21 '18 at 21:52

1 Answer

I don't know if numpy can somehow accelerate this, but a specialized function can be quickly constructed:

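# Maps every valid BCD byte (0x00..0x99) to its two-digit decimal value.
fastDict = {16*(i//10)+(i%10):i for i in range(100)}

def bcdToInteger(bcd):
    result = 0
    # Consume whole bytes while both nibbles are valid decimal digits.
    while bcd and bcd[0] in fastDict:
        result *= 100
        result += fastDict[bcd[0]]
        bcd = bcd[1:]
    # The first byte that is not valid BCD may still carry one digit
    # in its high nibble (for example 0x7F carries a 7).
    if bcd and bcd[0] & 0xf0 <= 0x90:
        result *= 10
        result += bcd[0]>>4
        if bcd[0] & 0xf <= 9:
            result *= 10
            result += bcd[0] & 0x0f
    return result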

>>> print (bcdToInteger(b'\x14\x75\x26\x58\x87\x7F'))  # your sequence
14752658877
>>> print (bcdToInteger(b'\x12\x34\xA0'))   # first invalid nibble ends
1234
>>> print (bcdToInteger(b'\x00\x00\x99'))   # and so does an end of string
99
>>> print (bcdToInteger(b'\x1F'))           # a single nibble value
1

As long as you keep feeding it valid BCD bytes, it multiplies the result by 100 and adds the two new digits. Only the final byte needs some further inspection: if the high nibble is valid, the result thus far gets multiplied by 10 and that nibble gets added. If the low nibble is valid as well, this is repeated.

The fastDict is there to speed things up. It's a dictionary that returns the correct value for all 100 valid BCD bytes from 0x00 to 0x99, so the number of actual calculations is as small as possible. You can do without the dictionary, but that means you have to do the comparisons and calculations in the if block for every single byte.
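
For comparison, here is a minimal sketch of what the dictionary-free variant could look like. The name `bcd_to_int_no_dict` is made up for this illustration and is not part of the code above; it is meant to follow the same stopping rule (the first nibble outside 0-9 ends the number), at the cost of two nibble checks per byte.

```python
def bcd_to_int_no_dict(bcd: bytes) -> int:
    """Decode packed BCD, stopping at the first nibble that is not 0-9.

    Hypothetical helper for illustration only.
    """
    result = 0
    for byte in bcd:
        high, low = byte >> 4, byte & 0x0F
        if high > 9:      # invalid high nibble: terminator reached
            break
        result = result * 10 + high
        if low > 9:       # invalid low nibble: terminator reached
            break
        result = result * 10 + low
    return result


print(bcd_to_int_no_dict(b'\x14\x75\x26\x58\x87\x7F'))
```

Whether this is actually slower than the dictionary version would have to be measured on real data.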

Jongware
  • Thanks for the solution. I am using binascii.hexlify(bcdvalue).decode('utf-8').rstrip('f') to get the preferred result. But I am looking for a highly efficient solution, as I have many such columns. Our daily volume is nearly 1 billion records. – Raj KB Nov 22 '18 at 09:29
  • @RajKB: Well, my solution seems pretty efficient to me. As you can see in [some other](https://stackoverflow.com/q/53201592) [answers](https://stackoverflow.com/q/11668969), these use an expensive bit-shift-and-compare *twice per byte*; and my code avoids that. Yet even faster code could be written using a custom extension in C, but I'm not going to attempt that. – Jongware Nov 22 '18 at 10:40
  • Hi, I am trying your solution. But I got the below error while running the code. Can you please check. File "", line 7, in bcdToInteger if bcd and bcd[0] & 0xf0 <= 0x90: TypeError: unsupported operand type(s) for &: 'str' and 'int' – Raj KB Nov 22 '18 at 16:38
  • Please ignore the above comment. – Raj KB Nov 22 '18 at 16:46
  • Yes it did. Thanks a lot for your answer. – Raj KB Nov 23 '18 at 10:16
  • Completed. Thank you. – Raj KB Nov 25 '18 at 20:51
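
For the record volumes mentioned in the comments, it may be worth decoding a whole column at once with NumPy instead of calling a Python function per record. The following is only a sketch under some assumptions: the function name `bcd_field_to_strings` is hypothetical, the input is taken to be a 1-D array of fixed-width void items (such as `x['f']` from the question's structured array), and the output is digit strings rather than integers, so leading zeros are kept.

```python
import numpy as np


def bcd_field_to_strings(field: np.ndarray) -> np.ndarray:
    """Convert fixed-width packed-BCD fields to ASCII digit strings.

    Hypothetical helper; `field` is assumed to be a 1-D array with a
    void dtype such as 'V6'.
    """
    n = field.shape[0]
    width = field.dtype.itemsize
    # A field taken from a structured array is strided, so copy it
    # into contiguous memory before reinterpreting it as bytes.
    raw = np.ascontiguousarray(field).view(np.uint8).reshape(n, width)

    # Split every byte into its high and low nibble, in reading order.
    nibbles = np.empty((n, 2 * width), dtype=np.uint8)
    nibbles[:, 0::2] = raw >> 4
    nibbles[:, 1::2] = raw & 0x0F

    # Keep digits only up to the first nibble that is not 0-9.
    valid = np.logical_and.accumulate(nibbles <= 9, axis=1)

    # Valid nibbles become ASCII digits, everything else becomes NUL,
    # which the fixed-width 'S' dtype drops as trailing padding.
    chars = np.where(valid, nibbles + ord('0'), 0).astype(np.uint8)
    return chars.view(f'S{2 * width}').ravel()


data = b'\x14\x75\x26\x58\x87\x7F' + b'\x12\x34\xA0\x00\x00\x00'
print(bcd_field_to_strings(np.frombuffer(data, dtype='V6')))
```

The resulting byte strings could then be written out line by line to produce the ASCII file. No timings are claimed here; it would need to be benchmarked against the hexlify approach on real data.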