
I'm working with big file manipulation (over 2 GB) and I have a lot of processing functions to deal with the data. My problem is that it is taking a lot (A LOT) of time to finish the processing. Of all the functions, the one that seems to take the longest is this one:

    def BinLsb(data):
        Len = len(data)
        databin = [0] * Len
        num_of_bits = 8
        ### convert the octets to binary, LSB first
        for i in range(Len):
            newdatabin = bin(int(data[i], 16))[2:].zfill(num_of_bits)[::-1]
            databin[i] = newdatabin
        ### group into 14-bit chunks, LSB first again
        databin = ''.join(databin)
        composite_list = [databin[x:x + 14] for x in range(0, len(databin), 14)]
        LenComp = len(composite_list)
        for i in range(LenComp):
            composite_list[i] = int(composite_list[i][::-1], 2)
        return composite_list

I'd really appreciate some performance tips / another approach to this algorithm in order to save me some time. Thanks in advance!

Izalion
  • It would be good to have a complete runnable example. This also includes (random) input data of the right shape and dtype. Cython or Numba can help here, but with a complete rewrite of the function. – max9111 Sep 29 '20 at 12:22

2 Answers


You can hunt down performance issues by profiling the software, but you'll probably be best served by handing the heavy lifting to a faster language wrapped by Python. This could mean using a scientific library like numpy, using an FFI (foreign function interface), or creating and calling a custom program.
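
For instance, a quick way to confirm where the time actually goes is the standard-library profiler (a minimal sketch; `sample` here is just a placeholder name for a small, representative slice of your real input):

    import cProfile
    import pstats

    # run the hot function on a small sample and dump the stats to a file
    cProfile.run("BinLsb(sample)", "binlsb.prof")
    # show the 10 entries with the largest cumulative time
    pstats.Stats("binlsb.prof").sort_stats("cumulative").print_stats(10)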

More specifically, Python is natively very slow in computing terms, as each operation carries a lot of baggage with it (such as the infamous GIL). Passing this work off to another language lets you pay this overhead cost less often, rather than at every possible point in every loop!

Scientific libraries can do this for you by, at the very least,

  • behaving like ordinary Python logic (which is friendly for you!) while doing many known steps per call, rather than one at a time
  • possibly vectorizing operations (making better use of the processor by performing many actions in a single step), as in the sketch just below
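
For example, here is a minimal numpy sketch of the same transformation (assumptions: `data` is a sequence of hex strings, one octet each, as in the question, and numpy >= 1.17 for the `bitorder` argument; the name `bin_lsb_np` is just illustrative):

    import numpy as np

    def bin_lsb_np(data):
        # data: iterable of hex strings, one octet each, e.g. ['1a', 'ff', ...]
        octets = np.array([int(x, 16) for x in data], dtype=np.uint8)
        # one flat bit stream, LSB first within every octet
        bits = np.unpackbits(octets, bitorder='little')
        # zero-pad so the stream splits evenly into 14-bit groups
        pad = (-len(bits)) % 14
        bits = np.concatenate([bits, np.zeros(pad, dtype=np.uint8)])
        groups = bits.reshape(-1, 14)
        # within each group the first bit is the least significant
        weights = 2 ** np.arange(14, dtype=np.int64)
        # returns an ndarray; call .tolist() if a plain list is needed
        return (groups * weights).sum(axis=1)

If the hex data ultimately comes from a binary file, reading it straight into a uint8 array (e.g. with np.fromfile) would also avoid the per-octet int() calls in the comprehension above.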
ti7

Basic analysis of your function: time complexity is 3·O(n) and space complexity is 3·O(n), because you loop over the data three times. My suggestion is to loop only once and use a generator, which should cost roughly a third of the time and space.

I reworked your code into a generator and removed some unneeded variables:

def binLsb(data):
    databin = ""
    num_of_bits = 8
    for i in range(len(data)):
        # convert each hex octet to its 8-bit, LSB-first representation
        newdatabin = bin(int(data[i], 16))[2:].zfill(num_of_bits)[::-1]
        databin += str(newdatabin)
        # yield every complete 14-bit group as soon as it is available
        while len(databin) >= 14:
            yield int(databin[:14][::-1], 2)
            databin = databin[14:]
    # yield whatever bits (fewer than 14) are left at the end
    if databin:
        yield int(databin[::-1], 2)

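A usage sketch, assuming `data` is the same sequence of hex strings as in the question; since binLsb is now a generator, you either iterate over it lazily or materialize it with list():

    data = ['1a', 'ff', '03', '7c']   # placeholder input: hex octets as strings
    for value in binLsb(data):        # 14-bit groups are produced one at a time
        print(value)

    values = list(binLsb(data))       # or collect everything when a list is needed downstream
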
enjoy

Oliver

Han.Oliver
  • The problem is that since I need to regroup the bits after the first loop, I can't really see how I can reduce my loops... – Izalion Sep 29 '20 at 08:53
  • "will cost 1/3 of time and space"--actually while generators save on memory they don't provide higher speed--see [Generator speed in python 3](https://stackoverflow.com/questions/2399308/generator-speed-in-python-3). – DarrylG Sep 29 '20 at 09:15
  • @Izalion I upgraded some of the code; hope it gives you more hints – Han.Oliver Sep 29 '20 at 18:08
  • @DarrylG that's the typical textbook idea. – Han.Oliver Sep 29 '20 at 18:10
  • @Han.Oliver--1) Timing tests show generators are not faster (though they are more memory efficient). Actually, they used to be slower than list comprehensions, but speed improvements in Python 3 made them comparable, i.e. [From List Comprehensions to Generator Expressions](http://python-history.blogspot.com/2010/06/from-list-comprehensions-to-generator.html). 2) In the upgraded code, the line `databin += str(newdatabin)` is problematic since it has long been shown to be the slowest way of building up a string, i.e. [Efficient String Concatenation in Python](https://waymoot.org/home/python_string/). – DarrylG Sep 29 '20 at 21:45
  • @DarrylG less memory + less computation == faster. I won't argue with your points that list comprehensions are faster than generators or that string concatenation is very slow; the key idea is that this function can generate one item at a time instead of storing all items in a list. String concatenation is slow, but in this scenario databin is fewer than 14 bytes, plus newdatabin. Maybe you should look at why string concatenation is slow in Python: strings are immutable, so the statement `databin += str(newdatabin)` concatenates the strings and stores the result in a new memory location under the same name, databin. – Han.Oliver Sep 30 '20 at 07:52
  • @Han.Oliver--no point in arguing, but if you time this on actual data (e.g. I tried it on data of length 1 million), this solution is 50% slower than the original code. – DarrylG Sep 30 '20 at 09:40