
I have quite a large number of data sets to extend.

I'm wondering what would be an alternative/faster way of doing it.

I have tried both __iadd__ and extend; both of them take quite a while to produce output.

from timeit import timeit

raw_data = []
raw_data2 = []
added_data = list(range(100000))  # a list, so added_data * i works in Python 3

# .__iadd__
def test1():
    for i in range(10):
        raw_data.__iadd__(added_data * i)

# .extend
def test2():
    for i in range(10):
        raw_data2.extend(added_data * i)


print(timeit(test1, number=2))
print(timeit(test2, number=2))

I feel a list comprehension or array mapping could be the answer to my question ...
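For example, something along the lines of this flattening comprehension (just a sketch of what I mean):

added_data = list(range(100000))

# build the repeated list in one pass with a nested comprehension
raw_data = [x for _ in range(10) for x in added_data]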

Gооd_Mаn

2 Answers


If you need your data as a list, there is not much to gain: list.extend and __iadd__ are very close in performance, and depending on the amounts involved one or the other comes out fastest:

import timeit
from itertools import repeat, chain

added_data = range(100000)  # to verify the data, use range(5) and uncomment the prints

def iadd():
    raw_data = []
    for _ in range(10):
        raw_data.__iadd__(added_data)
    # print(raw_data)

def extend():
    raw_data = []
    for _ in range(10):
        raw_data.extend(added_data)
    # print(raw_data)

def tricked():
    raw_data = list(chain.from_iterable(repeat(added_data, 10)))
    # print(raw_data)

for name, func in (("__iadd__", iadd), ("  extend", extend), (" tricked", tricked)):
    print(name, end=" : ")
    print("{:.8f}".format(timeit.timeit(func, number=200)))

Output:

# number = 20
__iadd__ : 0.69766775
  extend : 0.69303196    # "fastest"
 tricked : 0.74638002


# number = 200
__iadd__ : 6.94286992    # "fastest"
  extend : 6.96098415
 tricked : 7.46355973

If you do not need an actual list, you might be better off consuming the generator chain.from_iterable(repeat(added_data, 10)) directly, without ever creating the list, to reduce the amount of memory used.
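For example, a minimal sketch of consuming the repeated data lazily; sum here is just a stand-in for whatever processing you actually do:

from itertools import chain, repeat

added_data = range(100000)

# iterate over the 10 repetitions lazily; no intermediate list is ever built
lazy_data = chain.from_iterable(repeat(added_data, 10))
total = sum(lazy_data)  # placeholder for your real processing
print(total)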


Patrick Artner

I'm unsure if there is a better way to do this, but using numpy and ctypes you can preallocate enough memory for the entire array and then use ctypes.memmove to copy data into raw_data, which is now a ctypes array of ctypes.c_long elements.

from timeit import timeit
import ctypes
import numpy

def test_iadd():
    raw_data = []
    added_data = range(1000000)

    for _ in range(10):
        raw_data.__iadd__(added_data)


def test_extend():
    raw_data = []
    added_data = range(1000000)

    for _ in range(10):
        raw_data.extend(added_data)


def test_memmove():
    added_data = numpy.arange(1000000)  # numpy equivalent of range
    # NOTE: this assumes numpy's default integer dtype has the same size as
    # ctypes.c_long on your platform; if not, use
    # numpy.arange(1000000, dtype=numpy.dtype(ctypes.c_long))

    # preallocate a ctypes array large enough for all ten sections
    raw_data = (ctypes.c_long * (len(added_data) * 10))()

    # the address to copy to
    raw_data_addr = ctypes.addressof(raw_data)
    # the length of added_data in bytes
    added_data_len = len(added_data) * ctypes.sizeof(ctypes.c_long)
    for _ in range(10):
        # copy the data for one section
        ctypes.memmove(raw_data_addr, added_data.ctypes.data, added_data_len)
        # advance the destination address past the section just copied
        raw_data_addr += added_data_len


tests = [test_iadd, test_extend, test_memmove]

for test in tests:
    print('{} {}'.format(test.__name__, timeit(test, number=5)))

This code produced the following results on my PC:

test_iadd 0.648954868317
test_extend 0.640357971191
test_memmove 0.201567173004

This appears to show that using ctypes.memmove is significantly faster.
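As an aside, if a numpy array is an acceptable final container, the same preallocate-and-copy idea can be written without ctypes at all; this is a sketch using numpy.tile, not a benchmarked claim:

import numpy

added_data = numpy.arange(1000000)

# tile allocates the full output once and copies added_data into it 10 times
raw_data = numpy.tile(added_data, 10)
assert len(raw_data) == 10 * len(added_data)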

Oli