I wrote a script that inserts the items of two lists into a third list, one pair after every 3 elements, but it takes a really long time to complete. Here are my two very long lists:

listOfX = ['567','765','456','457','546'....] len(383656)
listOfY = ['564','345','253','234','123'....] len(383656)

And here is the other list, which contains some data and to which I want to add the data from the two lists above:

cleanData = ['2020-04-28T01:44:59.392043', 'c57', '0', '2020-04-28T01:44:59.392043', 'c57', '1'....] len(1145146)

Here is what I want:

cleanData = ['2020-04-28T01:44:59.392043', 'c57', '0', 567, 564, '2020-04-28T01:44:59.392043', 'c57', '1', 765, 345]

Finally, here is my code:

  ## ADDING X AND Y TO ORIGINAL LIST
  addingValue = True
  valueItem = ""
  loopValue = 3
  xIndex = 0
  yIndex = 0
  print(len(listOfX))

  while addingValue:

    if xIndex > len(listOfX):
      break

    try:
      cleanData.insert(loopValue, listOfY[yIndex])
      cleanData.insert(loopValue, listOfX[xIndex])

    except IndexError:
      addingValue = False
      break

    xIndex += 1
    yIndex += 1
    loopValue += 5

Do you have any idea?

Jason Aller
  • How are you trying to merge the lists? Do you have code? – dawg May 08 '20 at 20:09
  • Can you add the code you wrote to the question please. – L.Clarkson May 08 '20 at 20:10
  • yes sure, wait im adding it –  May 08 '20 at 20:10
  • you don't show your code. so difficult to improve performance. – gelonida May 08 '20 at 20:10
  • You are going to have to show some code if you want help optimizing it. I would guess that the main issue is that you are loading huge lists into memory. This is going to be excruciatingly slow. You should instead try to use an iterator to load elements on demand, and consume output as you combine it instead of storing it all in memory, if possible. – shellster May 08 '20 at 20:11
  • I think the main issue is that you are inserting elements instead of constructing a new list. Do you really have to insert, or could you just create a new list? – gelonida May 08 '20 at 20:17
  • @gelonida Yes, I could create a new one, I will try –  May 08 '20 at 20:19
  • I see a problem here: `listOfX` and `listOfY` have 383656 items each, and `cleanData` has only 1145146, which is less than 3*383656; so, if you want to add one item from each of the first two lists after every group of 3 items in `cleanData`, you'll have unused elements left in `listOfX`, `listOfY`. Is that what you intended? – Błotosmętek May 08 '20 at 20:23
  • As mentioned above, insertion into an existing list is very expensive. It would be better to loop with `for i in range(len(...)):` and append elements to a new list or, as I mentioned in my previous comment, even better to consume each group of elements as you combine them instead of putting them back into a list. You could do that with a generator, as demonstrated here: https://realpython.com/introduction-to-python-generators/ – shellster May 08 '20 at 20:28

4 Answers

The main problem with your solution is that it inserts elements into an existing list 2 * 383656 times. On every insertion, all the elements after the insertion point have to be shifted by one position, so the total work grows roughly with the square of the list length.

Thus it's faster to create a new list.
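
To make the cost of the shifting visible, here is a minimal sketch that runs both approaches on a small synthetic data set. The function names, the variable names and the size n are made up for this illustration, and the timings will differ from machine to machine:

```python
import time

def merge_with_insert(data, xs, ys):
    """Insert x and y after every 3 items; each insert shifts the tail."""
    data = list(data)           # work on a copy
    pos = 3
    for x, y in zip(xs, ys):
        data.insert(pos, y)     # y first, then x at the same position,
        data.insert(pos, x)     # so the final order is x, y
        pos += 5
    return data

def merge_with_new_list(data, xs, ys):
    """Build the merged result in a fresh list using only appends."""
    out = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        out.extend(data[3 * i:3 * i + 3])
        out.append(x)
        out.append(y)
    return out

n = 5000  # hypothetical size, kept small so the slow version finishes quickly
data = [str(i) for i in range(3 * n)]
xs = list(range(n))
ys = list(range(n, 2 * n))

t0 = time.perf_counter()
slow = merge_with_insert(data, xs, ys)
t1 = time.perf_counter()
fast = merge_with_new_list(data, xs, ys)
t2 = time.perf_counter()

assert slow == fast             # both approaches give the same result
print("insert:   %.3fs" % (t1 - t0))
print("new list: %.3fs" % (t2 - t1))
```

The gap between the two timings should widen quickly as n grows, because the insert version does more shifting for every additional element.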

If for any reason you want cleanData to remain the same object, just with the new contents (perhaps because another function or object holds a reference to it and should see the changed data), then write

cleanData[:] = blablabla 

instead of

cleanData = blablabla
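
A small sketch of the difference, using a made-up three-element list and a second name (alias) that stands in for "another function / object" holding a reference:

```python
clean_data = ["a", "b", "c"]
alias = clean_data            # a second reference to the same list object

# Slice assignment replaces the contents in place.
clean_data[:] = ["a", "b", "c", 1, 2]
print(alias)                  # ['a', 'b', 'c', 1, 2]
print(alias is clean_data)    # True

# Plain assignment rebinds the name to a brand new list.
clean_data = ["x", "y"]
print(alias)                  # ['a', 'b', 'c', 1, 2]
print(alias is clean_data)    # False
```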

I wrote the following two solutions (the second, faster one only after the answer got accepted):

import functools
import operator
cleanData = functools.reduce(
    operator.iconcat,
    (list(v) for v in zip(*([iter(cleanData)] * 3), listOfX, listOfY)),
    [])

and

import itertools
cleanData = list(itertools.chain.from_iterable(
    (v for v in zip(*([iter(cleanData)] * 3), listOfX, listOfY)),
    ))

To understand the zip(*([iter(cleanData)] * 3), listOfX, listOfY) construct, you might look at the question "what is meaning of [iter(list)]*2 in python?"
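
As a rough illustration of that construct on a tiny made-up data set (the shortened strings are placeholders, not real timestamps):

```python
data = ["t0", "c57", "0", "t1", "c57", "1"]
xs = [567, 765]
ys = [564, 345]

it = iter(data)
# [it] * 3 is a list that holds the SAME iterator three times, so zip pulls
# three consecutive items from it for every tuple it builds.
# zip(it, it, it, xs, ys) is what zip(*([it] * 3), xs, ys) expands to.
rows = list(zip(it, it, it, xs, ys))
print(rows)
# [('t0', 'c57', '0', 567, 564), ('t1', 'c57', '1', 765, 345)]
```

Note that zip stops at the shortest input, so a trailing incomplete group or unused items in the longer lists are silently dropped.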

A potential downside of my first solution, depending on the context: functools.reduce with operator.iconcat always builds a complete list; it cannot give you a generator.

The second solution also returns a list. If you want a generator instead, just remove list( and one trailing ) and it will be a generator.
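
A minimal sketch of the lazy variant, again on made-up sample data. The names merged and first_five are only for illustration, and itertools.islice is used here just to peek at the first few items:

```python
import itertools

clean_data = ["t0", "c57", "0", "t1", "c57", "1"]
list_of_x = [567, 765]
list_of_y = [564, 345]

# No list(...) around the chain: nothing is merged until items are requested.
merged = itertools.chain.from_iterable(
    zip(*([iter(clean_data)] * 3), list_of_x, list_of_y))

first_five = list(itertools.islice(merged, 5))
print(first_five)       # ['t0', 'c57', '0', 567, 564]
print(next(merged))     # 't1' (the iterator continues where it stopped)
```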

The second solution is about 2x faster than the first one.

Then I wrote some code to compare the performance and results of the two other solutions and mine.

The difference is not huge, but in the run shown below my second solution was roughly 2.4x faster than Alain T.'s solution and roughly 3.4x faster than @Błotosmętek's solution:

from contextlib import contextmanager
import functools
import itertools
import operator
import time

@contextmanager
def measuretime(comment):
    """Print the wall-clock time spent inside the with block."""
    print("=" * 76)
    t0 = time.perf_counter()
    yield comment
    print("%s: %5.3fs" % (comment, time.perf_counter() - t0))
    print("-" * 76 + "\n")


N = 383656
with measuretime("create listOfX"):
    listOfX = list(range(N))

with measuretime("create listOfY"):
    listOfY = list(range(1000000, 1000000 + N))

print("listOfX", len(listOfX), listOfX[:10])
print("listOfY", len(listOfY), listOfY[:10])

with measuretime("create cleanData"):
    origCleanData = functools.reduce(
        operator.iconcat,
        (["2020-010-1T01:00:00.%06d" % i, "c%d" % i, "%d" %i] for i in range(N)),
        [])

print("cleanData", len(origCleanData), origCleanData[:12])

cleanData = list(origCleanData)
with measuretime("funct.reduce operator icat + zip"):
    newcd1 = functools.reduce(
        operator.iconcat,
        (list(v) for v in zip(*([iter(cleanData)] * 3), listOfX, listOfY)),
        [])

print("NEW", len(newcd1), newcd1[:3*10])

cleanData = list(origCleanData)
with measuretime("itertools.chain + zip"):
    cleanData = list(itertools.chain.from_iterable(
        (v for v in zip(*([iter(cleanData)] * 3), listOfX, listOfY)),
        ))

print("NEW", len(cleanData), cleanData[:3*10])
assert newcd1 == cleanData

cleanData = list(origCleanData)
with measuretime("blotosmetek"):
    tmp = []
    n = min(len(listOfX), len(listOfY), len(cleanData)//3)
    for i in range(n):
       tmp.extend(cleanData[3*i : 3*i+3])
       tmp.append(listOfX[i])
       tmp.append(listOfY[i])
    cleanData = tmp

print("NEW", len(cleanData), cleanData[:3*10])
assert newcd1 == cleanData


cleanData = list(origCleanData)
with measuretime("alainT"):
    cleanData = [ v for i,x,y in zip(range(0,len(cleanData),3),listOfX,listOfY)
                for v in (*cleanData[i:i+3],x,y) ]

print("NEW", len(cleanData), cleanData[:3*10])
assert newcd1 == cleanData


Output on my old PC looks like:

============================================================================
create listOfX: 0.013s
----------------------------------------------------------------------------

============================================================================
create listOfY: 0.013s
----------------------------------------------------------------------------

listOfX 383656 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
listOfY 383656 [1000000, 1000001, 1000002, 1000003, 1000004, 1000005, 1000006, 1000007, 1000008, 1000009]
============================================================================
create cleanData: 0.454s
----------------------------------------------------------------------------

cleanData 1150968 ['2020-010-1T01:00:00.000000', 'c0', '0', '2020-010-1T01:00:00.000001', 'c1', '1', '2020-010-1T01:00:00.000002', 'c2', '2', '2020-010-1T01:00:00.000003', 'c3', '3']
============================================================================
funct.reduce operator icat + zip: 0.240s
----------------------------------------------------------------------------

NEW 1918280 ['2020-010-1T01:00:00.000000', 'c0', '0', 0, 1000000, '2020-010-1T01:00:00.000001', 'c1', '1', 1, 1000001, '2020-010-1T01:00:00.000002', 'c2', '2', 2, 1000002, '2020-010-1T01:00:00.000003', 'c3', '3', 3, 1000003, '2020-010-1T01:00:00.000004', 'c4', '4', 4, 1000004, '2020-010-1T01:00:00.000005', 'c5', '5', 5, 1000005]
============================================================================
itertools.chain + zip: 0.109s
----------------------------------------------------------------------------

NEW 1918280 ['2020-010-1T01:00:00.000000', 'c0', '0', 0, 1000000, '2020-010-1T01:00:00.000001', 'c1', '1', 1, 1000001, '2020-010-1T01:00:00.000002', 'c2', '2', 2, 1000002, '2020-010-1T01:00:00.000003', 'c3', '3', 3, 1000003, '2020-010-1T01:00:00.000004', 'c4', '4', 4, 1000004, '2020-010-1T01:00:00.000005', 'c5', '5', 5, 1000005]
============================================================================
blotosmetek: 0.370s
----------------------------------------------------------------------------

NEW 1918280 ['2020-010-1T01:00:00.000000', 'c0', '0', 0, 1000000, '2020-010-1T01:00:00.000001', 'c1', '1', 1, 1000001, '2020-010-1T01:00:00.000002', 'c2', '2', 2, 1000002, '2020-010-1T01:00:00.000003', 'c3', '3', 3, 1000003, '2020-010-1T01:00:00.000004', 'c4', '4', 4, 1000004, '2020-010-1T01:00:00.000005', 'c5', '5', 5, 1000005]
============================================================================
alainT: 0.258s
----------------------------------------------------------------------------

NEW 1918280 ['2020-010-1T01:00:00.000000', 'c0', '0', 0, 1000000, '2020-010-1T01:00:00.000001', 'c1', '1', 1, 1000001, '2020-010-1T01:00:00.000002', 'c2', '2', 2, 1000002, '2020-010-1T01:00:00.000003', 'c3', '3', 3, 1000003, '2020-010-1T01:00:00.000004', 'c4', '4', 4, 1000004, '2020-010-1T01:00:00.000005', 'c5', '5', 5, 1000005]

gelonida

This is an implementation of shellster's suggestion:

tmp = []
n = min(len(listOfX), len(listOfY), len(cleanData)//3)
for i in range(n):
   tmp.extend(cleanData[3*i : 3*i+3])
   tmp.append(listOfX[i])
   tmp.append(listOfY[i])
cleanData = tmp
Błotosmętek

This should be much faster:

cleanData = [ v for i,x,y in zip(range(0,len(cleanData),3),listOfX,listOfY) 
                for v in (*cleanData[i:i+3],x,y) ]

If you use parentheses instead of brackets, the expression becomes a generator that you can iterate through (e.g. with a for loop) to get the merged data without actually creating a copy in a new list.
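
A sketch of the generator form on a tiny made-up data set. One detail to watch: only the outermost iterable of a generator expression (the zip here) is evaluated immediately, while the inner slice of the source list is looked up later, during iteration. The sketch therefore binds the generator to a new name (merged, chosen for this example) instead of reusing the name of the source list:

```python
clean_data = ["t0", "c57", "0", "t1", "c57", "1"]
list_of_x = ["567", "765"]
list_of_y = ["564", "345"]

# Parentheses instead of brackets: a lazy generator, not a list.
merged = (v for i, x, y in zip(range(0, len(clean_data), 3), list_of_x, list_of_y)
          for v in (*clean_data[i:i + 3], x, y))

count = 0
for value in merged:    # items are produced one at a time
    count += 1
print(count)            # 10
```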

Alain T.

Building on Błotosmętek's answer, with a generator you would do something like this:

def get_next_group():
    n = min(len(listOfX), len(listOfY), len(cleanData)//3)
    for i in range(n):
        tmp = cleanData[3*i : 3*i+3]
        tmp.append(listOfX[i])
        tmp.append(listOfY[i])

        yield tmp

# in your main code:

for x in get_next_group():
    #do something with x
    pass

The advantage of the above code is that the combination is only done piece by piece, as you request it. If you process each group as it arrives and don't store the results in a list, memory overhead is reduced, and you can start working on the first groups right away instead of waiting for everything to be combined first.

shellster