
This is my first question here. I started learning Python a few days ago and I have a problem.

I made some Python files, each of which runs a for loop and appends its results to a list. So each file has its own list.

For example, file1.py produces list1, file2.py produces list2, and so on.

My goal is to combine all these lists, so I made a separate main.py file that imports the lists and combines them like this:

```python
from file1 import list1
from file2 import list2
from file3 import list3

combined_lists = [*list1, *list2, *list3]
```

That works fine, as expected.

But the problem is that this method is very slow, because it imports the lists one by one, serially, in the order they are imported.

For example, when I run it, it first imports list1; when list1 is completed it starts list2, then list3, and so on, and finally combines them.

So, because I have 400 lists in 400 different files, this takes a very long time.

Is there any way to import and combine all the lists in parallel?

Like with multi-threading or any other method?

Note that I don't care about the order of the items in the combined list.

mezca
  • This doesn't _exactly_ give you what you're asking for, but it might be worth exploring. Take a look at `itertools.chain`: https://docs.python.org/3/library/itertools.html It really depends what you're doing with your list after and if you actually _need_ a list, but this approach will give you an iterator of all the lists chained together without loading the whole thing into memory. – wholevinski Oct 01 '18 at 17:50
  • Possible duplicate of [read multiple files using multiprocessing](https://stackoverflow.com/questions/2068645/read-multiple-files-using-multiprocessing) – stovfl Oct 01 '18 at 19:59
  • @wholevinski I tried with itertools, but it was still serial. I did not find a way to make them run in parallel. – mezca Oct 02 '18 at 09:32
  • I guess my suggestion more depends on: what are you doing with the big list you're building after? The `itertools.chain` suggestion is more to prevent actually building that list in the first place, then iterating over it when you're trying to use it (which I'm assuming you're going to do at some point anyway). – wholevinski Oct 02 '18 at 10:04
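
A minimal sketch of the `itertools.chain` idea from the comment above, using the same file1/list1 names as in the question; note that it does not make the imports parallel, it only avoids copying everything into one big combined list:

```python
from itertools import chain

from file1 import list1
from file2 import list2
from file3 import list3

# Iterate over all the lists back to back without building a new combined list.
for item in chain(list1, list2, list3):
    print(item)  # stand-in for whatever you actually do with each item
```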

1 Answer


You could spawn multiple reader processes (via a Pool, preferably) that feed a Queue, with a single consumer that reads from it. You can do this with threading as well; some relevant sample code can be found here.

Note that in this case the consumer probably should not collect the results into a single list, but rather it should run the actual operation you want to perform on each element as they come out of the queue.
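
For concreteness, here is a minimal sketch of that producer/consumer setup, not drop-in code for your project: it assumes modules named file1 .. file400 that each build a module-level list1 .. list400 at import time (as in the question), and a made-up handle() standing in for the real per-item work.

```python
import importlib
from multiprocessing import Pool, Manager
from queue import Empty


def handle(item):
    print(item)  # placeholder for the real per-item processing


def load_one(args):
    """Worker: import one module and push its list's items onto the queue."""
    index, queue = args
    module = importlib.import_module(f"file{index}")
    for item in getattr(module, f"list{index}"):
        queue.put(item)


def main():
    manager = Manager()
    queue = manager.Queue()  # a plain multiprocessing.Queue can't be handed to pool workers
    jobs = [(i, queue) for i in range(1, 401)]

    with Pool() as pool:
        result = pool.map_async(load_one, jobs)
        # Single consumer: drain the queue while the workers keep producing.
        while not (result.ready() and queue.empty()):
            try:
                handle(queue.get(timeout=0.1))
            except Empty:
                continue
        result.get()  # re-raise any error that happened in a worker


if __name__ == "__main__":
    main()
```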

However...

I made some Python files, each of which runs a for loop and appends its results to a list. So each file has its own list.

Why? It sounds like this is way more complicated than it should be, but without knowing what you're actually trying to accomplish, it's impossible to say for sure.

Without more information, if you have this volume of data to deal with, it sounds like your scripts should be generating CSV files (or they should be combined into a single script that generates a single CSV file). Even using an RDBMS might be a better idea than regenerating these data sets every time they're imported, unless they change very often.
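
A rough sketch of that CSV idea; the one-column layout and the list1.csv-style file names are assumptions, not anything from your setup:

```python
import csv
import glob

# In each generator script, after its loop has built the list (list1 here),
# dump the result once instead of recomputing it on every import:
# with open("list1.csv", "w", newline="") as f:
#     csv.writer(f).writerows([item] for item in list1)

# In main.py, combine whatever dumps are present without re-running the loops:
combined = []
for path in glob.glob("list*.csv"):
    with open(path, newline="") as f:
        combined.extend(row[0] for row in csv.reader(f))
```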

kungphu
  • I have them separated in different files to keep them organized, because every loop in every file is different. Yes, the data sets change very often; every day the lists that the loops produce are different. In my main file, when all lists are combined, I export the combined list as JSON. Do you think it would be better for each file to produce a JSON file instead of a list, and then have my main file read and combine all the JSON files together? – mezca Oct 02 '18 at 09:43
  • They don't need to be in different files; you could write them as functions, which would have the advantage of being able to import them ahead of time without having them immediately execute. While JSON is a better format, re-listing the column names for each object makes it much larger; if you decide to dump to files first, I would go with CSV. Regarding what's a better approach, if you're going to have to re-run these a lot anyway, having the intermediate step doesn't seem like much of a benefit. – kungphu Oct 03 '18 at 00:32
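
A minimal sketch of the "functions instead of separate files" suggestion from the comment above; the loop bodies are made up and only stand in for each file's real loop:

```python
# builders.py - each former file becomes a function; importing this module
# costs nothing, because the loops only run when the functions are called.
def build_list1():
    return [n * n for n in range(10)]  # made-up loop body

def build_list2():
    return [n + 1 for n in range(10)]  # made-up loop body

ALL_BUILDERS = [build_list1, build_list2]

# main.py
combined = [item for build in ALL_BUILDERS for item in build()]
```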