0

I currently have a for loop which is finding and storing combinations in a list. The possible combinations are very large and I need to be able to access the combos.

can I use an empty relational db like SQLite to store my list on a disk instead of using list = []?

Essentially what I am asking is whether there is a db equivalent to list = [] that I can use to store the combinations generated via my script?

Edit:

SQLlite is not a must. Any will work if it can accomplish my task.

Here is the exact function that is causing me so much trouble. Maybe there is a better solution in general.

Idea - Could I insert the list into the database on each loop and then empty the list? Basically, create a list on each loop, send that list to PostgreSQL and then empty the list in the python to keep the RAM usage down?

def permute(set1, set2):
    set1_combos = list(combinations(set1, 2))
    set2_combos = list(combinations(set2, 8))
    full_sets = []
    for i in set1_combos:
        for j in set2_combos:
            full_sets.append(i + j)
    return full_sets
JF44
  • 67
  • 7
  • Yes, you can use a DB to store your data - that's what DBMS (relational or not) are for, after all :). But depending on what your data are, and how you need to access them, there may be different options available. Just one warning: don't store your list of combinations as a single column in a single row - that will be painfully slow and hard to work with. Instead, store each combination as a separate row (and probably define a primary key on it) – gimix Jun 09 '22 at 07:56
  • Can I store it directly into a DBMS? Like can I have my DBMS set up to essentially be the empty list which my script "feeds" the combinations for later use? Basically, how can I store the combinations and avoid them going into RAM and causing a memory error. – JF44 Jun 09 '22 at 12:20
  • If you want a non-generic answer, please post some sample data and some more information. Where do you get the values from, how do you create the combinations, are they strings, numbers or what, how do you want to access your combinations, ... – gimix Jun 09 '22 at 13:04
  • edited OP to include detail. – JF44 Jun 09 '22 at 13:12

2 Answers2

1

Ok, a few ideas

My first thought was, why do you explode the combinations objects in lists? But of course, since we have two nested for loops, the iterator in the inner loop is consumed at the first iteration of the outer loop if it is not converted to a list.

However, you don't need to explode both objects: you can explode just the smaller one. For instance, if both our sets are made of 50 elements, the combinations of 2 elements are 1225 with a memsize (if the items are integers) of about 120 bytes each, i.e. 147KB, while the combinations of 8 elements are 5.36e+08 with a memsize of about 336 bytes, i.e. 180GB. So the first thing is, keep the larger combo set as a combinations object and iterate over it in the outer loop. By the way, this will also be really faster.

Now the database part. I assume a relational DBMS, be it SQLite or anything.

You want to create a table with a single column defined. Each row of your table will contain one final combination. Instead of appending each combination to a list, you will insert it in the table.

Now the question is, how do you need to access the data you created? Do you just need to iterate over the final combos sequentially, or do you need to query them, for instance finding all the combos which contain one specific value?

In the latter case, you'll want to define your column as the Primay Key, so your queries will be efficient; otherwise, you will save space on disk using an auto incrementing integer as the PK (SQLite will create it for you if you don't explicitly define a PK, and so will do a few other DMBS as well).

One final note: the insert phase may be painfully slow if you don't take some specific measures: check this very interesting SO post for details. In short, with a few optimizations they were able to pass from 85 to over 96K insert per second.

EDIT: iterating over the saved data

Once we have the data in the DB, iterating over them could be as simple as:

mycursor.execute('SELECT * FROM <table> WHERE <conditions>')
for combo in mycursor.fetchall():
    print(combo) #or do what you need

But if your conditions don't filter away most of the rows you will meet the same memory issue we started with. A first step could be using fetchmany() or even fetchone() instead of fetchall() but still you may have a problem with the size of the query result set.

So you will probably need to read from the DB a chunk of data at a time, exploiting the LIMIT and OFFSET parameters in your SELECT. The final result may be something like:

chunck_size = 1000 #or whatever number fits your case
chunk_count = 0
chunk = mycursor.execute(f'SELECT * from <table> WHERE <conditions> LIMIT {chunk_size} ORDER BY <primarykey>'}
while chunk:
    for combo in mycursor.fetchall():
        print(combo) #or do what you need
    chunk_count += 1
    chunk = mycursor.execute(f'SELECT * from <table> WHERE <conditions> ORDER BY <primarykey>' OFFSET {chunk_size * chunk_count} LIMIT {chunk_size}}

Note that you will usually need the ORDER BY clause to ensure rows are returned as you expect them, and not in a random manner.

gimix
  • 3,431
  • 2
  • 5
  • 21
  • Thank you for such a thorough answer! Optimizing my loop based on what you said was a complete game-changer. I am able to do 5x as many iterations now! – JF44 Jun 10 '22 at 15:09
  • Also, yes I just need to iterate over the final combos sequentially is that easy? – JF44 Jun 10 '22 at 15:34
0

I don't believe SQLite has a built in array data type. Other DBMSs, such as PostgreSQL, do.

For SQLite, a good recommendation by another user on this site to obtain an array in SQLite can be found here: How to store array in one column in Sqlite3?

Another solution can be found: https://sqlite.org/forum/info/99a33767e8a07e59

In either case, yes it is possible to have a DBMS like SQLite store an array (list) type. However, it may require a little setup depending on the DBMS.

Edit: If you're having memory issues, have you thought about storing your data as a string and accessing the portions of the string you need when you need it?

zNaCly
  • 33
  • 1
  • 5