I think this other post is exactly what I want to do: Python multiprocessing pool.map for multiple arguments.

Here is how I am trying to implement it, in pseudocode:

(this is called by another function in my code)

def find_similar(db, num, listofsets):
    # db is the sqlite3 database connection
    # num is a variable I need for the SQL query
    # listofsets is a list of sets; each set is a set of strings
    threshold = 0.49
    similar_db_rows = []

    for row in db.execute("SELECT thing1, thing2, thing3 FROM table WHERE num != {n};".format(n=num)):
        # thing3 (row[2], since the query selects three columns) is a long
        # string, each value separated by a comma
        items = set(row[2].strip().split(','))
        for set_item in listofsets:
            sim_score = sim_function(set_item, items)
            if sim_score < threshold:
                similar_db_rows.append(row)
    return similar_db_rows

def sim_function(x, y):
    # x is a set, and y is a second set.  The function does some calculation
    # and comparing, then returns a float value
    return float_value

This works. What I was trying to do was use multiprocessing on the second for loop. Iterating over each set and calling the function one at a time is a major bottleneck, since my list of sets can be large. Instead, I wanted multiprocessing to call the function for many of these sets at a time, passing each set along with the second, constant argument built from the SQL query, and collect the returned float from each call into a list. Once all of the sets have been processed, I can check whether any item in that list meets the threshold.

I tried both the func_star approach with pool.map(func_star, itertools.izip(a_args, itertools.repeat(second_arg))) by Sebastian, AND the `parmap` approach by zeehio. But for me, if I had for example 30 sets in the list, it returned a list of 30 results over and over; on each return it would check the similarity threshold and append rows, never breaking out of this, and I ended up control-Z'ing the whole thing.

Below is an example of what I attempted, first using parmap:

def find_similar(db, num, listofsets):
    # db is the sqlite3 database connection
    # num is a variable I need for the SQL query
    # listofsets is a list of sets; each set is a set of strings
    threshold = 0.49
    list_process_results = []
    similar_db_rows = []

    for row in db.execute("SELECT thing1, thing2, thing3 FROM table WHERE num != {n};".format(n=num)):
        items = set(row[2].strip().split(','))
        list_process_results = parmap.starmap(sim_function, zip(listofsets), items)
        print list_process_results

        if any(t < threshold for t in list_process_results):
            # print "appending a row"
            similar_db_rows.append(row)
    return similar_db_rows

and second using func_star:

from multiprocessing import Pool
import itertools

def func_star(a_b):
    """Convert `f([1,2])` to `f(1,2)` call."""
    return sim_function(*a_b)

def find_similar(db, num, listofsets):
    # db is the sqlite3 database connection
    # num is a variable I need for the SQL query
    # listofsets is a list of sets; each set is a set of strings
    pool = Pool()
    threshold = 0.49
    list_process_results = []
    similar_db_rows = []

    for row in db.execute("SELECT thing1, thing2, thing3 FROM table WHERE num != {n};".format(n=num)):
        items = set(row[2].strip().split(','))
        list_process_results = pool.map(func_star, itertools.izip(listofsets, itertools.repeat(items)))
        print list_process_results

        if any(t < threshold for t in list_process_results):
            # print "appending a row"
            similar_db_rows.append(row)
    return similar_db_rows

The same thing happens with both: it goes on forever, repeatedly returning a list of the length I am expecting (with a different set of values each time) and "appending a row", never breaking out.

Thanks for the help!!! As a bonus, I would also like to use multiprocessing on the results of the row query (the outer loop), but I will first conquer the inner loop.

To answer dano's question about find_similar(): I have another function that contains a for loop. Each iteration of that loop calls find_similar. When the resulting list is returned from find_similar, it prints the length of the returned list, finishes the remainder of the loop body, and goes on to the next element. After that for loop is finished, the function is over, and find_similar is not called again.

KBA
  • What platform is this on? – dano Aug 21 '14 at 18:27
  • I am using OSX `Python 2.7.8 |Anaconda 2.0.0 (x86_64)| (default, Jul 2 2014, 15:36:00) [GCC 4.2.1 (Apple Inc. build 5577)] on darwin Type "help", "copyright", "credits" or "license" for more information. Anaconda is brought to you by Continuum Analytics. Please check out: http://continuum.io/thanks and https://binstar.org` – KBA Aug 21 '14 at 18:29
  • `sim_function(set, set(row[3].strip().split(','))) ` this line is confusing. you're doing `for set in listofsets:`. So `set` is the name of the current entry in `listofsets`, but it also looks like you're trying to use `set` as a constructor in the second argument to `sim_function` (`set(row[3]...)`). Should it really be `for set_ in listofsets: sim_function(set_, set(row[3]...)`? – dano Aug 21 '14 at 18:35
  • @dano sorry for the confusion - I have re-edited the above. Yes you are right, that is what i am doing. I am going through the entries in the list (each is a set_, or set_item) and then the set constructor on the 2nd argument to make that list a set. The `sim_function` takes 2 sets as input. – KBA Aug 21 '14 at 18:43
  • is it possible that sim_function calls find_similar? That would account for the looping. – cs_alumnus Aug 21 '14 at 19:12
  • @kzams no, sim_function does not call find_similiar - it does some calculations on the 2 sets that it gets and returns a float – KBA Aug 21 '14 at 19:44

1 Answer


Here's a slightly nicer looking version of what you're trying to do, using functools.partial instead of izip/repeat/func_star.

from functools import partial
from multiprocessing import Pool

def sim_function(row_set, set_from_listofsets): # Note that the arguments are reversed from what you had before
    pass

def find_similar(db, num, listofsets):
    pool = Pool()
    threshold = 0.49

    similar_db_rows = []
    for row in db.execute("SELECT thing1, thing2, thing3 FROM table WHERE num != {n};".format(n=num)):
        func = partial(sim_function, set(row[2].strip().split(',')))
        list_process_results = pool.map(func, listofsets)
        print list_process_results

        if any(t < threshold for t in list_process_results):
            # print "appending a row"
            similar_db_rows.append(row)
    pool.close()
    pool.join()
    return similar_db_rows

The behavior you're describing is odd, though. I don't see why either of the versions you had before would end up running in an infinite loop, especially on a non-Windows platform.

dano
  • thanks for that. I just tried it. What is happening now is, for example, pass a list of 33 sets, I get a list of 33 results. The print of the list is in the infinite loop, but each list being printed has different values. It never gets to the "appending a row" statement. the `sim_function` can get the 2 sets in any order. – KBA Aug 21 '14 at 18:55
  • @KBA Is it possible that `find_similar` is being called repeatedly? – dano Aug 21 '14 at 18:57
  • I have another function that has a for loop. Each iteration of this for loop calls `find_similar`. When the resulting list is returned from `find_similar`, it prints the length of the list return, it then proceeds to finish the remainder of the loop, and go to the next for element. After this for loop is finished, the function is over, and `find_similiar` is not called again. – KBA Aug 21 '14 at 19:11
  • One thing about my previous reply: it actually infinite loops on both printing the `list_process_results` and `print "appending a row"` – KBA Aug 21 '14 at 19:14
  • @KBA Is `find_similar` being called repeatedly when it infinite loops? – dano Aug 21 '14 at 19:21
  • @KBA Also, what is the value of `row` during the infinite looping? I'm trying to understand how a `for` loop could be looping infinitely. – dano Aug 21 '14 at 19:30
  • No, `find_similar` is not being called repeatedly in the infinite loop. The way I know this is because the function that calls `find_similar` in its for loop also does print statements to show the progress of the code in the loop, and to the next loop; none of these things are being printed at all. All that is printing on the screen is the list, repeatedly. – KBA Aug 21 '14 at 19:31
  • row is `` now that you mention this, it looks like it is returning a list for each iteration of the `for row in db.execute` before moving on (if it could move on, there are 3000 rows in the database possible results). Ideally, I wanted to not go to the next row until all of the set items were processed. IF there is not a way to deal with this, I could re-write to `fetchall' rows instead of iterate each row result, then for each list of sets, go through each list of rows.. then maybe the multiprocess result is a yes or no index of which rows to keep – KBA Aug 21 '14 at 19:39
  • @KBA Is there a way for you to actually uniquely identify each row when you print it out? Maybe just by printing the value of row[0] or whatever element in the `row` list has a unique value? – dano Aug 21 '14 at 19:44
  • yes, I am printing row[0] (each is unique in the db) and the `list_process_results` What I see now printing is the unique row[0] and list of 33 floats. it goes through all rows in the db (it really should be grabbing a subset or rows). problem is it is appending every row, when actually the condition does not hold true for all rows calculations (I am comparing to my results before multiprocess) – KBA Aug 21 '14 at 20:01
  • @KBA Hmm, well, the fact that your DB query is returning *all* the rows seems to be unrelated to the use of `multiprocessing`, right? As for every row being appended, do the contents of `list_process_results` look correct for each row? Your non-`multiprocessing` code actually looks like it should append the same `row` to `similar_db_rows` multiple times, assuming `sim_score < threshold` is `True` for multiple `set_items` in a `row`. So I would have actually expected the `multiprocessing` version to end up with a smaller `similar_db_rows` list. – dano Aug 21 '14 at 20:12
  • This is puzzling. I see what you are saying. Post multi-processing, e.g. 1st case, 33 sets, and for that query and after the for loop & function call finishes, the result is 882 rows appended (I print the length), not 3100 all the rows in the db which is what is printing w/multi process code. anyhow, this yes is outside of multiprocessing. comparing the list of 33 values process and preprocess, results are a different set (I used set() on the two list prints) of numbers for the same row and same set :I – KBA Aug 21 '14 at 21:07
  • @KBA well that is really strange. Is it possible there's some other difference between the two versions of the script? Maybe different parameters being passed into `find_similar`? – dano Aug 21 '14 at 21:10
  • oh my goodness, sorry, there was a typo in my two scripts as I was figuring things out, now it is fine, the two lists for a row is the same result. as for the other mystery, I am not sure, but I have changed the list to a set for the saved rows, and this gives correct matches and results as pre-multiprocessing. Thanks for the above and beyond help with this – KBA Aug 22 '14 at 01:29