
I have a generator whose content I need to get into an SQL database. The generator contains more than a million items, and while iterating over it I collect the items in batches of 10,000, load each batch into a pandas DataFrame, and commit it to the database via the `to_sql` function.

That works fine as long as I can fill a batch with 10,000 items. However, `StopIteration` is never raised, even though the last item of the generator has already been returned by `next`.
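
For reference, this is the behaviour I expect from `next` on a plain, exhausted generator (a minimal example, unrelated to ijson):

gen = (i for i in range(3))
next(gen)  # 0
next(gen)  # 1
next(gen)  # 2
next(gen)  # raises StopIteration once the generator is exhausted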

Debugging it, I can see item 1,384,471 being processed in the try-block, and then, while the loop condition is still `True`, `next` does not return a new item; but instead of jumping to the except-block, the program simply hangs.

It would be nice if someone could point me to my mistake.

import ijson
import pandas as pd
import sqlalchemy as sa

def connect_to_database():
    engine = sa.create_engine(f"postgresql+psycopg2://{USER}:{PWD}@{SERVERNAME}:{PORT}/{DB}")

    return engine

engine = connect_to_database()

def create_content(objects, table_name):
    ''' simple helper function to get content
    for every object list from the JSON file'''

    item_lst = []
    item_count = 0

    while True:
        try:
            item = next(objects)
        
            value_conversion(item)
            item_lst.append(item)
            # debugging aid: fires one item before the last (item 1,384,470 of 1,384,471)
            if item_count == 1380000 and len(item_lst) == 4470:
                print('Watch it!')
        
            if len(item_lst) == 10000:
                item_count += len(item_lst)
                submit_sql(table_name, item_lst)
                item_lst.clear()

        except StopIteration:
            if len(item_lst) > 0:
                item_count += len(item_lst)
                submit_sql(table_name, item_lst)
                item_lst.clear()
                print(item_count)

            break               

        except Exception as e:
            print(e)


def submit_sql(table_name, items):
    ''' submit dataframe content of items to sql database '''
    
    engine = connect_to_database()

    temp_df = pd.DataFrame(items)
    temp_df.to_sql(table_name, engine, if_exists='append', index=False)

    engine.dispose()


with open("C:/Path/To/Json/file.json", "r", encoding="utf-8") as json_file:
    s_objects = ijson.items(json_file, 'ObjectList.item')
    create_content(s_objects, 'tablename')

The version with a simple for-loop doesn't work either. The if-statement after the for-loop is never reached.

try:
    for item in objects:
        value_conversion(item)
        item_lst.append(item)
        
        if len(item_lst) == 10000:
            item_count += len(item_lst)
            submit_sql(table_name, item_lst)
            item_lst.clear()
            print(item_count)


    if len(item_lst) > 0:
        item_count += len(item_lst)
        submit_sql(table_name, item_lst)
        item_lst.clear()
        print(item_count)

except Exception as e:
    print(e)
TomGeo
  • please fix your indentation – juanpa.arrivillaga Aug 11 '22 at 16:35
  • btw, you should just for-loop over `objects`, i.e., `for item in objects: ...`; don't use `while True: item = next(objects)` – juanpa.arrivillaga Aug 11 '22 at 16:37
  • The `while True` was introduced because the for-loop with the try-except statement performed the same way in the end and did not reach the following if-statement, which is now found below the `StopIteration`. I simply never get to the point where the remaining items (fewer than 10,000) are processed. – TomGeo Aug 11 '22 at 16:43
  • then a `StopIteration` isn't being raised – juanpa.arrivillaga Aug 11 '22 at 16:43
  • I updated the original posting. – TomGeo Aug 11 '22 at 17:02
  • If what you say is true that means the `objects` is not empty. You claim "even though the last item of the generator was called by the next method." but apparently that isn't true -- how do you know? There is some assumption you have here which is not true. In any case, the for-loop version should be what you are using – juanpa.arrivillaga Aug 11 '22 at 17:04
  • Possibly related: https://stackoverflow.com/q/24527006/1639625 – tobias_k Aug 11 '22 at 17:17
  • I know that the final number of items in the generator is 1,384,471 because I had to figure that out through painstaking debugging. AFAIK, the for-loop also uses a StopIteration internally to end the loop. – TomGeo Aug 11 '22 at 17:20
  • I just tested a simplified version of the code, without pandas and sql, just printing each batch, and it seems to work just fine. – tobias_k Aug 11 '22 at 17:24
  • Yes, a simplified version worked for me as well, actually in both versions: while and the simple for-loop. But as soon as I start doing the real work with the generator from ijson, the last bit is never processed. I wonder why the try-statement is never left. – TomGeo Aug 11 '22 at 17:30
  • @tobias_k I had a look at the posting you linked. Not that I understand much about generators; in fact it's the first time I am working with one. But I got the impression your posting was more about how to build the generator, or am I mistaken? – TomGeo Aug 11 '22 at 17:32
  • That's right, but you can still use that. Use it to get the "chunks" from the `objects` iterator, then just convert those to lists, do the preprocessing on each element, and submit them to the DB. – tobias_k Aug 11 '22 at 18:41
  • About the loop not ending: in that case some exception other than StopIteration has to be raised, and it should be printed in the final except clause. Is there any? Or does it just hang forever on `next`? – tobias_k Aug 11 '22 at 18:43
  • As far as I can tell it's hanging forever on `next`. I will try to follow your advice with the chunks. – TomGeo Aug 11 '22 at 20:30
  • The `chunks` function might not work then either; this almost sounds like a bug in `ijson` itself. – tobias_k Aug 11 '22 at 20:55
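
A minimal sketch of the batching approach suggested in the comments, using itertools.islice to pull fixed-size chunks from the ijson iterator instead of calling next by hand; it assumes value_conversion and submit_sql behave as in the question, and it will only help if the hang is caused by the manual while/next loop rather than by ijson itself:

import ijson
from itertools import islice

BATCH_SIZE = 10000

def batched(iterator, size):
    # yield lists of at most `size` items until the iterator is exhausted
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

with open("C:/Path/To/Json/file.json", "r", encoding="utf-8") as json_file:
    objects = ijson.items(json_file, 'ObjectList.item')
    item_count = 0
    for batch in batched(objects, BATCH_SIZE):
        for item in batch:
            value_conversion(item)        # assumed helper from the question
        submit_sql('tablename', batch)    # assumed helper from the question
        item_count += len(batch)
        print(item_count)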

0 Answers