I've looked at other similar questions, but they did not quite give me the answer I was looking for. My main goal is to store a large dataset in a Google Cloud database. Storing around 1,000 insert statements went well.
On the other hand, storing over 200,000 insert statements proved to be more of a challenge than I thought. In my code, I have this function:
import psycopg2

def insert_to_gcloud(information):
    # Open a connection, join the list into one statement, execute and commit
    db = psycopg2.connect(database="database",
                          user="user",
                          password="password",
                          host="host",
                          port=1234)
    cur = db.cursor()
    cur.execute("".join(information))
    db.commit()
    db.close()
I use a batch insert (supposed to be faster), where the first element of the list contains the insert statement and the rest the value tuples. Then I use "".join() to turn it into a single string. As a simple example, to end up with this statement:

INSERT INTO tbl_name (a,b,c) VALUES (1,2,3),(4,5,6),(7,8,9);

I build a list like this:

["INSERT INTO tbl_name (a,b,c) VALUES",
 "(1,2,3)," ,
 "(4,5,6)," ,
 "(7,8,9);" ]
At first, I tried to execute all 200,000 insert statements at once, but after about a minute I got an error about EOF. I guess the statement was too big to send, so I made a chunk function. Basically, it divides the array into blocks of whatever size I want, as sketched below. (reference)
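
The chunking helper looks roughly like this (the name chunk_list and the chunk size are just illustrative):

def chunk_list(items, size):
    # Split the value tuples into blocks of at most `size` entries,
    # each block prefixed with the INSERT header so it runs on its own.
    header, values = items[0], items[1:]
    blocks = []
    for i in range(0, len(values), size):
        block = [header] + [v.rstrip(",;") + "," for v in values[i:i + size]]
        block[-1] = block[-1].rstrip(",") + ";"  # last tuple closes the statement
        blocks.append(block)
    return blocks

chunk_block = chunk_list(information, 1000)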
Then I used a simple for loop to store one block at a time:

for i in range(len(chunk_block)):
    insert_to_gcloud(chunk_block[i])
It seemed to be working, but when I let it run overnight it stopped after 237 minutes with this message:
psycopg2.OperationalError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
My next test is to store the chunk blocks in files, then read the files and log each one as it is inserted. If the server closes the connection unexpectedly again, I can check the log to see where it stopped. Though, in my opinion this is not a good way to do it, but I'm out of ideas here.
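
Roughly what I have in mind (the file names and the logging setup are just placeholders):

import logging
import pickle

logging.basicConfig(filename="insert_progress.log", level=logging.INFO)

# Write each chunk block to its own file first
for i, block in enumerate(chunk_block):
    with open("chunk_%05d.pkl" % i, "wb") as f:
        pickle.dump(block, f)

# Then read the files back and log progress as each one is inserted
for i in range(len(chunk_block)):
    with open("chunk_%05d.pkl" % i, "rb") as f:
        block = pickle.load(f)
    insert_to_gcloud(block)
    logging.info("chunk %d inserted", i)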
So, to my question: are there any other options I could try? Maybe there are other tools I can use?