
I have two tables: trainSearchStream and SearchInfo. My code is supposed to merge the two tables on a common column named SearchID. The problem is that the code runs for some time and then terminates without reporting any error in my Eclipse console. The number of rows in each table is: trainSearchStream: 392,356,948 and SearchInfo: 112,159,462. The size of my database (which includes other tables) is 40GB. Update: when I ran the code in a terminal I got a message saying: 'killed: 9'

import pandas as pd
import sqlite3


# Create connection (`path` is defined elsewhere and points to the database directory).
con = sqlite3.connect(path + 'database.sqlite')

# Read the joined tables into a data frame.
df = pd.read_sql_query("SELECT * FROM trainSearchStream, SearchInfo WHERE trainSearchStream.SearchID = SearchInfo.SearchID;", con)

# Save to file.
df.to_csv(path + 'traindata.csv', index=False, encoding='utf-8')
MAS
  • I suppose it terminates during the `read_sql_query` call and not the `to_csv` one? Do you know what the size of the tables you want to query would be? And how much memory do you have? 40GB is not something that fits in a typical laptop's memory. – joris Jul 07 '15 at 11:39
  • BTW, if your only goal of the above is to export the data to csv, there are probably better ways to do this than a round trip through Python. See e.g. http://stackoverflow.com/questions/6076984/how-to-export-the-results-of-my-query-to-csv-file-sqlite – joris Jul 07 '15 at 11:41
  • A few questions: what is the size of the result of your query? (You can execute it somewhere other than Python.) Do you know if your Python is 32-bit or 64-bit? I guess the result of your query is loaded into RAM, and you are unlikely to have 40GB of RAM. Maybe it's over 2GB and you are on a 32-bit version, and that causes the problem. – Pholochtairze Jul 07 '15 at 11:41
  • I am not a Python guy, so I don't know exactly what your code does, but I think pd.read_sql_query will load the result of your query into memory. In this case that's not a good idea because of the size of the result. I would stream the query and then write the csv line by line (see the sketch after these comments). For small results that way would be slower. – Andre Jul 07 '15 at 11:48
  • @joris Yes, it terminates before to_csv. I have a laptop with 16GB memory and 500GB flash storage. I also have the tables in .tsv format; in that format they have sizes of 9.47GB and 11.02GB. My goal is to join several tables on a common column and then export them to a csv file. I tried the pandas merge function, but my code would terminate. When I ran the pandas code in a terminal I got a message saying: 'killed: 9' – MAS Jul 07 '15 at 11:52
  • @Pholochtairze I tried the SELECT command in sqlite3 directly and it worked. This is what I executed: SELECT * FROM trainSearchStream, SearchInfo WHERE trainSearchStream.SearchID = SearchInfo.SearchID; I am new to SQL and I don't know how to save my result in a new table. – MAS Jul 07 '15 at 12:22
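Following Andre's streaming suggestion, here is a minimal sketch of a chunked export with pandas, so that only one chunk of rows is in memory at a time (assuming the same `path` variable as in the question; the chunksize value is illustrative):

import pandas as pd
import sqlite3

con = sqlite3.connect(path + 'database.sqlite')

query = ("SELECT * FROM trainSearchStream, SearchInfo "
         "WHERE trainSearchStream.SearchID = SearchInfo.SearchID;")

# With chunksize, read_sql_query returns an iterator of data frames
# instead of loading the whole result at once.
first = True
for chunk in pd.read_sql_query(query, con, chunksize=100000):
    # Write the header only for the first chunk, then append.
    chunk.to_csv(path + 'traindata.csv', mode='w' if first else 'a',
                 header=first, index=False, encoding='utf-8')
    first = False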

1 Answer


As mentioned in the comments, I think the best way is to use sqlite directly rather than Python.

So the solution is, I guess, this:

sqlite> .mode csv
sqlite> .output test.csv
sqlite> SELECT * FROM trainSearchStream, SearchInfo WHERE trainSearchStream.SearchID = SearchInfo.SearchID; /* your query here */
sqlite> .output stdout

Sorry, the part below is not really an answer, but posting the code in a comment would have destroyed the indentation, and we are using Python ...

Could you run this and give us the output, please? (What was printed?)

import pandas as pd
import sqlite3


# Create connection (`path` is defined elsewhere, as in the question).
con = sqlite3.connect(path + 'database.sqlite')

try:
    df = pd.read_sql_query("SELECT * FROM trainSearchStream, SearchInfo WHERE trainSearchStream.SearchID = SearchInfo.SearchID;", con)
    print("read_sql_query went well")
except Exception as e:
    print("read_sql_query failed: " + str(e))

try:
    df.to_csv(path + 'traindata.csv', index=False, encoding='utf-8')
    print("to_csv went well")
except Exception as e:
    # Note: if read_sql_query failed, df is undefined and this will report a NameError.
    print("to_csv failed: " + str(e))
Pholochtairze