0

I am not a heavy coder, but I have this issue: I have a data set of ID - Time series - Velocity in SQL Server.

I want to split the data set into 1700 CSV files based on the unique values of ID (1700 is the number of unique IDs).

When I ran the following code, Windows stopped responding. I need some help doing the same thing, either through dynamic SQL or through something else.

P.S. The database table has 300 million rows.

import pyodbc
import numpy as np
import pandas as pd

db = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                    "Server=ServerName;"
                    "Database=DatabaseName;"
                    "Trusted_Connection=yes;")

# pull the entire table into one dataframe
df = pd.read_sql_query("SELECT * FROM DataBase", db)

ID = df['ID'].unique()
dfs = {}
for i, S in enumerate(ID):
    dfs[i] = df[df['ID'] == S]    # subset of rows for one ID
    filename = str(S) + '.CSV'
    df.to_csv(filename)           # writes the whole dataframe each time

    You're pulling 300M rows from the DB (I hope you are on good terms with the DBA), waiting who knows how long for all that data to cross the network (hope you're on good terms with the network admins), and then dumping the entire dataframe (all 300M rows!) to disk ... 1700 times. No wonder it hangs. Try to find a less wasteful SQL query (select all unique IDs) and then loop through each ID and get the data you need. – Danielle M. Mar 29 '19 at 20:00
  • That's likely causing memory issues if the database is as large as you say. If you have a list of IDs you can make individual queries limited to one id at a time using a `WHERE` clause. Otherwise, look into cursors: https://stackoverflow.com/questions/20662356/sql-server-loop-how-do-i-loop-through-a-set-of-records – Kal Zekdor Mar 29 '19 at 20:00
  • Try exporting to text from SQL Server and avoid pandas as medium (see the sketch below). – Parfait Mar 29 '19 at 22:45
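Following up on the comments, here is a minimal sketch of the cursor-based approach: stream the rows ordered by ID through a pyodbc cursor and write them straight to CSV with the csv module, so the 300 million rows never have to fit in memory at once. The table name DataBase and the column names Time and Velocity are assumptions taken from the question.

import csv
import pyodbc

db = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                    "Server=ServerName;"
                    "Database=DatabaseName;"
                    "Trusted_Connection=yes;")

cursor = db.cursor()
# column names are assumed from the question (ID - Time series - Velocity)
cursor.execute("SELECT [ID], [Time], [Velocity] FROM DataBase ORDER BY [ID]")

current_id = None
outfile = None
writer = None
while True:
    rows = cursor.fetchmany(50000)        # pull a modest batch at a time
    if not rows:
        break
    for row in rows:
        if row.ID != current_id:          # new ID -> start a new CSV file
            if outfile is not None:
                outfile.close()
            current_id = row.ID
            outfile = open(str(current_id) + '.CSV', 'w', newline='')
            writer = csv.writer(outfile)
            writer.writerow(['ID', 'Time', 'Velocity'])
        writer.writerow(list(row))
if outfile is not None:
    outfile.close()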

2 Answers

0

I'm too new to comment, so sorry if the code below doesn't work. Please comment on this answer if you have any issues or don't understand how to modify this code to fit your case. Like the comments on your question, I suggest you:

- Grab all the unique ID values

- Query smaller chunks of the database by looping through the unique IDs

import pyodbc
import pandas as pd

db = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                    "Server=ServerName;"
                    "Database=DatabaseName;"
                    "Trusted_Connection=yes;")

# fetch only the distinct IDs (about 1700 values) instead of the whole table
df_ID = pd.read_sql_query("SELECT DISTINCT [ID] FROM DataBase", db)
ID = df_ID['ID'].tolist()

# query one ID at a time and write each result to its own CSV
for i in ID:
    df = pd.read_sql_query("SELECT * FROM DataBase WHERE [ID] = ?", db, params=[i])
    filename = str(i) + '.CSV'
    df.to_csv(filename, index=False)

I normally use SQLAlchemy to run queries against SQL databases, but I think I got the pyodbc version right above.
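For reference, a rough SQLAlchemy version of the same loop (an untested sketch; ServerName and DatabaseName are placeholders, and the table/column names are assumed from the question):

import urllib.parse
import pandas as pd
from sqlalchemy import create_engine, text

# build an engine from the same ODBC connection string
odbc = ("Driver={SQL Server Native Client 11.0};"
        "Server=ServerName;"
        "Database=DatabaseName;"
        "Trusted_Connection=yes;")
engine = create_engine("mssql+pyodbc:///?odbc_connect=" + urllib.parse.quote_plus(odbc))

ids = pd.read_sql_query("SELECT DISTINCT [ID] FROM DataBase", engine)['ID'].tolist()
for i in ids:
    # bound parameter instead of string concatenation
    df = pd.read_sql_query(text("SELECT * FROM DataBase WHERE [ID] = :id"),
                           engine, params={"id": i})
    df.to_csv(str(i) + '.CSV', index=False)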

Jeffyx
  • I think it's `ID = df_ID['ID'].tolist()`, but I got the idea. I have put it in and I should be able to confirm in about a day! Thank you so much! – Lush_coding Mar 29 '19 at 21:15
  • How large do you think each DF is? If they're huge then maybe you need to use chunksize. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html) – Jeffyx Apr 01 '19 at 12:13
  • Each chunked DF would be around 170,000 rows. I will take a look at the chunksize link you posted (see the sketch below). Thanks – Lush_coding Apr 02 '19 at 23:05
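In case chunksize does turn out to be needed, here is a rough sketch of combining it with a per-ID append (same placeholder connection string and assumed table name as above):

import pandas as pd
import pyodbc

db = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                    "Server=ServerName;"
                    "Database=DatabaseName;"
                    "Trusted_Connection=yes;")

ids = pd.read_sql_query("SELECT DISTINCT [ID] FROM DataBase", db)['ID'].tolist()
for i in ids:
    filename = str(i) + '.CSV'
    first_chunk = True
    # read each ID's ~170,000 rows in smaller pieces and append them to one file
    for chunk in pd.read_sql_query("SELECT * FROM DataBase WHERE [ID] = ?",
                                   db, params=[i], chunksize=50000):
        chunk.to_csv(filename, mode='w' if first_chunk else 'a',
                     header=first_chunk, index=False)
        first_chunk = False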
0

You said you have 300 million rows in the database, but I am not sure how many rows you end up with after taking the unique values of the ID column.

I am assuming 300 million. You can build a dictionary of all 1700 divided dataframes and save whichever dataframes you like from there.

import pyodbc
import numpy as np
import pandas as pd

db = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                    "Server=ServerName;"
                    "Database=DatabaseName;"
                    "Trusted_Connection=yes;")

df = pd.read_sql_query("SELECT * FROM DataBase", db)

# build a dictionary with one dataframe per unique ID (about 1700 entries)
d = {key: group for key, group in df.groupby('ID')}

# save the first dataframe
first_key = list(d)[0]
d[first_key].to_csv(str(first_key) + '.csv', index=False)

This should save you a bit of memory.
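And if you want all 1700 files rather than just the first one, a short sketch looping over that dictionary:

# write each per-ID dataframe in the dictionary to its own CSV
for key, frame in d.items():
    frame.to_csv(str(key) + '.csv', index=False)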

Kartikeya Sharma