SQL iterate over distinct values of a column and build a data frame for each value

Question

I have a table that looks like the following:

|A|B|C|D|
|---|---|---|---|
|1|b1|c1|d1|
|1|b2|c2|d2|
|2|b3|c3|d3|
|2|b4|c4|d4|

I would like to iterate over distinct values of A and build a pandas data frame out of the remaining columns and then use that table to do calculations. I tried the following:

import sqlite3
import pandas as pd
conn = sqlite3.connection('my_db.db')
c = conn.cursor()
for entry in c.execute("SELECT DISTINCT A in table):
    df = pd.DataFrame(c.execute("SELECT * FROM table WHERE A = ?", (entry[0],)).fetchall())

This doesn't work because the second cursor object that builds the dataframe overwrites the cursor object that i was iterating over. I also discovered that you can not have two cursor objects. How should I work around this?

score 1 · Answer 1 · answered Aug 16 '16 at 23:52

1

Is there a particular reason you don't want to do this whole operation in pandas itself? You could simply do it like so:

parent_df = pd.read_sql(c, "SELECT * from table")
for name, group in parent_df.groupby('A'):
    print(name, group.head())

Or

parent_df.set_index('A', inplace=True)
parent_df.head(20)

answered Aug 16 '16 at 23:52

Kartik

8,347
39
73

Unfortunately I can not read the whole table into memory at once so I can't use pandas in that fashion. Good to know that is an option though! – deltap Aug 17 '16 at 18:12
`pandas.read_sql` has `chunksize` kwag. Sort your table using SQL query, then use the `chunksize` option. – Kartik Aug 17 '16 at 18:18

score 1 · Answer 2 · answered Aug 17 '16 at 00:01

1

Put all the data you're interested in into a DataFrame (if it's not a huge dataset) then filter the dataset.

df = pd.DataFrame(c.execute("SELECT * FROM table").fetchall())
distict_a = df['A'].unique()
for a in distinct_a:
    df_for_this_a = df.query[df.A == a]

answered Aug 17 '16 at 00:01

it's-yer-boy-chet

1,917
2
12
21

1

This works but part of the challenge is I can not load the entire table into memory. – deltap Aug 17 '16 at 16:20

Parfait · Answer 3 · 2016-08-17T17:11:11.167

1

Consider using pandas's read_sql (with parameterization in passing the cursor value) and iteratively save each dataframe to a dictionary where the reference key is the corresponding distinct value (dict route avoids multiple dfs in your global environment):

import sqlite3
import pandas as pd

conn = sqlite3.connect('my_db.db')
c = conn.cursor()

dfDict = {}
for entry in c.execute("SELECT DISTINCT A FROM table"):
    strSQL = "SELECT * FROM table WHERE A = :nameofparam"   
    dfDict[entry[0]] = pd.read_sql(strSQL, conn, params={'nameofparam': entry[0]})

c.close()
conn.close()

for k, v in dfDict.items():
    print(k, '\n', v.head())

edited Aug 17 '16 at 17:11

answered Aug 17 '16 at 02:55

Parfait

104,375
17
94
125

I know that for regular SQL queries it is important to use "?" to protect against sql attacks. Is the code above safe? Can pd.read_sql() execute code? – deltap Aug 17 '16 at 16:25
See update using the params arg of `read_sql` which passes the cursor value to a named parameter binded to sql statement. This protects against sql injection. – Parfait Aug 17 '16 at 17:13

score 0 · Accepted Answer · edited May 23 '17 at 12:33

The end solution was to use pandas.read_sql with chunksize

I found this post useful as well.

import sqlite3
import pandas as pd
conn = sqlite3.connection('my_db.db')
for df in pd.read_sql("SELECT * from table ORDER BY A ASC", conn, chunksize = 100000):
    group  = df.groupby('A')
    last   = group.first().tail(1).index.values[0]
    last_a = 0
    for a, g_df in group:
        if (a == last_a):
            g_df = l_df.append(g_df)

        ....calculations....

        if (a == last):
            l_df = g_df
            l_a  = a

It is really important to have logic that ties together the groupby data frames that are split into two different chunks.

SQL iterate over distinct values of a column and build a data frame for each value

4 Answers4