
I have the following pandas dataframe:

DB      Table   Column  Format

Retail  Orders  ID      INTEGER
Retail  Orders  Place   STRING
Dept    Sales   ID      INTEGER
Dept    Sales   Name    STRING

I want to loop over the tables, generating a SQL CREATE TABLE statement for each, e.g.

create table Retail.Orders ( ID INTEGER, Place STRING)
create table Dept.Sales ( ID INTEGER, Name STRING)

What I've done so far is get the distinct DB/table pairs using drop_duplicates, then for each table filter the matching rows and concatenate the strings into a SQL statement.

# Column-name constants for the dataframe
KEY_DB, KEY_TABLE, KEY_COLUMN, KEY_FORMAT = 'DB', 'Table', 'Column', 'Format'

def generate_tables(df_cols):
    tables = df_cols.drop_duplicates(subset=[KEY_DB, KEY_TABLE])[[KEY_DB, KEY_TABLE]]

    for _, row in tables.iterrows():
        db = row[KEY_DB]
        table = row[KEY_TABLE]

        print("DB: " + db)
        print("Table: " + table)

        sql = "CREATE TABLE " + db + "." + table + " ("
        cols = df_cols.loc[(df_cols[KEY_DB] == db) & (df_cols[KEY_TABLE] == table)]
        for _, col in cols.iterrows():
            sql += col[KEY_COLUMN] + " " + col[KEY_FORMAT] + ", "

        # drop the trailing ", " before closing the parenthesis
        sql = sql.rstrip(", ") + ")"

        print(sql)

Is there a better approach to iterate over the dataframe?

orak

3 Answers


This is the way I would do it. First create a dictionary via df.itertuples (more efficient than df.iterrows), then use str.format to interpolate the values cleanly.

Uniqueness of the (column, format) pairs is guaranteed during dictionary construction by using a set.

I also write the formatting step as a generator so you can iterate it lazily if you wish; it's always possible to exhaust the generator via list, as below.

from collections import defaultdict

d = defaultdict(set)
for row in df.itertuples():
    d[(row[1], row[2])].add((row[3], row[4]))

def generate_tables_jp(d):
    for k, v in d.items():
        yield 'CREATE TABLE {0}.{1} ({2})'\
              .format(k[0], k[1], ', '.join([' '.join(i) for i in v]))

list(generate_tables_jp(d))

Result:

['CREATE TABLE Retail.Orders (ID INTEGER, Place STRING)',
 'CREATE TABLE Dept.Sales (ID INTEGER, Name STRING)']
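One caveat worth noting (an editorial addition, not part of the original answer): a set does not guarantee iteration order, so the columns may come out in a different order than they appear in the dataframe. If column order matters, a plain dict (insertion-ordered in Python 3.7+) can serve as an ordered set:

```python
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({
    'DB':     ['Retail', 'Retail', 'Dept', 'Dept'],
    'Table':  ['Orders', 'Orders', 'Sales', 'Sales'],
    'Column': ['ID', 'Place', 'ID', 'Name'],
    'Format': ['INTEGER', 'STRING', 'INTEGER', 'STRING'],
})

# dict keys deduplicate like a set but preserve insertion order
d = defaultdict(dict)
for row in df.itertuples(index=False):
    d[(row.DB, row.Table)][(row.Column, row.Format)] = None

def generate_tables_ordered(d):
    for (db, table), cols in d.items():
        yield 'CREATE TABLE {0}.{1} ({2})'.format(
            db, table, ', '.join(' '.join(c) for c in cols))

print(list(generate_tables_ordered(d)))
```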
jpp

If looping is what you want, then yes, .iterrows() is the most straightforward way to step through a pandas frame. EDIT: from the other answer, and the question linked here - Does iterrows have performance issues? - I believe that .itertuples() is actually the better-performing iterator.
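As a quick illustration of the difference (a sketch, reusing the question's example data): iterrows boxes every row into a Series, while itertuples yields lightweight namedtuples, which is why it is usually faster:

```python
import pandas as pd

df = pd.DataFrame({
    'DB':     ['Retail', 'Retail', 'Dept', 'Dept'],
    'Table':  ['Orders', 'Orders', 'Sales', 'Sales'],
    'Column': ['ID', 'Place', 'ID', 'Name'],
    'Format': ['INTEGER', 'STRING', 'INTEGER', 'STRING'],
})

# iterrows: each row is converted to a Series (slower, dtypes may be upcast)
pairs_iterrows = [(row['DB'], row['Table']) for _, row in df.iterrows()]

# itertuples: rows come back as namedtuples (faster, attribute access)
pairs_itertuples = [(row.DB, row.Table) for row in df.itertuples(index=False)]

assert pairs_iterrows == pairs_itertuples  # same data, less overhead
```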

However, depending on the size of the dataframe, you might be better off using pandas groupby functions to build the statements instead.

Consider something like this:

# Add a concatenation of the column name and format
df['col_format'] = df['Column'] + ' ' + df['Format']

# Now create a frame which is the groupby of the DB/Table rows and 
# concatenates the tuples of col_format correctly
y1 = (df.groupby(by=['DB', 'Table'])['col_format']
        .apply(lambda x: '(' + ', '.join(x) + ')'))

# Reset the index to bring the keys/indexes back in as columns
y2 = y1.reset_index()

# Now create a Series of all of the SQL statements
all_outs = 'Create Table ' + y2['DB'] + '.' + y2['Table'] + ' ' + y2['col_format']

# Look at them!
all_outs.values
Out[44]: 
array(['Create Table Dept.Sales (ID INTEGER, Name STRING)',
       'Create Table Retail.Orders (ID INTEGER, Place STRING)'], dtype=object)

Hope this helps!

emmet02
  • I like this answer (and upvote it), but be careful. `groupby` is not *always* faster than `itertuples` for large dataframes, e.g. look at [this answer](https://stackoverflow.com/a/49413710/9209546). Remember that `groupby` should have O(n log n) complexity, but its performance is often driven by the function, and `lambda` is inefficient. – jpp Mar 22 '18 at 10:05
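Following up on that comment (an editorial sketch, not part of the original answer): the lambda can be replaced by passing ', '.join directly to .agg, moving the parenthesis wrapping out of the groupby:

```python
import pandas as pd

df = pd.DataFrame({
    'DB':     ['Retail', 'Retail', 'Dept', 'Dept'],
    'Table':  ['Orders', 'Orders', 'Sales', 'Sales'],
    'Column': ['ID', 'Place', 'ID', 'Name'],
    'Format': ['INTEGER', 'STRING', 'INTEGER', 'STRING'],
})
df['col_format'] = df['Column'] + ' ' + df['Format']

# str.join accepts any iterable of strings, so it can aggregate each group
y = df.groupby(['DB', 'Table'])['col_format'].agg(', '.join).reset_index()
all_outs = ('Create Table ' + y['DB'] + '.' + y['Table']
            + ' (' + y['col_format'] + ')')
print(all_outs.values)
```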

You can first assemble the info per row in an extra column, and then use groupby.sum:

# the labels used as the grouping key
index_labels = [KEY_DB, KEY_TABLE]

queries = df[KEY_COLUMN] + ' ' + df[KEY_FORMAT] + ', '
queries.index = df.set_index(index_labels).index
DB      Table 
Retail  Orders      ID INTEGER, 
        Orders    Place STRING, 
Dept    Sales       ID INTEGER, 
        Sales      Name STRING, 
dtype: object
queries = queries.groupby(index_labels).sum().str.strip(', ')
DB      Table 
Dept    Sales      ID INTEGER, Name STRING
Retail  Orders    ID INTEGER, Place STRING
dtype: object
def format_queries(queries):
    query_pattern = 'CREATE TABLE %s.%s (%s)'
    for (db, table), text in queries.items():
        query = query_pattern % (db, table, text)
        yield query

list(format_queries(queries))
['CREATE TABLE Dept.Sales (ID INTEGER, Name STRING)',
 'CREATE TABLE Retail.Orders (ID INTEGER, Place STRING)']

This way you don't need the lambda. I don't know whether this approach or itertuples will be faster.

Maarten Fabré