Pandas dataframe, checking if row has exceptions and creating multiple string variables depending if row shows up multiple times

Question

I'm stuck on figuring this out. I have the following df I created from output. I'm looping through to get the table_name, schema, column_name and comment. I want to make sure there is a comment present, and if there is create a string query that appends to the list query_list.

output =  [['table_name', 'schema_name', 'column_name', 'data_type', 'null?', 'default', 'kind', 'expression', 'comment', 'database_name', 'autoincrement'], ['ACCOUNT', 'SFO', '_LOAD_DATETIME', '{"type":"TIMESTAMP_LTZ","precision":0,"scale":9,"nullable":true}', 'TRUE', '', 'COLUMN', '', 'DATE of Account', 'VE'], ['ACCOUNT', 'SFO', '_LOAD_FILENAME', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', 'file name', 'VE'], ['ACCOUNT', 'SFO', '_LOAD_FILE_TIMESTAMP', '{"type":"TIMESTAMP_NTZ","precision":0,"scale":9,"nullable":true}', 'TRUE', '', 'COLUMN', '', '', 'VE'], ['CUSTOMER', 'SFO', 'SUBSCRIPTIONSLIST', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'VE'], ['CUSTOMER', 'SFO', 'CONTACTROLESLIST', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', 'list of contract', 'VE'], ['DATA', 'SFO', 'OPPORTUNITY_NAME', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'VE']]

output = filter(bool, output)
df = pd.DataFrame(output)


df.columns = df.iloc[0]
df = df[1:]

query_list = []
grouped_comments = ''

for index, row in df.iterrows():
    if row['comment'] is not None and row['comment'] != '':
        
        if len(row['table_name']) > 1:
            # the below doesn't work, it groups all table comments together

            sql = f"(COLUMN {row['column_name']} COMMENT '{row['comment']}')"
            grouped_comments = grouped_comments + sql

        elif len(row['table_name']) == 1:
            sql = f"ALTER TABLE {row['schema_name']}.{row['table_name']} ALTER COLUMN {row['column_name']} COMMENT '{row['comment']}';"

        query_list.append(sql)

Now the part I'm stuck on getting, is if there is a comment present and if the table_name shows up more then once then it should create a string that looks like the below, so it should be fetching all the column_name and comment and grouping into one string for that table_name:

"ALTER TABLE VE.ACCOUNT ALTER (COLUMN _LOAD_DATETIME COMMENT 'DATE of Account', COLUMN _LOAD_FILENAME COMMENT 'file name');"

and the elif works because if table_name only shows up once then it populates the correct string:

"ALTER TABLE VE.CUSTOMER ALTER COLUMN CONTACTROLESLIST COMMENT 'list of contract';"

so finally if I have the 2 above strings then my query_list should look like:

query_list = ["ALTER TABLE VE.ACCOUNT ALTER (COLUMN _LOAD_DATETIME COMMENT 'DATE of Account', COLUMN _LOAD_FILENAME COMMENT 'file name');",
 "ALTER TABLE VE.CUSTOMER ALTER COLUMN CONTACTROLESLIST COMMENT 'list of contract';"]

Could you match your sample data, current result, and expected output? There is no `w` or `'x'` in your sample data. It is easier to follow if those are matching. You can modify either the sample data or result, whichever it is easy for you. — Emma, Mar 01 '22 at 17:33
Do you need alter statement only where it has a comment? Do you discard any rows with no comments? — Emma, Mar 01 '22 at 17:49
@Emma yes only when there is a comment present, otherwise the other rows are unused — KristiLuna, Mar 01 '22 at 18:34

Emma · Accepted Answer · 2022-03-01T19:00:21.403

First, you can filter out unnecessary rows.

df = df[df.comment.notnull() & (df.comment.str.len() > 0)]

Then, concatenate the column statement.

df['column_prop'] = 'COLUMN ' + df.column_name + ' COMMENT \'' + df.comment + '\''

Now, aggregate the column_prop to join the string, group by the table name.

df = df.groupby(['database_name', 'table_name']).agg({'column_prop': lambda x: ', '.join(x)}).reset_index()

This will give you the following.

0 database_name table_name                                                                                 column_prop
0            VE    ACCOUNT  COLUMN _LOAD_DATETIME COMMENT 'DATE of Account', COLUMN _LOAD_FILENAME COMMENT 'file name'
1            VE   CUSTOMER                                          COLUMN CONTACTROLESLIST COMMENT 'list of contract'

From here, you can concatenate the each columns with additional string to obtain your desired string.

If your dataframe is small, you can simply concatenate like this.

df['sql'] = 'ALTER TABLE ' + df.database_name + '.' + ...

Or check here https://stackoverflow.com/a/54298586/2956135 for better string concatenation approach.

thank you this is great! I wanted this to be in a list though because I'm using the snowflakehook and it requires a list https://github.com/apache/airflow/blob/main/airflow/providers/snowflake/hooks/snowflake.py#L269 — KristiLuna, Mar 01 '22 at 19:12
You can do `df.sql.values.tolist()` at the end once you have generated the whole statement in `sql` column — Emma, Mar 01 '22 at 19:13
thank you! I got too caught up with looping, but pandas has a lot to offer without having to do that. thanks! — KristiLuna, Mar 01 '22 at 19:15

Pandas dataframe, checking if row has exceptions and creating multiple string variables depending if row shows up multiple times

1 Answers1