I have a dataframe with multiple columns. Two of the columns in the dataframe are table_name and column_name. I need to check whether each column_name exists in the corresponding table_name in a SQL Server database.

Now, I have a function check_column_existence() for this, which seems to work fine if I use the traditional method of reading the CSV file and calling the function for each row:

import os

import pandas as pd
import pymssql


def check_column_existence(dbHost, dbPort, dbName, dbUser, dbPassword, table_name, column_name):
    # Look the column up in INFORMATION_SCHEMA; query parameters avoid quoting problems
    sqlQuery = """select count(*) from INFORMATION_SCHEMA.columns
                  where table_name = %s and column_name = %s;"""
    conn = pymssql.connect(host=dbHost, port=dbPort, user=dbUser, password=dbPassword, database=dbName)
    cursor = conn.cursor()
    cursor.execute(sqlQuery, (table_name, column_name))
    result = False
    for row in cursor:
        result = (row[0] == 1)
    conn.close()
    print(result)
    return result

def populate_db_attribute_existence(dbHost, dbPort, dbName, dbUser, dbPassword, input_csv_file_path):
    input_csv_file_path = os.path.abspath(input_csv_file_path)
    input_csv_folder_path = os.path.dirname(input_csv_file_path)
    input_csv_file_name = os.path.basename(input_csv_file_path)
    temp1_output_file_path = "{}/temp1_{}".format(input_csv_folder_path, input_csv_file_name)
    temp2_output_file_path = "{}/temp2_{}".format(input_csv_folder_path, input_csv_file_name)

    df = pd.read_csv(input_csv_file_path)
    df['DB_Entity'] = df['DB_Entity'].str.replace(' ', '')
    df['DB_Attributes'] = df['DB_Attributes'].str.replace(' ', '')
    df2 = df[['API_Attributes', 'DB_Entity', 'DB_Attributes']]
    df2.to_csv(temp1_output_file_path, index=False, encoding='utf-8')

    with open(temp1_output_file_path,'r') as ftemp1, open(temp2_output_file_path,'w') as ftemp2:
        i = 1
        for row in ftemp1:
            if i > 1:  # skip the header row
                row = row.strip()
                list_columns = row.split(',')
                api_attr_name = list_columns[0]
                db_entity_name = list_columns[1]
                c_name = list_columns[2]
                t_name = 'dbo.{}'.format(db_entity_name)
                result = check_column_existence(dbHost, dbPort, dbName, dbUser, dbPassword, t_name, c_name)
                ftemp2.write(','.join([api_attr_name, db_entity_name, c_name, str(result)])+'\n')
            i += 1

    del df2
    df2 = pd.read_csv(temp2_output_file_path, names=['API_Attributes', 'DB_Entity', 'DB_Attributes', 'DB_Attr_Exist'])
    merge_key = ['API_Attributes', 'DB_Attributes']
    merged_df = pd.merge(df, df2, on=merge_key, how='outer')
    merged_df.to_csv(input_csv_file_path, index=False, encoding='utf-8')
    os.remove(temp1_output_file_path)
    os.remove(temp2_output_file_path)
    print("Completed DB Attributes existence check: {}".format(input_csv_file_path))

I have updated this question with my current approach. Please recommend faster ways to achieve the same result, if there are any. How should I use df.apply() or a lambda function in this context, if possible?

Thanks in advance for your help and recommendations.

  • [Are you sure you want to be using `apply`?](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code) – cs95 Mar 20 '19 at 05:06
  • @coldspeed Not necessarily. My current solution is now updated along with my question, I was just curious if there is a better, more efficient way to get this done. – Keshav Prabhu Mar 20 '19 at 06:18

1 Answer


There are several possible improvements to your code.

First (and this is what your question is about): you use temporary CSV files only to walk through a dataframe row by row. apply has drawbacks, as explained by @coldspeed, but going through the disk is even worse.

Since you already have a Python function that you want to apply row by row to an existing dataframe in order to add a new column, this is an acceptable use case for apply. Direct access to the underlying numpy arrays would certainly be faster, but pandas was designed as a convenience framework, and convenience comes at a cost. So here you could add the new column to your dataframe with:

df = pd.read_csv(input_csv_file_path)
df['DB_Entity'] = df['DB_Entity'].str.replace(' ', '')
df['DB_Attributes'] = df['DB_Attributes'].str.replace(' ', '')
df['DB_Attr_Exist'] = df.apply(lambda x: check_column_existence(
                          dbHost, dbPort, dbName, dbUser, dbPassword,
                          'dbo.' + x['DB_Entity'], x['DB_Attributes']),
                      axis=1)
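
Note that axis=1 is needed so that apply passes each row (as a Series) to the lambda, rather than each column.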

Next, in the check_column_existence function, you create a new database connection for every line. That is a rather expensive operation; you really should open the connection once before processing the file and close it only at the end.
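
For example, here is a minimal sketch of that change, reusing the names from your code and assuming pymssql's %s parameter style: the connection is opened once, only the cursor is passed to the check function, and the connection is closed after the whole dataframe has been processed.

def check_column_existence(cursor, table_name, column_name):
    # Run the INFORMATION_SCHEMA lookup on an already-open cursor
    cursor.execute(
        """select count(*) from INFORMATION_SCHEMA.columns
           where table_name = %s and column_name = %s;""",
        (table_name, column_name))
    return cursor.fetchone()[0] == 1

# Open the connection once, reuse the cursor for every row, then close it
conn = pymssql.connect(host=dbHost, port=dbPort, user=dbUser,
                       password=dbPassword, database=dbName)
cursor = conn.cursor()
df['DB_Attr_Exist'] = df.apply(
    lambda x: check_column_existence(cursor, 'dbo.' + x['DB_Entity'],
                                     x['DB_Attributes']),
    axis=1)
conn.close()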

Serge Ballesta
  • Thank you for spending time to review my question and for providing your feedback. I will definitely incorporate both your suggestions – Keshav Prabhu Mar 22 '19 at 03:50