Adding data to csv using pandas dataframe, by adding new column

Question

I want to append to a csv file, some data from redshift tables, using the pandas module in python. From python, I can successfully connect and retrieve rows from redshift tables using the psycopg2 module. Now, I am storing datewise data on the csv. So I need to first create a new date column in the csv, then append the data retrieved in that new column.

I am using the following commands to read from redshift tables:

conn=psycopg2.connect( host='my_db_hostname', port=`portnumber`, user='username', password='password', dbname='db')
conn.autocommit = True
cur=conn.cursor()
cur.execute(""" select  emp_name, emp_login_count from public.emp_login_detail where login_date=current_date  """)
records=cur.fetchall()
cur.close()

Now, I want to append these emp_name and emp_login_count columns to the existing csv. Below is a snapshot of csv:

Everyday I need to add new date column in csv and then I need to put the emp_login_count against respective person's name.
I am new to Pandas and have no idea how to implement this. Can someone please help me out?

Can we see the sample/example value of `records`? You have now, login count in `emp_login_count` and you want to change that into today's date, right? — imxitiz, Jul 30 '21 at 03:10
`CSV` isn't append friendly, particularly if you want to append columns. You will have to re-read, add the column, and write it back everytime, which gets more expensive with time. I suggest you rethink how you are storing your data. Since the updates are by date, I would suggest you "partition" your dataset by date, and use something like parquet. If you want to keep it simple, you can manage the partitions yourself, and continue using `CSV`s. — suvayu, Jul 30 '21 at 07:17
You may want to check this answer here: https://stackoverflow.com/a/34485965/8523960 — Laenka-Oss, Aug 07 '21 at 14:10

score 2 · Answer 1 · answered Jul 30 '21 at 03:40

I'm considering that everyday u'll get 1 column with the emp_name data, and other column with the emp_login_count data.

I belive u need to look at:

1 - Read the csv file with the read_csv in pandas, because this will give u a dataframe with the history information.

2 - Everyday u must create a new dataframe, with the information got in the database. If u want, u may use the datetime library and today method to get the todays date and use as a columns title.

3 - Merge the information from the csv with the new data, using merge method in pandas, u'll probably need something like: df_original.merge(df_new_information, left_on='emp_name', right_on='emp_name', how='outer') . The left_on and right_on are the columns names that u'll use as reference.

4 - Write a csv file with the to_csv in pandas, this will create a new or replace the existing csv file.

Sorry for the bad english.

score 2 · Accepted Answer · answered Jul 30 '21 at 06:54

2

Add the following and try it out:

records=cur.fetchall()

# Create a dataframe of the SQL query's result 
column_names = ['emp_name','login_count']
df = pd.DataFrame(records, columns = column_names)
df.head()

Now create another dataframe for the daily login counts csv file

df_daily = pd.read_csv('<INSERT the path_to_csv here>')
df_daily.head()

Merge the two dataframes on the 'emp_name' column

result = df.merge(df_daily, on='emp_name')
result.head()

After the join, you can rename the 'login_count' column to today's date

result.rename(columns = {'login_count':'< INSERT date here>'}, inplace = True)

You can then save the new data into a csv file again:

pd.to_csv('<INSERT name of file.csv>', index=False)

answered Jul 30 '21 at 06:54

Shrawan Sapre

70
6

Thank you very much @Shrawan Sapre. This has completely solved my problem. I just wanted to ask one thing, I was reading the file from S3 and need to replace the original file there with the new csv, or save the changes to the original file. Is there any way this can be done? – Sidhant Gupta Aug 09 '21 at 02:46
I use the python library s3fs to read and write csvs with pandas. You can try using the packages boto3 or s3fs whichever works best for you. Here is an article giving an example: https://towardsdatascience.com/reading-and-writing-files-from-to-amazon-s3-with-pandas-ccaf90bfe86c – Shrawan Sapre Aug 10 '21 at 03:16

Adding data to csv using pandas dataframe, by adding new column

2 Answers2