My Python code:
import pandas as pd

student_dict = {
    "ID": [101, 102, 103, 104, 105],
    "Student": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "Mark": [50, 100, 99, 60, 80],
    "Address": ["St.AAA", "St.BBB", "St.CCC", "St.DDD", "St.EEE"],
    "PhoneNo": [1111111111, 2222222222, 3333333333, 4444444444, 5555555555]
}
df = pd.DataFrame(student_dict)
print("First dataframe")
print(df)

fName = "Student_CSVresult.csv"

def dups(df):
    # keep=False drops every row that occurs more than once
    df = df.drop_duplicates(keep=False)
    return df

try:
    data = pd.read_csv(fName)
    print("CSV file data")
    print(data)
    # combine the existing file contents with the new dataframe, then drop duplicates
    df_merged = pd.concat([data, df])
    df = dups(df_merged)
    print("After removing dups")
    print(df)
    df.to_csv(fName, mode='a', index=False, header=False)
except FileNotFoundError:
    print("File Not Found Error")
    df = df.drop_duplicates()
    df.to_csv(fName, index=False)
    print("New file created and data imported")
except Exception as e:
    print(e)
On the first run, all data was imported without any duplicates. On the next run, I passed a different dataframe:
student_dict = {
    "ID": [101, 102, 103, 104, 105, 101, 102, 103, 104, 105, 106, 107],
    "Student": ["AAA", "BBB", "CCC", "DDD", "EEE", "AAA", "BBB", "CCC", "DDD", "EEE", "YYY", "ZZZ"],
    "Mark": [50, 100, 99, 60, 80, 50, 100, 99, 60, 80, 100, 80],
    "Address": ["St.AAA", "St.BBB", "St.CCC", "St.DDD", "St.EEE", "St.AAA", "St.BBB", "St.CCC", "St.DDD", "St.EEE", "St.AYE", "St.ZZZ"],
    "PhoneNo": [1111111111, 2222222222, 3333333333, 4444444444, 5555555555, 1111111111, 2222222222, 3333333333, 4444444444, 5555555555, 6666666666, 7777777777]
}
Again there were no issues. Then I passed the first dataframe again:
student_dict = {
    "ID": [101, 102, 103, 104, 105],
    "Student": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "Mark": [50, 100, 99, 60, 80],
    "Address": ["St.AAA", "St.BBB", "St.CCC", "St.DDD", "St.EEE"],
    "PhoneNo": [1111111111, 2222222222, 3333333333, 4444444444, 5555555555]
}
and this time it created duplicate rows in the CSV.
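To show concretely what I mean by duplicates, here is a small standalone example (simplified to one ID column, not my real file) of what drop_duplicates(keep=False) does when the new dataframe only repeats rows that are already in the file:

import pandas as pd

old = pd.DataFrame({"ID": [101, 102, 106, 107]})  # rows already in the CSV
new = pd.DataFrame({"ID": [101, 102]})            # new batch only repeats old rows

merged = pd.concat([old, new])
print(merged.drop_duplicates(keep=False))
# only 106 and 107 remain, and those get appended to the file a second time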
Can someone help me solve this issue? I don't want to overwrite the main file (Student_CSVresult.csv), only append to it.
Also, is there any way to add a new column to the file that automatically captures the timestamp of each data entry?
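Something like this is roughly what I have in mind for the timestamp (the column name "EntryTime" is just an example), but I'm not sure it is the right approach:

# hypothetical: stamp every row of the current batch before appending it
df["EntryTime"] = pd.Timestamp.now()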