Find a duplicate row with max data

Question

I have a csv file like this:

Date of event       Name        Date of birth
06.01.1986          John Smit   23.08.1996
18.12.1996          Barbara D   01.08.1965
12.12.2001          Barbara D   01.08.1965
17.10.1994          John Snow   20.07.1965

I have to find unique rows by "Name" and "Date of birth" (may be with some other columns) but with MAX date .

So i have to get csv file like this:

Date of event       Name        Date of birth
06.01.1986          John Smit   23.08.1996
12.12.2001          Barbara D   01.08.1965
17.10.1994          John Snow   20.07.1965

How to do that? I don`t have any ideas..

Find unique rows , and also I need to combine this solution with source columns...and also write to csv — Alexandr Lebedev, Aug 30 '17 at 08:04
What do you mean by combine with source?Unique ones are from source,and if combine with nonunique one,the result is polluted. — Shihe Zhang, Aug 30 '17 at 08:09
Pandas writing dataframe to CSV file,https://stackoverflow.com/q/16923281/1278112 — Shihe Zhang, Aug 30 '17 at 08:10

score 0 · Answer 1 · answered Aug 30 '17 at 03:25

0

import pandas as pd

# read the csv in with pandas module

df = pd.read_csv('pathToCsv.csv', header=0, parse_dates=[0, 2])

# set the column names as more programming friendly  i.e. no whitespace

df.columns = ['dateOfEvent','name','DOB'] # and probably some other columns ..

# keep row only with max (Date of event) per group ( name, Date of Birth )

yourwish = =df.groupby(['Name','DOB'])['dateOfEvent'].max()

answered Aug 30 '17 at 03:25

yukclam9

336
2
12

thanks a lot , it helps me to find this rows, but I also need to combine the result with source csv-columns – Alexandr Lebedev Aug 30 '17 at 08:05
what does this mean – yukclam9 Aug 31 '17 at 10:33

Dewald Abrie · Accepted Answer · 2017-08-31T04:43:42.937

Formatting

Since your column names have spaces, it's better to delimit with commas.

Algorithm

You can use the pandas library to do this:

import tempfile
import pandas

# create a temporary csv file with your data (comma delimited)
temp_file_name = None
with tempfile.NamedTemporaryFile('w', delete=False) as f:
    f.write("""Date of event,Name,Date of birth
06.01.1986,John Smit,23.08.1996
18.12.1996,Barbara D,01.08.1965
12.12.2001,Barbara D,01.08.1965
17.10.1994,John Snow,20.07.1965""")
    temp_file_name = f.name

# read the csv data using the pandas library, specify columns with dates
data_frame = pandas.read_csv(
    temp_file_name,
    parse_dates=[0,2],
    dayfirst=True,
    delimiter=','
)

# use groupby and max to do the magic
unique_rows = data_frame.groupby(['Name','Date of birth']).max()

# write the results
result_csv_file_name = None
with tempfile.NamedTemporaryFile('w', delete=False) as f:
    result_csv_file_name = f.name
    unique_rows.to_csv(f)

# read and show the results
with open(result_csv_file_name, 'r') as f:
    print(f.read())

This results in:

Name,Date of birth,Date of event
Barbara D,1965-08-01,2001-12-12
John Smit,1996-08-23,1986-01-06
John Snow,1965-07-20,1994-10-17

But If I want to write result of this, what can I do ? I need csv grouped by max date with all columns of source csv. — Alexandr Lebedev, Aug 30 '17 at 07:52
@AlexandrLebedev I updated my answer to also write out to csv. You should really just use google to lookup some documentation. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html — Dewald Abrie, Aug 31 '17 at 04:44

Find a duplicate row with max data

2 Answers2

Formatting

Algorithm