
Looking for a Pythonic way of finding duplicate values in a text file.

1||mike||jones||38||first street||2018-05-01
2||michale||jones||38||8th street||2018-05-01
3||mich||jones||38||9th street||2018-05-01
4||mitchel||jones||38||10th street||2018-05-01
1||mike||jones||38||first street||2018-12-01

I'm trying to find duplicates in the id column and keep the most recent record. Would I just loop over the rows, insert the ids into a list, and then check whether each value is already in the list?
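
For reference, a minimal pure-Python sketch of that idea; it uses a dict keyed on id rather than a plain list, so the newest row per id survives in a single pass (the file name input.txt is a placeholder, and the ISO dates are assumed to compare correctly as strings):

latest = {}
with open("input.txt") as f:          # placeholder file name
    for line in f:
        fields = line.rstrip("\n").split("||")
        if not fields or not fields[0]:
            continue                  # skip blank lines
        row_id, date = fields[0], fields[-1]
        # keep the row if this id is new, or if this date is more recent than the stored one
        if row_id not in latest or date > latest[row_id][-1]:
            latest[row_id] = fields

for fields in latest.values():
    print("||".join(fields))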

user3525290
  • What do you mean by keeping the most recent? Do you want to remove all duplicates but the most recent? Please provide the final text file you would like to get. – sentence Apr 22 '19 at 20:30
  • Maybe this [question](https://stackoverflow.com/questions/55568559/trying-to-sort-through-a-csv-file-in-python-create-list-and-remove-duplicates?rq=1) will help. – m13op22 Apr 22 '19 at 20:36
  • Sorry, by most recent I meant using the last column's date. For instance, column 0 holds the ids; the file has duplicate 1's, but the last entry in this case has a more recent date than the first id 1, so I need that instance rather than the first one. Of course, it can appear anywhere in the file, not always as the last entry. – user3525290 Apr 23 '19 at 11:21

2 Answers


We have the very powerful pandas library available to perform this kind of analytical operation in a minimum of code.

Pandas is an open-source Python package that provides numerous tools for data analysis. Some of its basic advantages and uses are listed below:

  1. It can present data in a way that is suitable for data analysis.
  2. The package contains multiple methods for convenient data filtering.
  3. It has a variety of utilities to perform Input/Output operations.

Implementation of the case you want to achieve using pandas:

First, install pandas with `pip install pandas`.

Input: a text file with data in the given format

Output: a text file with the required result in CSV format


import pandas as pd

with open("input") as file:     # Read input
    headers = ["id", "first_name", "last_name", "age", "address", "date"]
    data_frame = pd.read_csv(file, sep='[|][|]', names=headers, header=None, parse_dates=['date'],
                             engine="python")   # Read the '||'-delimited data into a data frame
    data_frame.sort_values('date', ascending=False, inplace=True)     # Sort the rows by date, newest first
    data_frame.drop_duplicates(subset='id', inplace=True)     # Drop duplicate rows based on id, keeping the first (most recent)
    data_frame.to_csv('output', sep=',', header=False, index=False)   # Generate the output file
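
For reference, the same result can be obtained without sorting by picking the row with the latest date per id via groupby and idxmax (a sketch, assuming data_frame has just been read as above, before the sort and drop_duplicates steps):

# Keep, for each id, the row whose date is the maximum within that group
most_recent = data_frame.loc[data_frame.groupby('id')['date'].idxmax()]
most_recent.to_csv('output', sep=',', header=False, index=False)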

Kaushal Pahwani
import pandas as pd

f = open("sample.txt", "w+")
f.write("1||mike||jones||38||first street||2018-05-01\n2||michale||jones||38||8th street||2018-05-01\n3||mich||jones||38||9th street||2018-05-01\n4||mitchel||jones||38||10th street||2018-05-01\n1||mike||jones||38||first street||2018-12-01")
f.close()

# Read the '||'-delimited file, parsing the date field as a datetime
tbl = pd.read_csv("sample.txt", sep=r'\|\|', engine="python",
                  names=("id", "firstName", "lastName", "age", "address", "applicationDate"),
                  dtype={"id": int, "firstName": str, "lastName": str, "age": int, "address": str},
                  parse_dates=["applicationDate"])


# Note-
# Records with id=2, 3, 4 are distinct based on address;
# only the record with id=1 is duplicated. Hence the source system already takes care of identifying
# duplicate registrations, so we only need to find duplicates based on id and keep the most recent
# record by application date (no need to re-implement any other duplicate-detection logic).


for id in set(tbl["id"]):
    # Create a temporary data frame of the rows with the given id, ranked on the value in each field.
    tempRankDF = tbl.loc[tbl["id"] == id].rank(ascending=False)

    # Note- the rank function ranks each field according to its data type.
    # Because "applicationDate" is parsed as a datetime, ranking it in descending order
    # means the most recent record gets rank == 1.

    # Get the index of the most recent record with respect to the original data frame
    recentRowIndex = tempRankDF.loc[tempRankDF["applicationDate"] == 1].index[0]

    print(tbl.iloc[recentRowIndex])


# Note: update the code inside the for loop as per your convenience to write the final result set
# to a file, another data frame, or a database.
# You can execute this code directly and check the result set.
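
For reference, one way to follow that last note and collect the per-id results into a single file (a sketch reusing the loop above; the output name result.txt is a placeholder):

# Collect the most recent row for each id and write them out ("result.txt" is a placeholder name)
recentRows = []
for id in set(tbl["id"]):
    tempRankDF = tbl.loc[tbl["id"] == id].rank(ascending=False)
    recentRowIndex = tempRankDF.loc[tempRankDF["applicationDate"] == 1].index[0]
    recentRows.append(tbl.iloc[recentRowIndex])

pd.DataFrame(recentRows).to_csv("result.txt", index=False)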