name  :  major   : start time : mark
Dean  :  English :  05:00:00  : 70
Dean  :  Japan   :  06:00:00  : 80
Sam   :  France  :  07:00.00  : 60

The above sample is in sample.txt. I want to get the output according to the following instructions. Could you please help me?

  1. I want to split each line by (:) and read the file line by line.
  2. If a duplicate name is found, the entry with the earlier start time should be output. For the above sample.txt, Dean : English : 05:00:00 : 70 will be output.
with open("sample.txt", mode="r") as f:
    text = f.readlines()
    for line in text:
        line_data = line.split(":")
        name = line_data[0].strip()
        ---------
        ---------
        ---------
Samael2021
    This sounds a lot like an assignment. Please try to solve it yourself first. And please read [Asking about homework](https://meta.stackoverflow.com/questions/334822/how-do-i-ask-and-answer-homework-questions) to get a sense of how to ask good questions related to assignments. – joanis Jan 19 '22 at 13:51
  • 1
    You can append "names" inside a dictionary and other attributes as a value for that, then if a "name"(key) already exists, you can compare the time, and print the lower time. – Ahmad Anis Jan 19 '22 at 13:53
  • @AhmadAnis this is an easy-concept (dict) hint. I reused it in my answer to approach de-duplication (though not directly integrated into parsing (dict by name), but within a separate function). Would love to see your answer (even as pseudo-code). – hc_dev Jan 20 '22 at 00:44

2 Answers


You have the starter code already.

Recipe to follow as guideline

Let's try to decompose the problem into parts, then solve those sub-problems step by step, separately (similar to the problem-solving strategy divide and conquer).

3 parts to solve

I would split into 3 parts, following the IPO model: Input > Process > Output.

(1) read the file and split = parse the records (input)

Inside the loop:

  1. Try to recognize the field-reading pattern and add the remaining fields similarly: field = line_data[3] (for the 4th field).
  2. Then think of collecting the fields you have read into a record dictionary (a dict, something like name-value pairs for each line).
  3. After all fields are read and stored in a record, add it to a collection of parsed records, e.g. a list you created before the loop: records.append(record).

(2) sort and filter the parsed records = de-duplicate (process)

Outside, after the loop:

  1. Work with the list and try sorting or filtering it to remove the duplicates as required.
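One possible sketch for this step uses a plain dict keyed by name (an assumption: each record is a dict with 'name' and 'time' keys as built in part (1), and times are zero-padded "HH:MM:SS" strings, which compare correctly as plain text):

```python
def filter_earliest(records):
    earliest = {}  # maps name -> record with the smallest time seen so far
    for record in records:
        name = record['name']
        # zero-padded "HH:MM:SS" strings sort correctly as plain text
        if name not in earliest or record['time'] < earliest[name]['time']:
            earliest[name] = record
    return list(earliest.values())
```

This avoids sorting entirely: one pass over the list, keeping only the earliest entry per name.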

(3) format the filtered records (output)

Format the parsed and de-duplicated records back to a string. Then output the string (either print to console or write to a file).

(1) Code explained and prepared to extend

# WHAT YOU ALREADY HAVE, EXPLAINED WITH COMMENTS
with open("sample.txt", mode="r") as f:  # open file ("sample.txt") for reading ("r") via the handle f
    text = f.readlines()  # read all lines into the list named text

    # IDEA: create an empty list to collect named records
    records = []

    for line in text:  # iterate over each line in text
        line_data = line.split(":")  # split each line at the delimiter ":" into a list of strings (fields or columns)
        # note: ":" also occurs inside the time, so a plain split cuts the
        # time into pieces; consider line.split(":", maxsplit=2) or rejoining
        name = line_data[0].strip()  # strip the 1st field (containing the name) to remove surrounding spaces

# HERE YOU CAN BUILD ON
        # read the other fields
        course = line_data[1].strip()  # 2nd field, stripped
        # same with the 3rd and 4th fields

# IDEA: dict with named fields you need later for filtering
        record = {'name': name, 'course': course, 'time': '00:00', 'points': 0}  # placeholders for 'time' and 'points'
        records.append(record)  # add the parsed record to the collection

Consider putting your existing parsing logic (1) into a function:

# (1) reading file and parsing records
def parse_records():
    # add your existing code here
    return records 

Then you can easily add new functions (2 and 3):

# your main script starts here calling sub-routines (functions)
if __name__ == '__main__':
    records = parse_records()  # your first part solved (1)
    print(records)  # debug output to see if parsing works
    # now solve the sorting and filtering (2)
    # then print out the filtered records as formatted string (3) 

(2) Sorting/Filtering

Find duplicates and sort/filter them for earliest time.

In pseudo-code (using unimplemented functions):

def filter_duplicates_by_name(records):
    previous = None
    for record in sort_by_name(records):
        if previous is not None and previous['name'] == record['name']:
            print("name duplicate found: " + str(record))
            # if not already sorted by time, compare times and
            #   either (a) put the earlier one into a result instead of the later
            #   or (b) remove the later one from the list (filtered)
        else:
            # no duplicate, or not yet: consider adding it to the result
            pass
        previous = record

    # return either (a) the result or (b) the filtered list
    return records

Implement the function sort_by_name(records). It should sort the list by name (and maybe also by time as a secondary key) and return the sorted result.

def sort_by_name(records):
    result = []  # avoid shadowing the built-in name "sorted"
    # sort the list, e.g. using a for-loop and if-else
    return result
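As a shortcut (not required by the recipe above), Python's built-in sorted() with a key function can do the same in one line; the tuple key sorts by name first and time second, assuming the record dicts from part (1):

```python
def sort_by_name(records):
    # sorted() with a tuple key sorts by name first, then by time,
    # so the earliest entry comes first within each name group
    return sorted(records, key=lambda r: (r['name'], r['time']))
```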

Then you can use it to output the filtered records:

filtered = filter_duplicates_by_name(records)
for record in filtered:
    print(record)
    # or format it back to colon-separated values

(3) Formatting parsed records as string

You may recognize join as the counter-part to split.
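For example (a sketch, assuming the record dict from part (1)):

```python
record = {'name': 'Dean', 'course': 'English', 'time': '05:00:00', 'points': 70}
# join is the counter-part to split: glue the fields back together with the delimiter
fields = [record['name'], record['course'], record['time'], str(record['points'])]
line = " : ".join(fields)
print(line)  # Dean : English : 05:00:00 : 70
```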

Python basics & tutorials applicable here: data structures, control flow, string formatting, and sorting.

hc_dev
  • Here explicitly chosen a simpler approach with entry-level concepts (for-loops, list, dict) to keep focus on problem solving and learn basic data-structures and control-flow. – hc_dev Jan 19 '22 at 23:04
  • thanks for your help and detail explanation bro. It helps me alot. – Samael2021 Jan 20 '22 at 01:57
  • @Samael2021 Good to see my effort helped. Consider updating your question with your attempts as [example]. Keep giving feedback when [someone answers](https://stackoverflow.com/help/someone-answers). – hc_dev Jan 20 '22 at 06:12

I've put in comments for explanation:

from itertools import groupby
from time import strptime
from re import search

with open('sample.txt') as f:
    # skipping the header
    header = next(f)
    lines = [line.strip() for line in f]

# we need to sort the list first because we will use groupby later. This
# sorting is based on name (first item in every line after splitting)
lines.sort(key=lambda x: x.split(':', maxsplit=1)[0].strip())

# Now we are ready to group, using the same key as before
# (the first item in every line after splitting).
lists = [list(g) for _, g in
         groupby(lines, key=lambda x: x.split(':', maxsplit=1)[0].strip())]

time_pattern = r'\d{2}:\d{2}:\d{2}'

# It's time to iterate over the list and print the items. Lists with more than
# one item need to be sorted because they had duplicate names. This sorting
# is based on the time. We first extract the time with a regex, then create a
# `struct_time` object from it; these objects are sortable. After sorting,
# we print just the first item, which has the smallest time.
for item in lists:
    if len(item) > 1:
        # duplicates: sort by time so the earliest entry comes first
        item.sort(
            key=lambda x: strptime(search(time_pattern, x).group(), "%H:%M:%S"))
    print(item[0])

output:

Dean  :  English :  05:00:00  : 70
Sam   :  France  :  07:00.00  : 60
S.B
  • The solution is well explained, but uses advanced concepts (list comprehension, lambda) and modules (itertools, time). Could be overwhelming for a newbie asking an assignment question. – hc_dev Jan 19 '22 at 23:02