I have this text file and let's say it contains 10 lines.

Bye
Hi
2
3
4
5
Hi
Bye
7
Hi

I want every repeated "Hi" and "Bye" removed, keeping only the first occurrence of each. My current code is below (filename does point to a real file; I just left the assignment out of this snippet):

text_file = open(filename)
for i, line in enumerate(text_file):
    if i == 0:
        var_Line1 = line
    if i == 1:
        var_Line2 = line
    if i > 1:
        if line == var_Line2:
            del line  # only unbinds the loop variable; the file itself is unchanged
text_file.close()

It does detect the duplicates, but it takes a very long time given how many lines there are. I'm also not sure how to delete the duplicates and save the result.

Panderex
  • Do you want to remove all duplicates or just duplicates of the first 2 lines? – Axe319 Sep 16 '22 at 14:21
  • Sorry for the late reply. Yes, all duplicates except for the first 2 lines. I've gotten it to remove duplicates of the 2nd line, but the first line has the same info except for one number that changes, and I want to remove those as well – Panderex Sep 16 '22 at 15:38

2 Answers

Using a set & some basic filtering logic:

with open('test.txt') as f:
    seen = set()  # keep track of the lines already seen
    deduped = []
    for line in f:
        line = line.rstrip()
        if line not in seen:  # if not seen already, write the lines to result
            deduped.append(line)
        seen.add(line)

# re-write the file with the de-duplicated lines
with open('test.txt', 'w') as f:
    f.writelines([l + '\n' for l in deduped])
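
Since the question mentions a large file, here is a hedged sketch of a streaming variant that never holds the full list of output lines in memory. The function name `dedupe_file` is hypothetical, and the approach assumes that writing to a temporary file in the same directory and then swapping it in is acceptable. The `seen` set still grows with the number of unique lines.

```python
import os
import tempfile


def dedupe_file(path):
    """Rewrite `path` in place, keeping only the first occurrence of each line."""
    seen = set()
    directory = os.path.dirname(os.path.abspath(path))
    # Write the kept lines to a temporary file next to the original.
    with open(path) as src, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False
    ) as dst:
        for line in src:
            key = line.rstrip("\n")  # compare without the trailing newline
            if key not in seen:
                seen.add(key)
                dst.write(key + "\n")
    # Both files are closed here; swap the de-duplicated copy into place.
    os.replace(dst.name, path)
```

Comparing on `line.rstrip("\n")` rather than `line.rstrip()` keeps trailing spaces significant, which may or may not be what you want.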
rdas

You could use dict.fromkeys to remove duplicates and preserve order efficiently:

with open(filename, "r") as f:
    lines = dict.fromkeys(f.readlines())
with open(filename, "w") as f:
    f.writelines(lines)

Idea from Raymond Hettinger
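
To illustrate on the sample from the question (a small sketch, with the lines held in a list rather than read from a file): `dict.fromkeys` keeps the first occurrence of each key, and dicts preserve insertion order in Python 3.7+. One caveat, assuming the file may lack a trailing newline: a final "Hi" without "\n" is a different string from "Hi\n" and would not be treated as a duplicate.

```python
sample = ["Bye", "Hi", "2", "3", "4", "5", "Hi", "Bye", "7", "Hi"]

# Keys keep their first-seen order; repeated keys are ignored.
deduped = list(dict.fromkeys(sample))
print(deduped)  # ['Bye', 'Hi', '2', '3', '4', '5', '7']

# Caveat: with raw lines, a last line missing its newline is a distinct key.
raw_lines = ["Hi\n", "Bye\n", "Hi"]
print(list(dict.fromkeys(raw_lines)))  # ['Hi\n', 'Bye\n', 'Hi']
```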

ssp
  • Works well, but how could I get the line number itself? For example, given 1. Hi 2. Bye 3. Bye, how could I know exactly which line got removed? I'd like it to output 3, since that's the duplicated line – Panderex Sep 16 '22 at 15:24
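
One possible way to report which lines were dropped, as a sketch rather than part of either answer: track seen lines in a set and record the 1-based number of every repeat. The function name `find_duplicate_lines` is made up for illustration.

```python
def find_duplicate_lines(lines):
    """Return (kept, removed); removed holds 1-based numbers of duplicate lines."""
    seen = set()
    kept = []
    removed = []
    for number, line in enumerate(lines, start=1):
        key = line.rstrip("\n")  # ignore a missing newline on the last line
        if key in seen:
            removed.append(number)
        else:
            seen.add(key)
            kept.append(line)
    return kept, removed


sample = ["Hi\n", "Bye\n", "Bye\n"]
kept, removed = find_duplicate_lines(sample)
print(removed)  # [3]
```

To apply it to a file, pass the open file object as `lines` and write `kept` back out afterwards.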