Removing duplicate rows from a csv file using a python script

I can't figure out how to remove duplicate rows based on column 2. I looked at the documentation for the csv module, but I couldn't see anything there to implement this with.

My current output for list-history.csv:

Number,Keywords
5,banana
8,apple
Number,Keywords
5,banana
Number,Keywords
5,banana
8,apple

Desired output:

Number,Keywords
5,banana
8,apple

New entries should then be appended to this desired output.

I tried another approach, and this is the closest I found, but it doesn't address column 2. I don't really know what to do from this point.

with open("list-history.csv", "r") as f:
    lines = f.readlines()

with open("list-history.csv", "a", encoding="utf8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for line in reader:
        if line.strip("\n") == "Number,Keywords":
            f.write(line)

But this code only deals with the repeated header; it doesn't remove the other duplicates within column 2. I just want to keep the header once and have no duplicates beyond that. My constraint is that data keeps coming in from file1 to file2, the latter being the file used in the code above.

=== SOLVED ISSUE ===

import fileinput

seen = set()  # set for fast O(1) amortized membership tests
for line in fileinput.FileInput('1.csv', inplace=True):
    if line in seen:
        continue  # skip duplicate lines
    seen.add(line)
    print(line, end='')  # with inplace=True, stdout is redirected into the file
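
The snippet above drops a row only when the entire line repeats. A variant keyed on column 2 alone, still using fileinput together with csv, might look like this (a sketch; '1.csv' is the same placeholder file name as above):

import csv
import fileinput

seen = set()  # Keywords values seen so far
for line in fileinput.FileInput('1.csv', inplace=True):
    row = next(csv.reader([line]), [])  # parse this single line as CSV
    if len(row) < 2:
        continue  # skip blank or malformed lines
    if row[1] in seen:
        continue  # this Keywords value already appeared; drop the row
    seen.add(row[1])
    print(line, end='')  # with inplace=True, stdout is redirected into the file

The header survives because its column-2 value, "Keywords", is only ever seen once.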

timotom

4 Answers


I don't know whether you're allowed to use modules other than csv for your task; but if you are, you can solve this with pandas.

import pandas as pd

df = pd.read_csv('list-history.csv')
# keep only the first occurrence of each Keywords value
df = df.drop_duplicates(subset=['Keywords'], keep='first')
print(df)
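
If you also want the result written back to disk rather than just printed, a small addition on top of this should do it (index=False keeps pandas' row index out of the file):

# write the deduplicated rows back, without the pandas row index
df.to_csv('list-history.csv', index=False)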
  • Technically, I could. But it's more about staying consistent with the one module I started working with. Your input is certainly valuable, though. Thanks. – timotom Dec 20 '19 at 02:35

Dropping ALL duplicate values, i.e. removing every row whose Keywords value occurs more than once:

data.drop_duplicates(subset="Keywords", keep=False, inplace=True)
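
Note that keep=False discards every occurrence of a duplicated value, which differs from the question's goal of keeping the first one. A minimal self-contained sketch on the sample data (the DataFrame here is constructed by hand for illustration):

import pandas as pd

data = pd.DataFrame({'Number': [5, 8, 5], 'Keywords': ['banana', 'apple', 'banana']})
data.drop_duplicates(subset="Keywords", keep=False, inplace=True)
print(data)  # only the 8,apple row survives; both banana rows are dropped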
freak7

You could do it in two steps, as shown below. The first step reads the lines of the file into a collections.OrderedDict, which automatically keeps duplicates out of it.

The second step simply overwrites the file with the keys of this dictionary.

from collections import OrderedDict
import csv

# 1. Read file into an OrderedDict which automatically removes any duplicates.
with open("list-history.csv", "r") as file:
    temp_dict = OrderedDict.fromkeys(line.strip() for line in file)

# 2. Rewrite file.
with open("list-history.csv", "w", newline='') as file:
    writer = csv.writer(file)
    for row in csv.reader(temp_dict):  # iterating the dict yields its keys, i.e. the unique lines
        writer.writerow(row)

In Python 3.7+ you can use a regular dictionary because, beginning with that version, dictionaries also maintain insertion order.
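
A sketch of that variant, assuming Python 3.7+ (otherwise identical to the answer above):

import csv

with open("list-history.csv", "r") as file:
    unique_lines = dict.fromkeys(line.strip() for line in file)  # insertion-ordered since 3.7

with open("list-history.csv", "w", newline='') as file:
    writer = csv.writer(file)
    for row in csv.reader(unique_lines):
        writer.writerow(row)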

martineau

You need to keep a set of lines that you have seen so far, and you do not even need a CSV reader:

with open("list-history.csv") as infile,
     open("list-history-copy.csv", "w", encoding="utf8") as outfile:
    lines = set()
    for line in infile:
        if line not in lines:
            data.add(lines)
            outfile.writeline(line + "\n")
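
This deduplicates whole lines. If uniqueness should apply strictly to column 2, the set can hold just the Keywords field instead; a sketch, splitting naively on commas since the sample data contains no quoted fields:

seen = set()
with open("list-history.csv") as infile, \
     open("list-history-copy.csv", "w", encoding="utf8") as outfile:
    for line in infile:
        keyword = line.rstrip("\n").split(",")[-1]  # column 2 of "Number,Keywords"
        if keyword not in seen:
            seen.add(keyword)
            outfile.write(line)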
DYZ