How can I import a CSV file and identify duplicate values in a column? I have to compare each name with all the other names in the same column and identify whether there are any duplicate records.

import csv

fruits_name_field_num = 0  # index of the column that holds the names

rowCount = 0
seen_fruits_names = set()  # every name read so far

# 'fruits.csv' is a placeholder file name
with open('fruits.csv', newline='') as csv_file:
    readRecord = csv.reader(csv_file)
    for line in readRecord:
        rowCount += 1
        print(rowCount)
        fruits_name = str(line[fruits_name_field_num])
        # compare against all earlier names, not against the value itself
        if fruits_name in seen_fruits_names:
            print('same_fruits_name')
        else:
            print('different_fruits_name')
        seen_fruits_names.add(fruits_name)
Charles
  • Welcome to SO! What have you tried so far? Could you please post some of your code, as well as your desired output? – Hayden Y. Aug 16 '19 at 17:11
  • fruits_name_field_num1 = 0 fruits_name_field_num2 = 0#from collections import Counter rowCount = 0 fruits_name1 = '' save_fruits_name1 = '' for line in readRecord: rowCount += 1 row_number = str(rowCount) #print(rowCount) #save_fruits_name = fruits_name fruits_name = (str(line[fruits_name_field_num])) save_fruits_name = fruits_name if fruits_name == save_fruits_name: print('same_fruits_name') else: print('different_fruits_name') – zarinmosfaka Aug 16 '19 at 17:36
  • This is mock code. I can't share anything from my project. However, the idea is the same. I don't want the same fruit name twice. – zarinmosfaka Aug 16 '19 at 17:38
  • Post the code in your question, please. – allenski Aug 16 '19 at 17:47
  • Well, first of all, don't import as a dict; import as a list, and the first row should hold your headers. From there you should be able to identify the duplicate column names and act accordingly. – GSazheniuk Aug 16 '19 at 18:22
  • Possible duplicate of [python pandas remove duplicate columns](https://stackoverflow.com/questions/14984119/python-pandas-remove-duplicate-columns) – BPDESILVA Aug 16 '19 at 18:37

2 Answers

You can do this easily with pandas. For example, suppose you have a DataFrame that looks like this:

       a      b
0    Bob  Sarah
1   Rick  Sarah
2  Steve   Rick
3    Bob   Matt
4    Ben    Ben
5  Steve    Bob

To find just the names that are repeated in column 'a' of this df, you can do:

df_duplicates = df[df.duplicated('a')]['a']
print(df_duplicates)

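This outputs only the second and later occurrences of each repeated name (the first 'Bob' and the first 'Steve' are not flagged):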

3      Bob
5    Steve

Full code:

import pandas as pd

df = pd.read_csv('something.csv')
print(df)
df_duplicates = df[df.duplicated('a')]['a']
print(df_duplicates)
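
As a hedged variation on the snippet above: passing keep=False to duplicated() flags every occurrence of a repeated name instead of only the later ones, and drop_duplicates() yields each name once. The DataFrame below is rebuilt inline from the example table (an assumption made so the sketch runs without a CSV file), and the variable names are illustrative only.

```python
import pandas as pd

# Same data as the example table, built inline instead of read from a CSV
df = pd.DataFrame({
    'a': ['Bob', 'Rick', 'Steve', 'Bob', 'Ben', 'Steve'],
    'b': ['Sarah', 'Sarah', 'Rick', 'Matt', 'Ben', 'Bob'],
})

# keep=False marks every row whose name occurs more than once in column 'a'
all_repeated_rows = df[df.duplicated('a', keep=False)]

# The distinct names that are repeated
repeated_names = all_repeated_rows['a'].unique().tolist()

# Each name exactly once, in order of first appearance
unique_names = df['a'].drop_duplicates().tolist()

print(repeated_names)
print(unique_names)
```

If the goal is to print each name only once, unique_names is probably the list to iterate over.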
Anna Nevison

If you want to identify duplicate column names and keep only the first occurrence of each, the simplest way is a single line:

df = df.loc[:, ~df.columns.duplicated()]

df.columns.duplicated() returns a boolean mask that is True for every column whose name has already appeared (so not the first occurrence, but all subsequent ones). The tilde (~) inverts the mask, so that it is True only for the first occurrence of each name. Finally, .loc[] selects only the columns where the mask is True.
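
A small sketch of the mask described above, using a made-up DataFrame whose column name 'a' appears twice (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame with a repeated column name
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'a'])

# True for every column name that has already appeared to its left
mask = df.columns.duplicated()
print(mask)

# Invert the mask and keep only the first occurrence of each column name
deduped = df.loc[:, ~mask]
print(list(deduped.columns))
```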


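If you want to see which values occur multiple times within a column, you can use:

# Map each column name to a boolean Series that is True on rows
# whose value already appeared earlier in that column
dupes = {}
for col in df.columns:
    dupes[col] = df.duplicated(subset=col)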
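
The loop above stores boolean masks rather than the repeated values themselves. One possible way to turn the masks into values is sketched here, reusing the example table from the other answer as inline data (an assumption, as is the name repeated_values):

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['Bob', 'Rick', 'Steve', 'Bob', 'Ben', 'Steve'],
    'b': ['Sarah', 'Sarah', 'Rick', 'Matt', 'Ben', 'Bob'],
})

# For each column, select the rows flagged as repeats and list
# the distinct values found there
repeated_values = {
    col: df.loc[df.duplicated(subset=col), col].unique().tolist()
    for col in df.columns
}
print(repeated_values)
```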
Charles
  • Thanks. However, my question was how I can prevent printing duplicates. Given column a with 0 Bob, 1 Rick, 2 Steve, 3 Bob, 4 Ben, 5 Steve, Bob and Steve should each be displayed only once, with no duplicates. I can't import pandas; it shows me an error. I am working with PyCharm. Thanks – zarinmosfaka Aug 16 '19 at 21:13
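
For the situation in the last comment, where pandas cannot be imported, a minimal sketch using only the standard library might look like the following. The helper name split_unique_and_repeated, the column index 0, and the sample data are assumptions for illustration. The rows come from an in-memory string so the sketch runs as is.

```python
import csv
import io
from collections import Counter

def split_unique_and_repeated(rows, field_num=0):
    """Return (names without repeats in first-seen order, names seen more than once)."""
    names = [row[field_num] for row in rows if row]  # skip empty rows
    counts = Counter(names)
    # dict.fromkeys keeps the first occurrence of each name, in order
    unique_names = list(dict.fromkeys(names))
    repeated_names = [name for name in unique_names if counts[name] > 1]
    return unique_names, repeated_names

# Hypothetical one-column CSV content matching the comment's example
sample = "Bob\nRick\nSteve\nBob\nBen\nSteve\n"
rows = csv.reader(io.StringIO(sample))

unique_names, repeated_names = split_unique_and_repeated(rows)
print(unique_names)    # each name once
print(repeated_names)  # names that occurred more than once
```

With a real file, the io.StringIO object would be replaced by a file opened with open(path, newline=''), where path is whatever CSV is being read.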