
I have a .csv file with 9 columns. I need to get a list of the fifth column with no duplicates, without using pandas. The values in the column are product IDs, things like "H0073456". There are over 1 million rows in the file. It is almost 4am and I'm getting sad. Help!

It seems like I need to generate a list of the values in the column, but I can't figure out how. Every time I tried, I only ever got the first value, or each character of the first value separated.

import csv

with open('myfile.csv', 'r') as f_the_file:
    reader = csv.reader(f_the_file)
    for row in reader:
        print(row[4])

This gives me the whole column but still includes the duplicates. How do I filter them out?

The expected result is a list of the values in the 5th column of the .csv file with no duplicates. Right now the output includes duplicate values.

Sam
  • A sample of `myfile.csv`? – DirtyBit Apr 08 '19 at 07:51
  • If you don't care about the ordering of the de-duped values, just construct a `set`. If you do care about the ordering, see https://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set – NPE Apr 08 '19 at 07:54
  • How about `list(set(row[4]))`? – Chris Apr 08 '19 at 07:54
  • Possible duplicate of [How to create a list in Python with the unique values of a CSV file?](https://stackoverflow.com/questions/24441606/how-to-create-a-list-in-python-with-the-unique-values-of-a-csv-file) – Georgy Apr 08 '19 at 08:18

3 Answers


You should take a look at the perfect object for this: set()

A set removes duplicates and lets you check whether a value is in it in O(1) on average.
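For example, a quick interactive sketch using the sample ID from your question:

>>> ids = set()
>>> ids.add('H0073456')
>>> ids.add('H0073456')  # adding a duplicate again changes nothing
>>> ids
{'H0073456'}
>>> 'H0073456' in ids    # membership test, O(1) on average
True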

So your code should look like:

import csv

without_duplicates = set()
with open('myfile.csv', 'r') as f_the_file:
    reader = csv.reader(f_the_file)
    for row in reader:
        without_duplicates.add(row[4])
Born Tbe Wasted
  • All of these answers seem like they work, but in the directions it says that I should have 13,000 values, and I am only getting 7,000. Could there be any reason why? I am trying to double-check by counting the values in Excel, but the file is so large it just keeps freezing my computer. – Sam Apr 08 '19 at 21:32
  • In the for-loop, also append each value to a list, just so you know how many were originally in the file. If that number is less than 13K, then we have a problem ^^ – Born Tbe Wasted Apr 09 '19 at 16:41
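A minimal sketch of the check Born Tbe Wasted suggests, assuming the same myfile.csv as in the question (and that it has no header row):

import csv

all_values = []        # every fifth-column value, duplicates included
unique_values = set()  # the de-duplicated values

with open('myfile.csv', 'r') as f_the_file:
    reader = csv.reader(f_the_file)
    for row in reader:
        all_values.append(row[4])
        unique_values.add(row[4])

# If len(all_values) is already below 13,000, the file itself has fewer
# rows than expected; if only len(unique_values) is, the column really
# does contain that many duplicates.
print(len(all_values), len(unique_values))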

Initialize an empty set and then add elements to it as you read; this way you only ever keep unique elements. After you finish reading the file, you can convert the set to a list if you need one.

import csv

productID = set()
with open('myfile.csv', 'r') as f_the_file:
    reader = csv.reader(f_the_file)
    for row in reader:
        productID.add(row[4])

productID_list = list(productID)
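Note that a set is unordered, so the list above will come back in arbitrary order. If a stable (alphabetical, not file) order is good enough for your purposes, one option is:

productID_list = sorted(productID)  # alphabetical order, not the original file order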
user3053452

You can just use a set comprehension for this:

import csv

with open('myfile.csv') as f:
    product_ids = {row[4] for row in csv.reader(f)}

If you absolutely need a list, just call `product_ids = list(product_ids)` afterwards.


If you need to preserve the original order (keeping each value where it first appeared), you should use the itertools recipe unique_everseen (this might require a lot of memory):

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

import csv

with open('myfile.csv') as f:
    product_ids = list(unique_everseen(row[4] for row in csv.reader(f)))
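A quick sanity check of the recipe, using made-up IDs in the same style as the question:

>>> list(unique_everseen(['H0073456', 'H0091822', 'H0073456', 'H0055512']))
['H0073456', 'H0091822', 'H0055512']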
Graipher