
This type of question has been asked many times, so apologies; I have searched hard for an answer but have not found anything close enough to my needs (and I am not sufficiently advanced - I am a total newbie - to customize an existing answer). So thanks in advance for any help.

Here's my query:

  • I have 30 or so csv files and each contains between 500 and 15,000 rows.
  • Within each of them (in the 1st column) - are rows of alphabetical IDs (some contain underscores and some also have numbers).
  • I don't care about the unique IDs - but I would like to identify the duplicate IDs and the number of times they appear in all the different csv files.
  • Ideally I'd like the output for each duped ID to appear in a new csv file and be listed in 2 columns ("ID", "times_seen")

It may be that I need to compile just 1 csv with all the IDs for your code to run properly - so please let me know if I need to do that.

I am using python 2.7 (a crawling script that I run needs this version, apparently).

Thanks again

  • Are IDs that are the same except for underscores considered duplicate? Also have same question about those that have numbers and those that don't... – martineau Feb 13 '17 at 16:04
  • If you just want a histogram of the IDs, a one-line awk script might fit the bill. Do you need to use python? – gregory Feb 13 '17 at 17:08

4 Answers


It seems the easiest way to achieve what you want would be to make use of dictionaries.

import csv
import os
# Assuming all your csv are in a single directory we will iterate on the 
# files in this directory, selecting only those ending with .csv

# to list files in the directory we will use the walk function in the 
# os module. os.walk(path_to_dir) returns a generator (a lazy iterator)
# this generator generates tuples of the form root_directory, 
# list_of_directories, list_of_files. 
# So: declare the generator
file_generator = os.walk("/path/to/csv/dir")
# get the first values; as we won't recurse into subdirectories, we
# only need this one
root_dir, list_of_dir, list_of_files = file_generator.next()
# Now, we only keep the files ending with .csv. Let me break that down
csv_list = []
for f in list_of_files:
    if f.endswith(".csv"):
        csv_list.append(f)
# That's what was contained in the line
# csv_list = [f for _, _, f in os.walk("/path/to/csv/dir").next() if f.endswith(".csv")]

# The dictionary (key value map) that will contain the id count.
ref_count = {}
# We loop on all the csv filenames...
for csv_file in csv_list:
    # open the file in read mode (csv_list holds bare file names,
    # so we join them with the directory path)
    with open(os.path.join(root_dir, csv_file), "r") as _:
        # build a csv reader around the file
        csv_reader = csv.reader(_)
        # loop on all the lines of the file, transformed to lists by the 
        # csv reader
        for row in csv_reader:
            # If we haven't encountered this id yet, create 
            # the corresponding entry in the dictionary.
            if row[0] not in ref_count:
                ref_count[row[0]] = 0
            # increment the number of occurrences associated with
            # this id
            ref_count[row[0]] += 1
# now write to csv output (opened in binary mode, as the Python 2
# csv module recommends, to avoid blank lines on Windows)
with open("youroutput.csv", "wb") as _:
    writer = csv.writer(_)
    for k, v in ref_count.iteritems():
        # as requested we only take duplicates
        if v > 1:
            # use the writer to write the list to the file
            # the delimiters will be added by it.
            writer.writerow([k, v])

You may need to tweak the csv reader and writer options a little to fit your needs, but this should do the trick. You'll find the documentation here: https://docs.python.org/2/library/csv.html. I haven't tested it, though; correcting the little mistakes that may have occurred is left as a practice exercise :).
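For example, if your files used semicolons instead of commas, the reader options might look like this (a minimal sketch; the file name and delimiter are placeholders, not taken from the question):

import csv

# Hypothetical example: semicolon-separated values with quoted fields.
# The Python 2 csv docs recommend opening the file in binary mode.
with open("example.csv", "rb") as f:
    reader = csv.reader(f, delimiter=";", quotechar='"')
    for row in reader:
        print row  # each row is a list of the fields on that line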

  • He is a newbie; he probably won't understand a piece of code with no comments (don't tell me it is commented). There is a built-in class that already does the counting for you, called `Counter`; you are reinventing the wheel – Adirio Feb 13 '17 at 16:11
  • As you said, he is a newbie; that is why I provided a solution that is both incomplete (some tweaks needed, therefore necessitating a bit of reflection on the code) and explicit. There is not too much pythonic sugar and/or external library. Without being a python expert it seems to me that the parts of the code are quite self-explanatory, and the steps broken down. – Anis Feb 13 '17 at 16:15
  • `with` is pythonic, list comprehensions are pythonic, including file system instructions inside a list comprehension is quite hard to understand for newbies, having two comments in all that code is by no means breaking down the steps, it may be enough for you to understand it later, but not enough as documentation, and checking if v is greater than 1 is not needed – Adirio Feb 13 '17 at 16:20
  • It is needed, since the OP's request clearly specifies "Ideally I'd like the output for **each duped ID** to appear in a new csv file and be listed in 2 columns". As to the comment on the code, I admit that the list comprehension might be tortuous for a newbie. I didn't really expand on that since it was not the heart of the question, and since the OP didn't provide any information as to how his csv files were organised. – Anis Feb 13 '17 at 16:23
  • How do you like the response now with the added comments? – Anis Feb 13 '17 at 16:30
  • Downvote removed, still think that you should use `counter` and not do it manually – Adirio Feb 13 '17 at 16:32
  • many thanks to all of you guys - very helpful - esp @Anis. The only part of the code I don't understand is csv_list = [f for _, _, f in os.walk in line 4. And on running it - it has thrown up an error (a ValueError: too many values to unpack). Any further help greatly appreciated – JasonC Feb 13 '17 at 21:50
  • I've broken it down for you. I hope everything is clear now. The argument to `os.walk()` should be the root path of the directory containing your csv. I suggest that you investigate the use of `Counter` as @Adirio mentioned and rewrite the corresponding part of the code to get familiar with it. Don't forget to mark the answer as accepted :). – Anis Feb 14 '17 at 09:19
  • I actually gave you an example with `Counter` in my answer (nearly at the bottom; someone downvoted without posting why, probably just because I downvoted his answer, but I did explain why and, once solved, removed the downvote) – Adirio Feb 14 '17 at 09:27
  • @Anis - you're a genius. Many thanks - works like a dream. and with the comments - I can easily follow the steps (and look up what I don't quite understand). Very grateful indeed. This has really helped my understanding of python. – JasonC Feb 14 '17 at 13:45
  • I'm glad it helped! Don't forget to mark the answer as accepted if you feel it provided you with all you needed to solve your problem :). – Anis Feb 14 '17 at 15:34

That's rather easy to achieve. It would look something like:

import os

# Set to what kind of separator you have. '\t' for TAB
delimiter = ','

# Dictionary to keep count of ids
ids = {}

# Iterate over files in a dir
for in_file in os.listdir(os.curdir):
    # Check whether it is csv file (dummy way but it shall work for you)
    if in_file.endswith('.csv'):
        with open(in_file, 'r') as ifile:
            for line in ifile:
                my_id = line.strip().split(delimiter)[0]
                # If the id does not exist in the dict yet, set its count to 0
                if my_id not in ids:
                    ids[my_id] = 0
                # Increment the count
                ids[my_id] += 1

# saves duplicated ids and their counts to a file
with open('ids_counts.csv', 'w') as ofile:
    for key, val in ids.iteritems():
        # as requested, keep only the ids seen more than once,
        # writing them with the same column delimiter
        if val > 1:
            ofile.write('{}{}{}\n'.format(key, delimiter, val))
  • Read my comments on @Anis answer, as they also apply here. You really printed the output into the console after using the csv module to read? Use it to write too, then – Adirio Feb 13 '17 at 16:13
  • @Adirio, I didn't use csv module to read the file. Indeed maybe I am reinventing the wheel, however since OP is new to Python I thought such a way can show him how certain things can be done. There are always many ways to solve each problem. – Mateusz Korycinski Feb 13 '17 at 16:43
  • Oh, true, was so similar to the other answer that I didn't even read the imports. Downvote removed – Adirio Feb 14 '17 at 07:32

Check out the pandas package. You can read and write csv files quite easily with it.

http://pandas.pydata.org/pandas-docs/stable/10min.html#csv

Then, once you have the csv content as a dataframe, you can convert it with the as_matrix function. Use the answers to this question to get the duplicates as a list:

Find and list duplicates in a list?
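For the curious, here is a minimal sketch of the pandas route. It swaps in value_counts() for the as_matrix/list approach, since pandas can do the counting directly; the file names are placeholders:

import pandas as pd

# Placeholder file names; substitute your own ~30 csv paths.
files = ["file_1.csv", "file_2.csv"]

# Read only the first column of each file (assuming no header row)
# and stack all the IDs into one long Series.
ids = pd.concat([pd.read_csv(f, header=None, usecols=[0])[0] for f in files])

# Count how often each ID appears and keep only the duplicates.
counts = ids.value_counts()
dupes = counts[counts > 1].reset_index()
dupes.columns = ["ID", "times_seen"]

# Write the two-column result the question asked for.
dupes.to_csv("duplicates.csv", index=False)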

I hope this helps

  • Suggesting `pandas` for every simple script is like asking everyone to buy a car to get from his house to his neighbours house. – Adirio Feb 13 '17 at 16:08

As you are a newbie, I'll try to give some directions instead of posting an answer, mainly because this is not a "code this for me" platform.

Python has a library called csv that allows you to read data from CSV files (Boom! Surprised?). Start by reading the file (preferably an example file that you create with just 10 or so rows, and then increase the number of rows or use a for loop to iterate over different files). The examples at the bottom of the page that I linked will help you print this info.

As you will see, the output you get from this library is a list with all the elements of each row. Your next step should be extracting just the ID that you are interested in.
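For instance, reading a single file and pulling out just the first column could look like this (a minimal Python 2.7 sketch; "example.csv" is a placeholder name):

import csv

with open("example.csv", "rb") as f:
    for row in csv.reader(f):
        # row is a list of the fields on the line; the ID is the first one
        print row[0]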

The next logical step is counting the number of appearances. There is a class for this in the standard library, called Counter (in the collections module). It has a method called update that you can use as follows:

from collections import Counter
c = Counter()
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 1})
c['safddsfasdf'] # 1
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 2})
c['safddsfasdf'] # 2
c.update(['fdf'])
c # Counter({'safddsfasdf': 2, 'fdf': 1})
c['fdf'] # 1

So basically you will have to pass it a list with the elements you want to count. The list could hold more than one ID, for example reading 10 IDs before inserting them, for improved efficiency; just remember not to build a list of thousands of elements if you are seeking good memory behaviour. A small sketch of that batching idea follows.
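Here is the sketch (the sample rows are made up, standing in for what csv.reader would yield):

from collections import Counter

# Made-up rows standing in for what csv.reader would yield.
rows = [["abc", "x"], ["def_1", "y"], ["abc", "z"]]

appearances = Counter()
batch = []  # buffer a handful of IDs before updating the counter

for row in rows:
    batch.append(row[0])
    if len(batch) >= 10:   # flush the buffer every 10 IDs
        appearances.update(batch)
        batch = []
appearances.update(batch)  # count whatever is left in the buffer

print appearances  # Counter({'abc': 2, 'def_1': 1})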

If you try this and get into some trouble come back and we will help further.

Edit

Spoiler alert: I decided to give a full answer to the problem, so please skip it if you want to find your own solution and learn Python in the process.

# The csv module will help us read and write to the files
from csv import reader, writer
# The collections module has a useful type called Counter that fulfills our needs
from collections import Counter

# Getting the names/paths of the files is not this question's goal,
# so I'll just have them in a list
files = [
    "file_1.csv",
    "file_2.csv",
]

# The output file name/path will also be stored in a variable
output = "output.csv"

# We create the item that is gonna count for us
appearances = Counter()

# Now we will loop each file
for file in files:
    # We open the file in reading mode and get a handle
    with open(file, "r") as file_h:
        # We create a csv parser from the handle
        file_reader = reader(file_h)

        # Here you may need to do something if your first row is a header

        # We loop over all the rows
        for row in file_reader:
            # We insert the id into the counter; row[:1] gets explained
            # afterwards (it is the first column of the row, in list form)
            appearances.update(row[:1])

# Now we will open/create the output file and get a handle
with open(output, "w") as file_h:
    # We create a csv parser for the handle, this time to write
    file_writer = writer(file_h)

    # If you want to insert a header to the output file this is the place

    # We loop through our Counter object to write them:
    # here we have different options, if you want them sorted
    # by number of appearances Counter.most_common() is your friend,
    # if you don't care about the order you can use the Counter object
    # as if it was a normal dict

    # Option 1: ordered
    for id_and_times in appearances.most_common():
        # id_and_times is a tuple with the id and the times it appears,
        # so we check the second element (they start at 0)
        if id_and_times[1] == 1:
            # As they are ordered, we can stop the loop when we reach
            # the first 1 to finish the earliest possible.
            break
        # As we have ended the loop if it appears once,
        # only duplicate IDs will reach to this point
        file_writer.writerow(id_and_times)

    # Option 2: unordered
    for id_and_times in appearances.iteritems():
        # This time we can not stop the loop as they are unordered,
        # so we must check them all
        if id_and_times[1] > 1:
            file_writer.writerow(id_and_times)

I offered 2 options: printing them ordered (based on the Counter.most_common() doc) and unordered (based on the normal dict method dict.iteritems()). Choose one. From a speed point of view I'm not sure which one would be faster: the first needs to order the Counter but stops looping when it finds the first non-duplicated element, while the second doesn't need to order the elements but must loop over every ID. The speed will probably depend on your data.

About the row[:1] thingy:

  • row is a list
  • You can get a subset of a list telling the initial and final positions
  • In this case the initial position is omitted, so it defaults to the start
  • The final position is 1, so just the first element gets selected
  • So the output is another list with just the first element
  • row[:1] == [row[0]] They have the same output: getting a sublist containing only the first element is the same as constructing a new list with only the first element (see the short demo below)
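A quick demonstration of the difference:

row = ["abc_1", "foo", "bar"]
print row[:1]              # ['abc_1'] (a one-element list)
print row[0]               # abc_1 (the element itself)
print row[:1] == [row[0]]  # True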
  • @Adino - thanks for your help. In your code - is 'safddsfasdf' the filepath? And 'fdf' the output? – JasonC Feb 13 '17 at 21:52
  • No, 'safddsfasdf' and 'fdf' are 2 IDs, the code is just to show you how a `Counter` object looks like. I didn't want to give you a complete solution so that you could learn, but seeing they did give you the whole thing, I'll add it to my answer. If you don't want to get spoiled and prefer to do it yourself, please don't look at the last piece of code in my answer. – Adirio Feb 14 '17 at 07:35
  • @Adino - I really appreciate this help. I obv have a long way to go with python - but your comments will help me to unpick the code (and to learn). Thank you – JasonC Feb 14 '17 at 13:47
  • You are welcome. You should select the answer that best answers the question you asked as the accepted one, so that it appears at the top. By the way, my nick is not adiNo, it is adiRIo, an r and an i, not an n – Adirio Feb 14 '17 at 13:52
  • @Adiro - please accept my apologies! Bad eyesight (too many hours in front of a screen!). Thanks again – JasonC Feb 14 '17 at 14:43
  • Haha actually no, A d i r i o, you were still missing a letter. No need to keep this talk going any longer though. – Adirio Feb 16 '17 at 13:49