0

I have about 750 files (.csv), and each line in them is a single entry, which is a UUID. My goal for this script is to count how many unique UUIDs exist across all 750 or so files. The file name structure looks like the following:

DATA-20200401-005abf4e3f864dcb83bd9030e63c6da6.csv

As you can see, it has a date and some random ID. The files are all in the same directory and they all have the same extension. Each file is newline-delimited, and each line is just a UUID that looks like the following: b0d6e1e9-1b32-48d5-b962-671664484616

I tried merging all the files, but things got messy and this is about 15GB worth of data.

My final goal is to get an output that states the number of unique IDs across all the files. For example:

file1:

xxx-yyy-zzz
aaa-bbb-ccc
xxx-yyy-zzz

file2:

xxx-yyy-zzz
aaa-bbb-ccc
xxx-yyy-zzz

The final output after scanning these two files would be:

The total number of unique ids is: 2

Ryan
  • You can try reading each file one line at a time and adding each line to a set. Do this for all files. This should work if you have a relatively small number of unique IDs. Not the best method, but it should work for your case. It's definitely better than loading all the files at once. – Poojan Apr 02 '20 at 00:50
  • Do you have to use Python? A simple solution might be `cat *.csv | sort | uniq | wc -l` – jarmod Apr 02 '20 at 00:50
  • I would add the IDs to a sorted list one at a time; if the ID already exists in the list, then don't add it. The total number of unique IDs would be the length of the completed list. – Erik McKelvey Apr 02 '20 at 00:51
  • @jarmod: Replace `sort | uniq` with just `sort -u` to let `sort` do the uniquification as well (which can run a bit faster by reducing the size of the sort working set). – ShadowRanger Apr 02 '20 at 00:54
  • I don't need Python! Haha, this is great. Once files are fully transferred from S3 to local, I will run with this and give it a shot. So `cat *.csv | sort -u | wc -l` ? – Ryan Apr 02 '20 at 00:58
  • Also, will this load everything into memory? I'm dealing with just a local macbook here. – Ryan Apr 02 '20 at 00:59
  • @ShadowRanger I obviously need to run `man` more often. Thanks, good info. – jarmod Apr 02 '20 at 00:59
  • @Ryan: `sort` will spill to disk if the data exceeds the memory buffer size (which can be set manually with a command line argument; check the man page). Since `sort -u` removes duplicates eagerly, it typically reduces the need for spilling to disk. – ShadowRanger Apr 02 '20 at 01:03
  • Thanks for the info. It's running now. I see memory usage at ~9GB – Ryan Apr 02 '20 at 01:04
  • @ShadowRanger You think this will take a few hours or be relatively quick? – Ryan Apr 02 '20 at 01:10
  • @Ryan: Depends on the ratio of RAM to unique values, how often the unique values are found within a buffer's proximity of each other, etc. Total runtime, worst case, should be proportionate to reading and writing 15 GB of data twice (the actual sorting being less expensive than the I/O), so I'd guess a few minutes. – ShadowRanger Apr 02 '20 at 01:15
  • An FYI on the download from S3 to local then running the script ... you pay for data transfer out of S3 to your laptop at $0.09 per GB so about $1.35 but you don't pay for data transfer to EC2 (in the same region). And a 32GB EC2 instance costs about $0.30 per hour, pro-rated down to the second (min 60 seconds). EC2 is also a lot closer to S3 than you are, so download speeds to EC2 are much higher. – jarmod Apr 02 '20 at 01:17
  • @jarmod thanks, I was actually curious about that. Honestly, the company's S3 bucket does TBs per day, so I wasn't super worried at this point. But thanks for the knowledge. – Ryan Apr 02 '20 at 01:18
  • @ShadowRanger Hm, it's still running; do you think that's normal? It's been 15 minutes now. Just want to make sure that's expected. – Ryan Apr 02 '20 at 01:19
  • Been running for roughly 40 minutes :( Haven't seen any output yet – Ryan Apr 02 '20 at 01:39
  • This is 750 million lines by the way @ShadowRanger – Ryan Apr 02 '20 at 01:53
  • Did it complete? Are you good? – jarmod Apr 02 '20 at 11:56
  • I got it, thanks! I think it took like 8 hours – Ryan Apr 02 '20 at 19:03

5 Answers

2

I reckon using a Counter may be the fastest way to do this:

from collections import Counter

with open(filename) as f:
    c = Counter(f)
    # len(c) is the number of distinct lines, i.e. the number of unique UUIDs.
    print(len(c))

The Counter provides the count of each unique item, and its length is the number of unique items. It is implemented using a hash table, so it should be fairly quick even with a large number of items.
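For a concrete sense of what the Counter holds, here is a tiny illustration (not part of the original answer) of the difference between the number of unique items and the total number of items:

from collections import Counter

c = Counter(["a", "b", "a"])
print(len(c))            # 2  -> number of unique items
print(sum(c.values()))   # 3  -> total number of items counted
print(c.most_common(1))  # [('a', 2)] -> most frequent item and its count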

Jack Casey
  • Don't use `.readlines()`, just do `Counter(f)`. `.readlines()` forces you to load the entire file as a `list` all at once; `Counter(f)` will lazily read and count lines (keeping memory cost proportionate to number of unique items, not number of total items). `Counter` is only needed if you need the duplication counts though; sounds like they just need it uniquified, with the counts irrelevant, so a `set` would be fine. – ShadowRanger Apr 02 '20 at 01:16
  • Ah, that's a good point. I was interpreting the question as needing to know how many of each unique item we had, not just the total count of unique items. In this case a set does make more sense :) – Jack Casey Apr 02 '20 at 01:23
  • Can you modify it by chance to go through all the files? – Ryan Apr 02 '20 at 01:40
  • @Ryan Perhaps search up how to read multiple files. This is a chance for you to self-learn :) Or instead, if you can use bash, as you've mentioned in the comments, you can use `cat *.csv | this.py`. Modify this code so that it reads from standard input instead. – Gareth Ma Apr 02 '20 at 02:17
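For reference, a minimal sketch of that pipe-based variant might look like the following (the script name `count_uuids.py` is just a placeholder): it reads newline-delimited UUIDs from standard input and keeps a set of the unique ones.

import sys

# Collect every non-empty line from stdin into a set; duplicates are dropped automatically.
unique_ids = set()
for line in sys.stdin:
    line = line.strip()
    if line:
        unique_ids.add(line)

print(f"The total number of unique ids is: {len(unique_ids)}")

Invoked as `cat *.csv | python count_uuids.py`, this keeps memory proportional to the number of unique IDs rather than the total number of lines.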
1

Maybe something like this would work:

from os import listdir, path
import re
import pandas as pd

my_folder_path = "C:\\"

# Generic regular expression for the file name pattern
pat = r"DATA-\d{8}-.+\.csv$"
p = re.compile(pat)

# UUID column in each file (I don't know if this is the case; adjust accordingly).
uuid_column = "uuids"

# Empty result dataframe with a single column
result_df = pd.DataFrame(columns=["unique_uuid"])

file_list = [path.join(my_folder_path, i) for i in listdir(my_folder_path)]
for f in file_list:
    # Check for a matching file name pattern
    if p.search(f):

        # Read the file if the pattern matches.
        df = pd.read_csv(f, usecols=[uuid_column])

        # Keep only the values that are not already in the result dataframe.
        new_values = set(df[uuid_column].values).difference(result_df["unique_uuid"].values)
        result_df = pd.concat(
            [result_df, pd.DataFrame({"unique_uuid": list(new_values)})],
            ignore_index=True,
        )

print(f"The total number of unique ids is: {len(result_df)}")
Mark Moretto
1

If you don't have to use Python, then a simple solution might be the command line:

cat *.csv | sort -u | wc -l   

This pipes the content of all of the CSV files into `sort -u`, which sorts them and removes duplicates, then pipes that into `wc -l`, which does a line count.

Note: `sort` will spill to disk as needed, and you can control its memory usage with `-S size` if you like.

I'd be tempted to run this on a powerful machine with lots of RAM.

jarmod
0

Concatenating all of the csv files in a directory has been solved in a pretty popular post. The only difference here is that you drop duplicates. This would, of course, work well only if there is a significant number of duplicates in each file (at least enough for all of the deduplicated frames to fit into memory and perform the final drop_duplicates).

There are also some other suggestions in that post, such as skipping the list altogether (a sketch of that variant follows the code below).

import glob
import pandas as pd

files = glob.glob('./data_path/*.csv')

li = []

for file in files:
    df = pd.read_csv(file, index_col=None, header=None)
    li.append(df.drop_duplicates())

output = pd.concat(li, axis=0, ignore_index=True)
output = output.drop_duplicates()
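Such a list-free version could look roughly like this (a sketch assuming the same `./data_path/*.csv` layout and headerless files, with a final print added to match the question's desired output):

import glob
import pandas as pd

files = glob.glob('./data_path/*.csv')

# Feed pd.concat a generator of per-file deduplicated frames instead of building a list first.
output = pd.concat(
    (pd.read_csv(file, index_col=None, header=None).drop_duplicates() for file in files),
    axis=0,
    ignore_index=True,
).drop_duplicates()

print(f"The total number of unique ids is: {len(output)}")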
Chris
-1

Read all the files and add all the UUIDs to a set as you go. Sets enforce uniqueness, so the length of the set is the number of unique UUIDs you found. Roughly:

import csv
import os

uuids = set()
for path in os.listdir():
    with open(path) as file:
        for row in csv.reader(file):
            uuids.update(row)
print(f"The total number of unique ids is: {len(uuids)}")

This assumes that you can store all the unique UUIDs in memory. If you can't, building a database on disk would be the next thing to try (e.g. replace the set with a sqlite db or something along those lines). If you had a number of unique IDs that's too large to store anywhere, there are still solutions as long as you're willing to sacrifice some precision: https://en.wikipedia.org/wiki/HyperLogLog
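If the set really doesn't fit in memory, a minimal sketch of that SQLite variant might look like this (the `uuids.db` file name and the `.csv` filter are illustrative, not from the answer):

import csv
import os
import sqlite3

# A PRIMARY KEY column enforces uniqueness on disk instead of in memory.
conn = sqlite3.connect("uuids.db")
conn.execute("CREATE TABLE IF NOT EXISTS uuids (id TEXT PRIMARY KEY)")

for path in os.listdir():
    if not path.endswith(".csv"):
        continue
    with open(path) as file:
        # INSERT OR IGNORE silently skips UUIDs that are already stored.
        conn.executemany(
            "INSERT OR IGNORE INTO uuids (id) VALUES (?)",
            ([row[0]] for row in csv.reader(file) if row),
        )
    conn.commit()

count = conn.execute("SELECT COUNT(*) FROM uuids").fetchone()[0]
print(f"The total number of unique ids is: {count}")
conn.close()

This trades speed for memory; committing once per file keeps the transaction overhead reasonable.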

Samwise