
I am trying to create a list of unique IDs from multiple CSVs.

I have around 80 CSVs containing data, all in the same format and in the same directory. The files contain time series data from around 1500 sites, but not all sites appear in every file. The column I need is called 'Site Id'. I can get the unique values from the first CSV by creating a dataframe, but I can't see how to loop through all the remaining files.

If it's not obvious by now I am a complete beginner and my tutors are on vacation!

I've tried creating a df for a single file, but I can't figure out the next step.

import pandas as pd

df = pd.read_csv(r'C:filepathhere.csv')
ids = df['Site Id'].unique().tolist()
  • What is the final format that you need this in? A list? – MyNameIsCaleb Apr 18 '19 at 15:57
  • Thanks for asking. A list would be the best option, I think. I plan on using the list as the basis for splitting each of the CSVs by site, so instead of having 80 files of monthly data for all sites I'll have 1500 files of all data for individual sites. – ClaryT Apr 18 '19 at 16:45
  • Gotcha, I put up an answer that avoids using Pandas, which is a heavier package, in case you don't need it already or afterwards. If you are using it anyway, check out the other methods posted, which are similar to what you attempted. – MyNameIsCaleb Apr 18 '19 at 17:04

3 Answers


You can do something like this. I used the os.listdir function to get all of the files, and then list.extend to merge the site IDs from each file into my siteIDs list. Finally, turning the list into a set and then back into a list removes any duplicate entries.

import os
import pandas as pd

siteIDs = []
directoryToCSVs = r'c:\...'

for filename in os.listdir(directoryToCSVs):
    if filename.lower().endswith('.csv'):
        # read each CSV in the directory, not just a single hard-coded file
        df = pd.read_csv(os.path.join(directoryToCSVs, filename))
        siteIDs.extend(df['Site Id'].tolist())

# remove duplicate site IDs
siteIDs = list(set(siteIDs))

# siteIDs will now contain a list of the unique site IDs across all of your CSV files.
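Since the comments mention that the next step is splitting the data into one file per site, here is a minimal sketch of how that step could look with a pandas groupby, building on the same loop. The outputDir variable and the per-site file names are assumptions for illustration, and it assumes all 80 files fit in memory at once:

import os
import pandas as pd

directoryToCSVs = r'c:\...'   # same input directory as above
outputDir = r'c:\...'         # assumed output directory for the per-site files

# load every monthly CSV into a single dataframe
frames = []
for filename in os.listdir(directoryToCSVs):
    if filename.lower().endswith('.csv'):
        frames.append(pd.read_csv(os.path.join(directoryToCSVs, filename)))
allData = pd.concat(frames, ignore_index=True)

# write one file per site, named after its Site Id
for siteID, siteData in allData.groupby('Site Id'):
    siteData.to_csv(os.path.join(outputDir, f'{siteID}.csv'), index=False)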

You could do something like this to iterate over all your CSVs and load them into dataframes:

from os import walk, path
from glob import glob
import pandas as pd

csv_dir = 'Path to CSV dir'

csv_paths = []
for root, dirs, files in walk(csv_dir):
    for c in glob(path.join(root, '*.csv')):
        csv_paths.append(c)


for file_path in csv_paths:
    df = pd.read_csv(filepath_or_buffer=file_path)
    # do something with df (append, export, etc.)
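If the goal is just the unique 'Site Id' values rather than keeping every dataframe around, a small follow-up sketch (building on the csv_paths list above, with the column name taken from the question) could be:

site_ids = set()
for file_path in csv_paths:
    df = pd.read_csv(filepath_or_buffer=file_path)
    # add this file's site IDs to the running set, which drops duplicates
    site_ids.update(df['Site Id'].unique())

unique_site_ids = list(site_ids)   # final list of unique site IDs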

First you need to gather the files into a list that you will be getting data out of. There are many ways to do this; assuming you know the directory they are all in, see this answer for many options.

from os import walk, path

mypath = 'Path to CSV dir'  # directory containing your CSV files

f = []
for (dirpath, dirnames, filenames) in walk(mypath):
    f.extend(filenames)
    break

Then, for each file in that list, gather the unique values you need. Without using Pandas, since it doesn't seem like you actually need your information in a dataframe:

import csv

unique_data = {}
for file in f:
    if not file.lower().endswith('.csv'):
        continue
    with open(path.join(mypath, file), newline='') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            # add each row's 'Site Id' value as a dictionary key; duplicates collapse automatically
            unique_data[row['Site Id']] = 0

# unique_data.keys() is now your collection of unique values; if you want a true list:
unique_data_list = list(unique_data.keys())

