I have a list of 2000 files, which I would like to combine:

01-0628-11A-01D-0356-01_hg19
01-0628-11A-01D-0356-01_hg20
01-0628-11A-01D-0356-01_hg21
01-1372-11A-01D-0356-01_hg16
01-1372-11A-01D-0356-01_hg17
...


I have already gotten the files into a glob and used regular expressions to rename them to the common identifier (the six-digit code shown below; see the sketch after the list). However, the number of original files varies per identifier.

01-0628
01-0628
01-0628
01-1372
01-1372
...
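
For reference, here is a minimal sketch of that extraction step (the regex and the helper name common_identifier are illustrative assumptions based on the filenames above):

import re

# Hypothetical sketch: pull the common identifier off the front of a
# filename such as "01-0628-11A-01D-0356-01_hg19". The two-digit/four-digit
# pattern is an assumption based on the examples above.
prefix_re = re.compile(r"^(\d{2}-\d{4})")

def common_identifier(filename):
    match = prefix_re.match(filename)
    return match.group(1) if match else None

print(common_identifier("01-0628-11A-01D-0356-01_hg19"))  # 01-0628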

I was originally going to just use a reader and open each group of files with a common name, but I was wondering if there is a more efficient way to do this.

The final output I would like is the following: all files sharing a common identifier combined into one file per identifier:

01-0628
01-1372
...

All of the files contain similarly formatted data, so simply appending the existing files to a new file would not be an issue (a minimal sketch of that plain append approach follows).
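
A sketch of that plain append approach, in case it helps frame the question (the directory names and the 7-character prefix length are assumptions):

import glob
import os
from collections import defaultdict

in_dir = "csvs"          # assumed input directory
out_dir = "combined"     # assumed output directory
PREFIX_LEN = 7           # len("01-0628")

# Group the file paths by their common-identifier prefix.
groups = defaultdict(list)
for path in glob.glob(os.path.join(in_dir, "*")):
    groups[os.path.basename(path)[:PREFIX_LEN]].append(path)

# Concatenate each group into one output file named after the prefix.
# Note: if each file has a header row, it will be repeated; the
# csv-based answer below handles headers properly.
os.makedirs(out_dir, exist_ok=True)
for prefix, paths in groups.items():
    with open(os.path.join(out_dir, prefix), "w") as out:
        for path in sorted(paths):
            with open(path) as src:
                out.write(src.read())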

victoryNap
  • It looks like you'll need to open each file anyway right? So I don't see how you could find any more efficient way to do that than having to open them one by one. – Benjamin Toueg Jun 30 '13 at 22:16
  • You can't just read through each file and open('newlargefile.csv', 'a') to a new large file, or first file? It'll just keep appending new lines to the bottom of the written file. Example: http://stackoverflow.com/questions/4706499/how-do-you-append-to-file-in-python – Adam Barthelson Jun 30 '13 at 22:31

1 Answer

Assuming these CSVs have similar or identical fields, this code should work. It uses the DictReader and DictWriter classes of the csv module, which convert CSV rows to and from Python dictionaries.

1) It opens and reads the globbed CSV files from in_csv_dir into a (filename, rows) dictionary.

2) It groups the rows into a (prefix, rows) dictionary based on the first prefix_length characters of each filename.

3) It merges the field names of each prefix group and writes one combined CSV per prefix to out_csv_dir.

4) Because the combined field set is built from dictionary keys, your CSVs' original field order is not preserved automatically. You can list a preferred order in field_order: listed fields are sorted to the front, and fields not defined in field_order are appended afterwards rather than causing a failure.

import glob
import os
import sys
from csv import DictReader, DictWriter

in_csv_dir = ".\\csvs"
out_csv_dir = ".\\combined_csvs"
prefix_length = 7  # length of the common identifier, e.g. "01-0628"

# Preferred field order for the combined files; fields not listed here
# are sorted to the end instead of raising an error.
field_order = ["NAME", "TITLE", "COMPANY", "LOCATION"]

def field_check(name):
    return field_order.index(name) if name in field_order else sys.maxsize

# 1) Read every globbed csv into a (filename, rows) dictionary.
csvs = {}
for glob_filename in glob.glob(os.path.join(in_csv_dir, "*.csv")):
    print("%-11s%s" % ("Opening:", glob_filename))
    with open(glob_filename, newline="") as file_obj:
        cur_record = list(DictReader(file_obj))
    if cur_record:
        filename = os.path.splitext(os.path.basename(glob_filename))[0]
        csvs[filename] = cur_record

# 2) Group the rows by filename prefix.
csv_groups = {}
for filename, rows in csvs.items():
    csv_groups.setdefault(filename[:prefix_length], []).extend(rows)

# 3) Write one combined csv per prefix, using the union of the group's
#    field names sorted by field_order.
os.makedirs(out_csv_dir, exist_ok=True)
for key, sub_csvs in csv_groups.items():
    com_keys = sorted({name for row in sub_csvs for name in row}, key=field_check)

    filename = os.path.join(out_csv_dir, "%s.csv" % key)
    print("%-11s%s" % ("Combining:", filename))
    with open(filename, "w", newline="") as file_obj:
        temp_csv = DictWriter(file_obj, com_keys)
        temp_csv.writeheader()
        temp_csv.writerows(sub_csvs)
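
One design note on the combined-fields step: DictWriter fills any field a row is missing with restval (an empty string by default), so rows coming from files with fewer columns are padded rather than raising an error. Extra fields would raise a ValueError, but since com_keys is the union of every field name in the group, that case cannot occur here.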
dilbert