Joining files by corresponding columns in outside table

Question

I have a .csv file that maps sample names to categories. I want to use it to merge (as with cat) the files in a folder whose names match the Sample_Name column, grouping them by Category and naming each merged file after its category.

The to-be merged files in the folder are not .csv; they're a kind of .fasta file.

The .csv looks something like the following (it will have more columns, which can be ignored here):

 Sample_Name     Category
 1               a
 2               a
 3               a
 4               b
 5               b

After merging, the output should be two files: a (samples 1,2,3 merged) and b (samples 4 and 5).

The idea is to make this work for a large number of files and categories.

Thanks for any help!

Wish I did... I'm a beginner in Python and have no idea how to start! – André Soares, Mar 01 '16 at 09:54

mhawke · Accepted Answer · 2016-03-01T10:27:07.357

Assuming that the files are in order in the input CSV file, this is about as simple as you could get:

from operator import itemgetter

fields = itemgetter(0, 1)    # zero-based field numbers of the fields of interest
with open('sample_categories.csv') as csvfile:
    next(csvfile)     # skip over header line
    for line in csvfile:
        filename, category = fields(line.split())
        with open(filename) as infile, open(category, 'a') as outfile:
            outfile.write(infile.read())

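One downside to this is that the output file is reopened for every input file, which might be a problem if there are a lot of files per category. If that turns out to be an actual problem, you could try the following, which holds the output file open for as long as there are input files in that category. Note that it relies on the rows being grouped by category: each output file is opened with 'w', so a category that reappears later in the CSV would overwrite what was written earlier.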

from operator import itemgetter

fields = itemgetter(0, 1)    # zero-based field numbers of the fields of interest
with open('sample_categories.csv') as csvfile:
    next(csvfile)     # skip over header line
    current_category = None
    outfile = None
    for line in csvfile:
        filename, category = fields(line.split())
        if category != current_category:
            # new category: close the previous output file and start the next
            if outfile is not None:
                outfile.close()
            outfile = open(category, 'w')
            current_category = category
        with open(filename) as infile:
            outfile.write(infile.read())
    # close the output file of the last category
    if outfile is not None:
        outfile.close()
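
If the real table is comma-separated and has many extra columns, an alternative to positional `itemgetter` is to pick the columns by header name with `csv.DictReader`. The following is only a sketch under those assumptions: the function name `merge_by_category` and its parameters are made up for illustration, the column names are taken from the question, and all paths are taken relative to the current directory.

```python
import csv


def merge_by_category(csv_path, name_col="Sample_Name", category_col="Category"):
    """Append each sample file to the file named after its category.

    Assumes a comma-separated table with a header row; columns other
    than name_col and category_col are simply ignored.
    """
    with open(csv_path, newline="") as csvfile:
        # DictReader uses the header row as dictionary keys.
        for row in csv.DictReader(csvfile):
            with open(row[name_col]) as infile, \
                    open(row[category_col], "a") as outfile:
                outfile.write(infile.read())
```

Because the output is opened in append mode, as in the first script, running it twice in the same folder would duplicate the content, so old output files should be removed first.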

edited Mar 01 '16 at 10:27

answered Mar 01 '16 at 10:05


is there a way to select which columns to line.split? My input csv actually has a lot more columns – André Soares Mar 01 '16 at 10:09
@AndréSoares: yes. I just limited it to the first 2 columns as per your example. You could use `operator.itemgetter()`... I'll update the answer. – mhawke Mar 01 '16 at 10:25
@AndréSoares: updated to pluck fields from the line with `itemgetter`. – mhawke Mar 01 '16 at 10:28
Your first script works perfectly :) Thanks again! How would I name each output after its category? (e.g., a, b, and so on) – André Soares Mar 01 '16 at 10:39
@AndréSoares: Cool. It is already using the category as the output file name. – mhawke Mar 01 '16 at 10:42
Hello again! So sorry for coming back with new stuff, but would it be simple to get this to work for merging files based on their directories? That is, still taking the Sample_Name into account, but using it to locate folders with that name and merging any ".fna" files inside with all others in folders corresponding to the Category. Thanks in advance! – André Soares Mar 02 '16 at 16:33
@AndréSoares: that wouldn't be too difficult. You can take a look at the [`os.walk`](https://docs.python.org/3/library/os.html#os.walk) function to locate the directories if there is some level of nesting, or possibly just the [`glob`](https://docs.python.org/3/library/glob.html#glob.glob) module might be easier. Try them out and if you have any problems ask a new question and include your code. – mhawke Mar 02 '16 at 21:04
Should this be done with a for loop from within the "with open(filename)" part? – André Soares Mar 02 '16 at 21:10
Something like 'with for xxxx: open (...)'? Not really sure about this, but will get back to you – André Soares Mar 02 '16 at 21:15
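
For the directory variant discussed in these comments, the `glob` suggestion could look roughly like the sketch below. It is not from the answer itself and makes several assumptions: each Sample_Name is a folder directly under `root`, the table is whitespace-separated with the sample and category in the first two columns, and the names `merge_fna_by_category`, `root` and `out_dir` are hypothetical. The loop over the matching files goes inside the `with` that opens the output file.

```python
import glob
import os


def merge_fna_by_category(table_path, root=".", out_dir="."):
    """Append every .fna file in each sample folder to its category file."""
    with open(table_path) as table:
        next(table)  # skip over header line
        for line in table:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            sample, category = fields[0], fields[1]
            # All .fna files directly inside the folder named after the sample.
            pattern = os.path.join(root, sample, "*.fna")
            with open(os.path.join(out_dir, category), "a") as outfile:
                for path in sorted(glob.glob(pattern)):
                    with open(path) as infile:
                        outfile.write(infile.read())
```

If the `.fna` files are nested more deeply, `os.walk` would be the tool to reach for instead of a single `glob` pattern.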

score 0 · Answer 2 · edited May 23 '17 at 12:23

I would build a dictionary with keys of categories and values of lists of corresponding sample names.

d = {'a':['1','2','3'], 'b':['4','5']}

You can achieve this in a straightforward way by reading the csv file and building the dictionary line by line, i.e.

d = {}
with open('myfile.csv') as csvfile:
    next(csvfile)                      # skip the header line
    for line in csvfile:
        samp, cat = line.split()[:2]   # first two whitespace-separated fields
        try:
            d[cat].append(samp)
        except KeyError:           # if there is no entry for cat, we will get a KeyError
            d[cat] = [samp]

For a more sophisticated way of doing this, have a look at the collections module (for example, collections.defaultdict).
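
As a rough illustration of the `collections` route, `defaultdict(list)` creates the empty list on first access, so the try/except is no longer needed. The helper name `group_samples` is made up for this sketch, and it assumes the header line has already been skipped and that the sample and category are the first two whitespace-separated fields.

```python
from collections import defaultdict


def group_samples(lines):
    """Map each category to the list of its sample names."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        samp, cat = fields[0], fields[1]
        groups[cat].append(samp)  # a missing key starts as an empty list
    return groups


rows = ["1 a", "2 a", "3 a", "4 b", "5 b"]
print(dict(group_samples(rows)))
# {'a': ['1', '2', '3'], 'b': ['4', '5']}
```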

Once this database is ready, you can create your new files from category to category:

for cat in d:
    with open(cat, 'w') as outfile:
        for sample in d[cat]:
            # copy sample file content to outfile
            with open(sample) as infile:
                outfile.write(infile.read())

Copying one file's content to the other can be done in several ways, see this thread.
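
One of those ways, sketched here as an assumption rather than as the method the linked thread recommends, is `shutil.copyfileobj`, which streams the data in chunks instead of reading each file into memory at once. The function name `concatenate` is hypothetical.

```python
import shutil


def concatenate(sample_paths, out_path):
    """Write the contents of each sample file to out_path, in order."""
    with open(out_path, "wb") as outfile:
        for path in sample_paths:
            with open(path, "rb") as infile:
                # Copies in chunks, so large files are not loaded whole.
                shutil.copyfileobj(infile, outfile)
```

With the dictionary built above, each category could then be written with something like `concatenate(d[cat], cat)`.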
