Even though this might sound like a repeated question, I have not found a solution. I have a large .csv file that looks like:
prot_hit_num,prot_acc,prot_desc,pep_res_before,pep_seq,pep_res_after,ident,country
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPV,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPVL,D,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],L,SSISGAGGGGLA,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],D,NYDNSAGKW,W,F40,EB
....
The aim is to slice this .csv file into multiple smaller .csv files according to the last two columns ('ident' and 'country').
I have used the following code, taken from an answer to a previous post:
import csv
import itertools as it
import operator as op

csv_contents = []
with open(outfile_path4, 'rb') as fin:
    dict_reader = csv.DictReader(fin)    # default delimiter is comma
    fieldnames = dict_reader.fieldnames  # save for writing
    for line in dict_reader:             # read in all of your data
        csv_contents.append(line)        # gather data into a list (of dicts)
# input to itertools.groupby must be sorted by the grouping value
sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('prot_desc','ident','country'))
for groupkey, groupdata in it.groupby(sorted_csv_contents,
                                      key=op.itemgetter('prot_desc','ident','country')):
    # one output file per (prot_desc, ident, country) group
    with open(outfile_path5+'slice_{:s}.csv'.format(groupkey), 'wb') as fou:
        dict_writer = csv.DictWriter(fou, fieldnames=fieldnames)
        dict_writer.writerows(groupdata)
However, I need my output .csv files to contain only the 'pep_seq' column. The desired output looks like:
pep_seq
ANSPV
ANSPVL
SSISGAGGGGLA
NYDNSAGKW
What can I do?
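One direction that might work, shown here only as a minimal sketch: `csv.DictWriter` accepts `extrasaction='ignore'`, which should let the writer keep only the fields listed in `fieldnames` and silently drop the rest. The sketch assumes Python 3 (text-mode files opened with `newline=''`, unlike the `'rb'`/`'wb'` modes in the code above). The function name `split_pep_seq`, its parameters, and the `slice_<ident>_<country>.csv` naming scheme are hypothetical, and it groups on 'ident' and 'country' only, as described in the aim.

```python
import csv
import itertools as it
import os


def split_pep_seq(in_path, out_dir, key_cols=('ident', 'country'),
                  keep_cols=('pep_seq',)):
    """Write one CSV per group of key_cols, keeping only keep_cols.

    Hypothetical helper: the name, parameters, and file naming are
    assumptions made for illustration.
    """
    def get_key(row):
        # Always return a tuple, even for a single key column.
        return tuple(row[col] for col in key_cols)

    with open(in_path, newline='') as fin:
        rows = list(csv.DictReader(fin))  # read everything into memory

    rows.sort(key=get_key)  # itertools.groupby needs sorted input

    written = []
    for groupkey, groupdata in it.groupby(rows, key=get_key):
        name = 'slice_{}.csv'.format('_'.join(groupkey))
        out_path = os.path.join(out_dir, name)
        with open(out_path, 'w', newline='') as fou:
            # extrasaction='ignore' drops every column not in fieldnames
            writer = csv.DictWriter(fou, fieldnames=list(keep_cols),
                                    extrasaction='ignore')
            writer.writeheader()  # emits the 'pep_seq' header line
            writer.writerows(groupdata)
        written.append(out_path)
    return written
```

Calling `split_pep_seq(outfile_path4, outfile_path5)` would then be the equivalent of the loop above. If 'prot_desc' is also needed as a grouping column, it could be passed in `key_cols`, although its values would probably need cleaning first, since they end up in the file name.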