I'm very new to Python, and I haven't been able to solve this issue or find the right approach. Any help, or a link to a helpful tutorial, is highly appreciated, as I have to do this kind of thing from time to time.
I have a CSV file that I need to reformat / modify a bit.
For each gene, I need to store the number of samples it appears in.
input file:
AHCTF1: Sample1, Sample2, Sample4
AHCTF1: Sample2, Sample7, Sample12
AHCTF1: Sample5, Sample6, Sample7
result:
AHCTF1 in 7 samples (Sample1, Sample2, Sample4, Sample5, Sample6, Sample7, Sample12)
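To make the target format concrete, here is a minimal sketch of how the result line could be built from a collection of sample names. The helper names `natural_key` and `summary_line` are hypothetical, and the numeric ordering (Sample7 before Sample12) is an assumption read off the result line above. `print()` is called with a single argument so it should behave the same on Python 2 and 3.

```python
import re


def natural_key(name):
    """Sort key that compares embedded numbers numerically (Sample7 < Sample12)."""
    # re.split with a capture group alternates text and digit runs,
    # so text always sits at even positions and numbers at odd ones.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def summary_line(gene, samples):
    """Deduplicate the samples and format one result line (hypothetical helper)."""
    ordered = sorted(set(samples), key=natural_key)
    return "{0} in {1} samples ({2})".format(gene, len(ordered), ", ".join(ordered))


# The nine hits from the three example input lines, duplicates included.
line = summary_line("AHCTF1", [
    "Sample1", "Sample2", "Sample4",
    "Sample2", "Sample7", "Sample12",
    "Sample5", "Sample6", "Sample7",
])
print(line)
```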
code:
import csv

f = open("/CSV-sorted.csv")
gene_prev = ""
hit_list = []
csv_f = csv.reader(f)
for lines in csv_f:
    #time.sleep(0.1)
    gene = lines[0]                 # gene name is in the first column
    sample = lines[11].split(",")   # comma-separated sample names
    repeat = lines[8]
    for samples in sample:
        hit_list.append(samples)
    if gene == gene_prev:
        for samples in sample:
            hit_list.append(samples)
        print gene
        print hit_list
        print set(hit_list)
        print "samples:", len(set(hit_list))
    hit_list = []
    gene_prev = gene
So, in a nutshell, I'd like to combine the hits for every gene and make a set from them to remove duplicates.
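One way to combine the hits, sketched here under the assumption that the rows are already sorted by gene (as the file name suggests), is `itertools.groupby`, which hands back each run of consecutive rows sharing a gene. The function name `merged_hits`, the toy two-column rows, and the second gene `GENE2` are hypothetical and only there for illustration; the column defaults mirror the indices 0 and 11 used in the code above.

```python
from itertools import groupby
from operator import itemgetter


def merged_hits(rows, gene_col=0, sample_col=11):
    """Yield (gene, set_of_samples) for each run of consecutive rows with the same gene."""
    for gene, group in groupby(rows, key=itemgetter(gene_col)):
        samples = set()
        for row in group:
            # strip() drops the space after each comma; empty names are skipped
            names = (name.strip() for name in row[sample_col].split(","))
            samples.update(name for name in names if name)
        yield gene, samples


# Toy rows with only two columns (gene, samples), hence sample_col=1.
rows = [
    ["AHCTF1", "Sample1, Sample2, Sample4"],
    ["AHCTF1", "Sample2, Sample7, Sample12"],
    ["AHCTF1", "Sample5, Sample6, Sample7"],
    ["GENE2", "Sample3"],  # hypothetical second gene
]
result = list(merged_hits(rows, sample_col=1))
for gene, samples in result:
    print("{0}: {1} samples".format(gene, len(samples)))
```

If the rows were not sorted, the same gene could show up in several separate groups, so this only fits sorted input.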
Maybe a dictionary would be the way to do it: save the gene as a key and add the samples as values?
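A minimal sketch of that dictionary idea, using `collections.defaultdict(set)` so each gene maps to a set that drops duplicates on its own. The function name `samples_per_gene` and the toy two-column rows are hypothetical; the column defaults (0 for the gene, 11 for the samples) are taken from the code above. Unlike the `gene_prev` comparison, this does not depend on the file being sorted.

```python
from collections import defaultdict


def samples_per_gene(rows, gene_col=0, sample_col=11):
    """Map each gene to the set of samples it appears in.

    rows can be any iterable of lists, for example a csv.reader object.
    """
    hits = defaultdict(set)
    for row in rows:
        if not row:
            continue  # skip blank lines
        # strip() drops the space after each comma; empty names are skipped
        names = (name.strip() for name in row[sample_col].split(","))
        hits[row[gene_col]].update(name for name in names if name)
    return hits


# Toy rows with only two columns (gene, samples), hence sample_col=1.
rows = [
    ["AHCTF1", "Sample1, Sample2, Sample4"],
    ["AHCTF1", "Sample2, Sample7, Sample12"],
    ["AHCTF1", "Sample5, Sample6, Sample7"],
]
hits = samples_per_gene(rows, sample_col=1)
for gene in sorted(hits):
    print("{0} in {1} samples".format(gene, len(hits[gene])))
```

With the real file, the same function could presumably be fed `csv.reader(f)` directly, leaving the column defaults as they are.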
I found this similar, possibly useful question: "How can I combine dictionaries with the same keys in python?"