I'm very new to Python, but I still can't understand why I haven't been able to solve this issue or find the right approach. Any help, or a link to a helpful tutorial, is highly appreciated, as I have to do this kind of thing from time to time.

I have a CSV file that I need to reformat / modify a bit.

I need to store the number of samples that each gene appears in.

input file:

AHCTF1: Sample1, Sample2, Sample4
AHCTF1: Sample2, Sample7, Sample12
AHCTF1: Sample5, Sample6, Sample7

result:

 AHCTF1 in 7 samples (Sample1, Sample2, Sample4, Sample5, Sample6, Sample7, Sample12)

code:

f = open("/CSV-sorted.csv")
gene_prev = ""

hit_list = []

csv_f = csv.reader(f)

for lines in csv_f:

    #time.sleep(0.1)
    gene = lines[0]
    sample = lines[11].split(",")
    repeat = lines[8]

    for samples in sample:
        hit_list.append(samples)

    if gene == gene_prev:

        for samples in sample:

            hit_list.append(samples)

        print gene
        print hit_list
        print set(hit_list)
        print "samples:", len(set(hit_list))


    hit_list = []

    gene_prev = gene

So, in a nutshell, I'd like to combine the hits for every gene and make a set from them to remove duplicates.

Maybe a dictionary would be the way to do it: save the gene as a key and add the samples as values?

I found this similar, useful question: How can I combine dictionaries with the same keys in python?

jester112358

1 Answer

The standard way to remove duplicates is to convert to a set.

However, I think there are some problems with the way you're reading the file. First, it isn't a CSV file (you have a colon between the first two fields). Second, what is

gene = lines[0]
sample = lines[11].split(",")
repeat = lines[8]

supposed to do?

If I were writing this, I would replace the ":" with another ",". With that modification, and using a dictionary of sets, your code would look something like:

# Read in csv file and convert to list of list of entries. Use with so that 
# the file is automatically closed when we are done with it
csvlines = []
with open("CSV-sorted.csv") as f:
    for line in f:
        # Use strip() to clean up trailing whitespace, use split() to split
        # on commas.
        a = [entry.strip() for entry in line.split(',')]
        csvlines.append(a)

# I'll print it here so you can see what it looks like:
print(csvlines)



# Next up: converting our list of lists to a dict of sets.

# Create empty dict
sample_dict = {}

# Fill in the dict
for line in csvlines:
    gene = line[0] # gene is first entry
    samples = set(line[1:]) # rest of the entries are samples

    # If this gene is in the dict already then join the two sets of samples
    if gene in sample_dict:
        sample_dict[gene] = sample_dict[gene].union(samples)

    # otherwise just put it in
    else:
        sample_dict[gene] = samples


# Now you can print the dictionary:
print(sample_dict)

The output is:

[['AHCTF1', 'Sample1', 'Sample2', 'Sample4'], ['AHCTF1', 'Sample2', 'Sample7', 'Sample12'], ['AHCTF1', 'Sample5', 'Sample6', 'Sample7']]
{'AHCTF1': {'Sample12', 'Sample1', 'Sample2', 'Sample5', 'Sample4', 'Sample7', 'Sample6'}}

where the second line is your dictionary. (The order of elements inside a set is arbitrary, so yours may be printed in a different order.)
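To get from a dictionary of sets to the summary line requested in the question, something like the following might work. It is only a sketch: the input lines are copied from the question, while the helper names `sample_number` and `summarize` are made up for illustration, and sorting by the trailing number is an assumption about how the samples should be ordered.

```python
import re

# Input lines copied from the question ("GENE: sample, sample, ...").
lines = [
    "AHCTF1: Sample1, Sample2, Sample4",
    "AHCTF1: Sample2, Sample7, Sample12",
    "AHCTF1: Sample5, Sample6, Sample7",
]

# Build the dict of sets: gene -> set of sample names.
sample_dict = {}
for line in lines:
    gene, _, rest = line.partition(":")
    samples = {s.strip() for s in rest.split(",") if s.strip()}
    # setdefault() creates an empty set the first time a gene is seen,
    # update() then merges the new samples in (duplicates are dropped).
    sample_dict.setdefault(gene.strip(), set()).update(samples)


def sample_number(name):
    """Hypothetical sort key: the trailing integer, so Sample12 sorts after Sample7."""
    match = re.search(r"(\d+)$", name)
    return int(match.group(1)) if match else 0


def summarize(gene, samples):
    """Format one gene as 'GENE in N samples (a, b, c)'."""
    ordered = sorted(samples, key=sample_number)
    return "{} in {} samples ({})".format(gene, len(ordered), ", ".join(ordered))


for gene, samples in sample_dict.items():
    print(summarize(gene, samples))
```

A plain `sorted(samples)` would put Sample12 before Sample2 because strings compare character by character, which is why the sketch sorts on the numeric suffix instead.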

dshepherd
  • Hi, and thanks. This looks really nice! Sorry that my example was poor, but there is nothing wrong with the input file. In the input file the gene is in the first column, and the samples that have it (comma-separated, hence the splitting) are in the 12th column. repeat isn't doing anything currently, but I left it in since I need it later. – jester112358 Aug 26 '14 at 06:11
  • Ah, OK. Do you understand how to do it now, though? (Basically, just replace the first half of my code with the first half of yours.) – dshepherd Aug 26 '14 at 09:47
  • Late reply :) In your code you remove duplicates before generating the dictionary. In my case that wasn't possible (there weren't any). Duplicates came into play in the sample_dict union phase. Are you forced to loop over the samples there, or do dictionaries have a set()-like option? (Asking for future reference :) – jester112358 Feb 26 '15 at 14:27
  • Duplicates should be handled automatically by the union operation. (`sample_dict[gene]` and `samples` are both sets and all operations on sets will remove duplicates automatically. In fact it's impossible to have duplicate entries in a set due to the way the data is stored.) – dshepherd Feb 26 '15 at 17:49
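As a rough illustration of the last two comments, the layout described by the asker (gene in the first column, a comma-separated sample list in the 12th) could be combined with `collections.defaultdict(set)`, which gives each new key an empty set so that no explicit "is the gene already in the dict" check is needed. The file contents below are invented: the "-" filler columns and the quoting of the sample field are assumptions, since the real file is not shown.

```python
import csv
import io
from collections import defaultdict


def make_row(gene, samples):
    """Build one hypothetical 12-column CSV line: gene, 10 fillers, quoted sample list."""
    return '{},{},"{}"\n'.format(gene, ",".join(["-"] * 10), ",".join(samples))


# Stand-in for open("CSV-sorted.csv"); the data mirrors the example in the question.
fake_file = io.StringIO(
    make_row("AHCTF1", ["Sample1", "Sample2", "Sample4"])
    + make_row("AHCTF1", ["Sample2", "Sample7", "Sample12"])
    + make_row("AHCTF1", ["Sample5", "Sample6", "Sample7"])
)

# Every missing key starts out as an empty set.
samples_by_gene = defaultdict(set)

for row in csv.reader(fake_file):
    gene = row[0]                                       # first column
    samples = [s.strip() for s in row[11].split(",")]   # 12th column
    # update() is an in-place union, so repeated samples are stored only once.
    samples_by_gene[gene].update(samples)

for gene, samples in samples_by_gene.items():
    print(gene, "samples:", len(samples))
```

No inner loop over the samples is required here: `set.update()` accepts any iterable and discards anything the set already contains.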