Python lists, csv, duplication removal

Question

I'm very new to Python, but I still can't understand why I haven't been able to solve this issue or find the right approach. Any help, or a link to a helpful tutorial, is highly appreciated, as I have to do this kind of thing from time to time.

I have a CSV file that I need to reformat / modify a bit.

I need to store the number of samples that each gene appears in.

input file:

AHCTF1: Sample1, Sample2, Sample4
AHCTF1: Sample2, Sample7, Sample12
AHCTF1: Sample5, Sample6, Sample7

result:

 AHCTF1 in 7 samples (Sample1, Sample2, Sample4, Sample5, Sample6, Sample7, Sample12)

code:

f = open("/CSV-sorted.csv")
gene_prev = ""

hit_list = []

csv_f = csv.reader(f)

for lines in csv_f:

    #time.sleep(0.1)
    gene = lines[0]
    sample = lines[11].split(",")
    repeat = lines[8]

    for samples in sample:
        hit_list.append(samples)

    if gene == gene_prev:

        for samples in sample:

            hit_list.append(samples)

        print gene
        print hit_list
        print set(hit_list)
        print "samples:", len(set(hit_list))


    hit_list = []

    gene_prev = gene

So in a nutshell, I'd like to combine the hits for every gene and make a set from them to remove duplicates.

Maybe a dictionary would be the way to do it: save the gene as a key and add the samples as values?

Found this - Similar / useful: How can I combine dictionaries with the same keys in python?

I'd like the code to write just one line for each gene and the samples that they share. — jester112358, Aug 25 '14 at 13:09
Your input file does not look like the CSV you're talking about. Could you provide a better sample with at least 2 different gene identifiers? — Vladimir, Aug 25 '14 at 13:14
It removes the dups in every iteration. Ideally I'd like to add samples to the list and do the dup removal after the last matching gene. I really appreciate all the attention / help so far. Thanks. — jester112358, Aug 25 '14 at 13:29
Does your result have to be in the order specified (there'd have to be some criterion for how that'd happen), or does it not matter as long as you have the unique samples? — Jon Clements, Aug 29 '14 at 10:08

dshepherd · Accepted Answer · 2014-08-25T13:37:32.317

The standard way to remove duplicates is to convert to a set.
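As a minimal sketch of that idea (the `hits` list is made-up illustration data taken from the first two rows of the question's example): converting to a set drops repeats but loses the original order, while `dict.fromkeys` is one way to keep first-seen order on Python 3.7+.

```python
# Hypothetical data: samples collected from two rows of the same gene.
hits = ["Sample1", "Sample2", "Sample4", "Sample2", "Sample7", "Sample12"]

# A set keeps each value once; its iteration order is arbitrary.
unique = set(hits)
print(len(unique))  # 5

# If first-seen order matters, dict keys are unique and keep insertion order.
in_order = list(dict.fromkeys(hits))
print(in_order)  # ['Sample1', 'Sample2', 'Sample4', 'Sample7', 'Sample12']
```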

However, I think there's some stuff wrong with the way you're reading the file. First, the sample you posted isn't a CSV file (there is a colon between the first two fields). Second, what is

gene = lines[0]
sample = lines[11].split(",")
repeat = lines[8]

supposed to do?

If I were writing this, I would replace the ":" with another ",". With that modification, and using a dictionary of sets, your code would look something like:

# Read in csv file and convert to list of list of entries. Use with so that 
# the file is automatically closed when we are done with it
csvlines = []
with open("CSV-sorted.csv") as f:
    for line in f:
        # Use strip() to clean up trailing whitespace, use split() to split
        # on commas.
        a = [entry.strip() for entry in line.split(',')]
        csvlines.append(a)

# I'll print it here so you can see what it looks like:
print(csvlines)



# Next up: converting our list of lists to a dict of sets.

# Create empty dict
sample_dict = {}

# Fill in the dict
for line in csvlines:
    gene = line[0] # gene is first entry
    samples = set(line[1:]) # rest of the entries are samples

    # If this gene is in the dict already then join the two sets of samples
    if gene in sample_dict:
        sample_dict[gene] = sample_dict[gene].union(samples)

    # otherwise just put it in
    else:
        sample_dict[gene] = samples


# Now you can print the dictionary:
print(sample_dict)

The output is:

[['AHCTF1', 'Sample1', 'Sample2', 'Sample4'], ['AHCTF1', 'Sample2', 'Sample7', 'Sample12'], ['AHCTF1', 'Sample5', 'Sample6', 'Sample7']]
{'AHCTF1': {'Sample12', 'Sample1', 'Sample2', 'Sample5', 'Sample4', 'Sample7', 'Sample6'}}

where the second line is your dictionary. (Sets are unordered, so the samples may print in a different order on your machine.)
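If the real file is a proper CSV laid out as the question's code implies (gene in column 0, a comma-separated sample string in column 11), the same dict-of-sets idea might be sketched as below. This is only a sketch under that assumption: the function names, the `natural_key` sort helper, and the in-memory demo rows are all hypothetical, and `defaultdict(set)` is swapped in for the explicit `if gene in sample_dict` check.

```python
import re
from collections import defaultdict


def collect_samples(rows, gene_col=0, sample_col=11):
    """Map each gene to the set of samples seen in any of its rows."""
    samples_by_gene = defaultdict(set)
    for row in rows:
        if len(row) <= max(gene_col, sample_col):
            continue  # skip blank or short rows
        names = (name.strip() for name in row[sample_col].split(","))
        # update() adds to the existing set; repeats are dropped automatically
        samples_by_gene[row[gene_col]].update(n for n in names if n)
    return samples_by_gene


def natural_key(name):
    """Sort key that puts 'Sample2' before 'Sample12'."""
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", name)]


def format_gene(gene, samples):
    """One summary line per gene, in the format the question asks for."""
    ordered = sorted(samples, key=natural_key)
    return "{} in {} samples ({})".format(gene, len(ordered), ", ".join(ordered))


# Demo rows standing in for csv.reader output; columns 1-10 are filler.
# With a real file: with open("CSV-sorted.csv", newline="") as f:
#                       result = collect_samples(csv.reader(f))
filler = [""] * 10
demo_rows = [
    ["AHCTF1"] + filler + ["Sample1, Sample2, Sample4"],
    ["AHCTF1"] + filler + ["Sample2, Sample7, Sample12"],
    ["AHCTF1"] + filler + ["Sample5, Sample6, Sample7"],
]

result = collect_samples(demo_rows)
for gene in sorted(result):
    print(format_gene(gene, result[gene]))
# AHCTF1 in 7 samples (Sample1, Sample2, Sample4, Sample5, Sample6, Sample7, Sample12)
```

Because every row is merged into the dictionary regardless of position, this does not rely on the file being sorted by gene.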

Hi & thanks. This looks really nice! Sorry my example was poor, but there is nothing wrong with the input file. In the input file the gene is in the first column, and the samples that have it (comma-separated, hence the split) are in the 12th column. Repeat is not doing anything currently, but I left it in since I need it later. — jester112358, Aug 26 '14 at 06:11
Ah ok, do you understand how to do it now though? (basically just replace the first half of my code with the first half of yours) — dshepherd, Aug 26 '14 at 09:47
Late reply :) In your code you remove duplicates before generating the dictionary. In my case that wasn't possible (there weren't any). Duplicates came into play when you did the sample_dict.union phase. Are you forced to do a for-loop kind of thing for samples there, or do dictionaries have a set()-like option? (asking for future purposes :) — jester112358, Feb 26 '15 at 14:27
Duplicates should be handled automatically by the union operation. (`sample_dict[gene]` and `samples` are both sets and all operations on sets will remove duplicates automatically. In fact it's impossible to have duplicate entries in a set due to the way the data is stored.) — dshepherd, Feb 26 '15 at 17:49
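
To make the point about union concrete, here is a small sketch (the variable names and the `rows` data are made up for illustration). It also shows one possible answer to the "set()-like option" question: a plain dict has no such mode, but `dict.setdefault` with a set as the default value gets close.

```python
first = {"Sample1", "Sample2", "Sample4"}
second = {"Sample2", "Sample7", "Sample12"}

# union() returns a new set; "Sample2" appears in both inputs but only once here.
merged = first.union(second)
print(len(merged))  # 5

# Same merging inside a dict, without an explicit if/else:
# setdefault() returns the gene's existing set, or stores and returns a new one.
sample_dict = {}
rows = [
    ("AHCTF1", ["Sample1", "Sample2", "Sample4"]),
    ("AHCTF1", ["Sample2", "Sample7", "Sample12"]),
    ("AHCTF1", ["Sample5", "Sample6", "Sample7"]),
]
for gene, samples in rows:
    sample_dict.setdefault(gene, set()).update(samples)

print(len(sample_dict["AHCTF1"]))  # 7
```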
