
For example, I have two files:

file1:

id chrom start     end       strand
g1  11  98566330    98566433    -
g2  11  98566295    98566433    -
g3  11  98566581    98566836    -

file2

id chrom   start   end      strand  gene_id            gene_name somecol1...somecol10
g1  11  98566330    98566433    -   ENSMUSG00000017210  Med24
g2  11  98566295    98566433    -   ENSMUSG00000017210  Med24
g3  11  98566581    98566836    -   ENSMUSG00000017210  Med24

desired output

id chrom start     end       strand gene_id gene_name somecol1...somecol10
g1 11  98566330    98566433    -   ENSMUSG00000017210 Med24
g2 11  98566295    98566433    -   ENSMUSG00000017210 Med24
g3 11  98566581    98566836    -   ENSMUSG00000017210 Med24

What I am basically trying to do is match the id column from both files, and if there is a match, print/write some columns from file1 and file2 to a new file (my current code):

    with open('~/outfile.txt', 'w') as w:
        for id1 in c1:  # c1 is a list where I append each line from file1
            for id2 in d1:  # d1 is a list where I append each line from file2
                if id1[0] in id2[0]:  # is this condition faster? (condition1)
                # if id1[0] == id2[0]:  # or is this condition faster? (condition2)
                    out = ('\t'.join(id2[0:6]), id1[1], id1[2], id2[9], id2[10])
                    w.write('\t'.join(out) + '\n')

The issue is that this code works as desired with condition2, but it is very slow, maybe because I am matching each line (`id1[0] == id2[0]`) between both lists c1 and d1, and also because file2 has ~500,000 rows.

Currently I could come up with only these two conditions, and I am trying to learn which one might make the code faster.

Is there better logic to use that will increase the speed?

EDIT:

I need to match col0 (id) of file1 with col0 (id) of file2; if they match, slice columns 0:6, 9 and 10 from file2 and columns 1 and 2 from file1 (as in the `out = (...)` line above).

desired output

id(file2) chrom(file2) start(file2)     end(file2)       strand(file2) gene_id(file2) gene_name(file2) somecol1(file1)...somecol10(file1)
g1 11  98566330    98566433    -   ENSMUSG00000017210 Med24
g2 11  98566295    98566433    -   ENSMUSG00000017210 Med24
g3 11  98566581    98566836    -   ENSMUSG00000017210 Med24
  • I would suggest an AWK solution, but the output is unclear – RomanPerekhrest May 20 '17 at 16:38
  • create `set`s for faster lookup – Jean-François Fabre May 20 '17 at 16:38
  • @Jean-FrançoisFabre I read about `set` here http://stackoverflow.com/questions/1388818/how-can-i-compare-two-lists-in-python-and-return-matches, but it also says it only works for equal-sized lists (which order-significance implies), and my lists are not equal-sized – novicebioinforesearcher May 20 '17 at 16:40
  • @RomanPerekhrest The output is a new file where I have sliced certain columns from file1 and file2. – novicebioinforesearcher May 20 '17 at 16:42
  • Please show how you parsed your file to see if it has any bearing. That is, show how you compute `c1` and `d1`. – EyuelDK May 20 '17 at 16:49
  • @EyuelDK Sorry: `with open ('file1.txt') as f: for line in f: lines = line.strip().split('\t') # print lines[0] c1.append(lines)`; did the same thing with d1 – novicebioinforesearcher May 20 '17 at 16:51
  • Is it expected that the ids are always in order? – EyuelDK May 20 '17 at 16:54
  • No, it is not; this was just for representation – novicebioinforesearcher May 20 '17 at 16:54
  • Forgot to add: I skip reading the header with `with open ('file1') as f: f.readline() for line in fh:` – novicebioinforesearcher May 20 '17 at 17:00
  • Are the ids expected to be unique? – EyuelDK May 20 '17 at 17:00
  • @EyuelDK Yes, they are unique – novicebioinforesearcher May 20 '17 at 17:01
  • I'm surprised no one has answered this yet. OK, I'm back; let me try. I just need you to clarify what the difference is between the desired output and file2. They seem to be identical. – EyuelDK May 20 '17 at 17:27
  • @EyuelDK Thank you, maybe I have to frame it in a better way. I will try to explain in words: I match both files, file1 (which has 5 columns) and file2 (which has 16 columns), and when I print I just take some columns from file1 and file2. – novicebioinforesearcher May 20 '17 at 17:33
  • Store the smaller file in memory using a `dict` where the matching ID is the key and the data record is the value. Then just iterate over the bigger file, matching up the records and printing whatever you need. (A sketch of this idea follows these comments.) – FMc May 20 '17 at 17:35
  • @novicebioinforesearcher But file2 seems to have all the columns that file1 has. So what columns are you trying to extract from file1 that are not present in file2? – EyuelDK May 20 '17 at 17:38
  • If you're dealing with large data tables and want this to be reasonably efficient, you will probably want to invest some time in learning how to use Pandas. It should be possible, with a bit of elbow grease, to read your data files into Pandas DataFrames and use its more advanced indexing methods. (A Pandas sketch also follows these comments.) – Iguananaut May 20 '17 at 18:14
  • Your file examples don't help. I understand that you want to display a subset of the lines from file2 if the first 3 fields are in file1. You should add some "junk" rows to file2 so we understand that it's been filtered; currently, the expected result is identical to file2. – Jean-François Fabre May 20 '17 at 21:23
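
A minimal sketch of the `dict` idea suggested above, assuming tab-separated files with a single header row as in the samples (the file names and the output column order, which mirrors the question's `out = (...)` line, are assumptions):

    import csv

    # Index the smaller file (file1) by its id column for O(1) average lookups.
    with open('file1.txt') as f1:
        reader = csv.reader(f1, delimiter='\t')
        next(reader)  # skip the header row
        file1_by_id = {row[0]: row for row in reader}

    # Stream the bigger file (file2) once, joining on the id column.
    with open('file2.txt') as f2, open('outfile.txt', 'w', newline='') as out:
        reader = csv.reader(f2, delimiter='\t')
        writer = csv.writer(out, delimiter='\t')
        next(reader)  # skip the header row
        for row in reader:
            match = file1_by_id.get(row[0])  # replaces the inner loop over c1
            if match is not None:
                # columns 0:6, 9 and 10 from file2 and columns 1 and 2 from file1,
                # in the same order as the question's out tuple
                writer.writerow(row[0:6] + [match[1], match[2], row[9], row[10]])

This reads the smaller file into memory once and then makes a single pass over the bigger file, so the work grows with the total number of rows rather than with their product.

A similarly hedged sketch of the Pandas route mentioned in the last comment (again assuming tab-separated input with the headers shown in the samples; the choice of which columns to keep is left to the reader):

    import pandas as pd

    df1 = pd.read_csv('file1.txt', sep='\t')
    df2 = pd.read_csv('file2.txt', sep='\t')

    # Inner join on the shared id column; columns present in both files
    # (chrom, start, end, strand) get the given suffixes.
    merged = pd.merge(df2, df1, on='id', suffixes=('_file2', '_file1'))
    merged.to_csv('outfile.txt', sep='\t', index=False)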

2 Answers


If I understand correctly, you want to keep rows from the second file only if the id field is in the first file.

I would use the csv module all the way, which is cleaner.

First, I would build a set of id fields for fast lookup from the file1 contents (the one with only 5 fields per row).

Then I would read the second file, and write the rows to a third file only if the row id is contained in the set. You'll benefit from the speed of the set lookup:

    import csv

    with open("file1.txt") as file1:
        cr = csv.reader(file1, delimiter="\t")
        next(cr)  # skip title
        subset = {row[0] for row in cr}  # build a lookup set of ids in a set comprehension

    with open("file2.txt") as file2, open("result.txt", "w", newline="") as file3:  # python 2: open("result.txt","wb")
        cr = csv.reader(file2, delimiter="\t")
        cw = csv.writer(file3, delimiter="\t")
        cw.writerow(next(cr))  # write title
        cw.writerows(row for row in cr if row[0] in subset)
Jean-François Fabre
  • No, I do not want to match all 5 fields, only field 1 of both files. In this line, `out = ('\t'.join(id2[0:6]),id1[1],id1[2],id2[9],id2[10])`, id2 is the columns from file2 and id1 is the columns from file1; I need only those columns from both files. Sorry, I will improve in asking questions; now I have understood the points about asking questions on SO – novicebioinforesearcher May 21 '17 at 01:47
  • It was just a slight modification to my code. See my edit. – Jean-François Fabre May 21 '17 at 07:15

You need to know where time is being spent to know how best to fix it. At a guess, it's going to be reading and writing files, and everything else will amount to ... nearly nothing.

Optimization is good if you can focus on the problem areas - that way everything else can remain clean, readable Python. So, I'd start by profiling your code. cProfile is a good place to start your investigation and, possibly, you may need to create some functions to divide your work up so that you can see what is taking the time.
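
For illustration, a minimal profiling sketch (the function name `match_files` is just a placeholder for however you wrap the reading and matching code from the question):

    import cProfile
    import pstats

    def match_files():
        # placeholder: put the file-reading and id-matching logic here
        ...

    # Profile the call, save the stats, and show the ten most expensive calls.
    cProfile.run('match_files()', 'match.prof')
    pstats.Stats('match.prof').sort_stats('cumulative').print_stats(10)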

I recently did the edX ITMO competitive programming course, where speed was very important. Python I/O was a critical barrier, so both reading and writing were optimized. Writes were done in large blocks where possible, so you may need to aggregate data before writing (a rough sketch of that follows the code below). Python's memory-mapped reads were used to speed up reading. To give you an example of the relative ease of using mmap, the commented-out code at the top performs the mmap equivalent of the uncommented readlines below:

    # mmap variant (requires `import mmap`; `f` must already be an open file object):
    # with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    #     end = mm.size()
    #     while mm.tell() < end:
    #         for l in mm.readline().split():
    #             print(';'.join(l.decode('ascii').split(',')))
    #  with open("out1", 'w') as o:
    with open(filename, 'r') as f:
        for l in f.readlines():
            #  ll.append('.'.join(l.split(',')))
            #  o.writelines(l[13:])
            #  print(l[13:], end='')
            print('.'.join(l.split(',')), end='')
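
On the write side, the same "aggregate before writing" idea can look roughly like this (the block size of 10,000 lines and the file names are arbitrary placeholders):

    out_lines = []
    with open('infile.txt') as f, open('outfile.txt', 'w') as o:
        for line in f:
            out_lines.append('.'.join(line.split(',')))
            if len(out_lines) >= 10000:  # flush in blocks instead of line by line
                o.writelines(out_lines)
                out_lines.clear()
        o.writelines(out_lines)  # write whatever is left in the buffer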
John 9631