How can I improve the speed of a Python script that compares two lists and a value between a range?

Question

I have two large file data sets:

File1:
Gen1 1 1 10
Gen2 1 2 20
Gen3 2 30 40

File2:
A 1 4
B 1 15
C 2 2

Expected output:

Out:
Gen1 1 1 10 A 1 4
Gen2 1 2 20 B 1 15

My current code looks for lines of file 2 that match a line of file 1: the second column must be equal (file2[1] == file1[1]) and the file 2 value (file2[2]) must fall within the range given by file1[2] and file1[3].

My code that does this is below:

for i in file1:
    temp = i.split()
    for a in file2:
        temp2 = a.split()
        # split() returns strings, so convert to int before comparing the range
        if temp[1] == temp2[1] and int(temp[2]) <= int(temp2[2]) <= int(temp[3]):
            print(i + " " + a + "\n")
        else:
            continue

The code works, but I feel that it takes a lot longer than it should. Is there a simpler way or method to do this? I feel that there is some sort of clever use of map or hashes that I'm not doing.

Thank you!

Use pandas; it uses a compiled backend and this will be a one-liner — maxymoo, Mar 16 '17 at 23:25
I've never heard of pandas, I'll admit my coding is a bit intro. I'll look into that. — perot57, Mar 16 '17 at 23:28
It depends upon how large the files are. If they fit into memory, use Pandas. If not, a problem like this will be typically solved using a database. You are using `print` so I am guessing that the files are not *that* big? — ssm, Mar 17 '17 at 02:37
I used print in the above example; in the actual code the lines are being written to a file. The files are maybe 1/2 GB. — perot57, Mar 17 '17 at 15:49

score 0 · Answer 1 · edited May 23 '17 at 11:46

Pandas could be a good choice. See this example.

I prefer SQLite over pandas when the files are big. pandas DataFrames can also be loaded from a SQLite database.

import sqlite3

file1 = """Gen1 1 1 10
Gen2 1 2 20
Gen3 2 30 40"""

file2 = """A 1 4
B 1 15
C 2 2"""

# your code (fixed)
print("desired output")
for i in file1.splitlines():
    temp = i.split()
    for a in file2.splitlines():
        temp2 = a.split()
        if temp[1] == temp2[1] and int(temp2[2]) >= int(temp[2]) and int(temp2[2]) <= int(temp[3]):
            print(i + " " + a)


# Make an in-memory db
# Set a filename if your files are too big or if you want to reuse this db
con = sqlite3.connect(":memory:")
c = con.cursor()

c.execute("""CREATE TABLE file1
(
    gene_name text,
    a integer,
    b1 integer,
    b2 integer
)""")

for row in file1.splitlines():
    if row:
        c.execute("INSERT INTO file1 (gene_name, a, b1, b2) VALUES (?,?,?,?)", tuple(row.split()))

c.execute("""CREATE TABLE file2
(
    name text,
    a integer,
    b integer
)""")

for row in file2.splitlines():
    if row:
        c.execute("INSERT INTO file2 (name, a, b) VALUES (?,?,?)", tuple(row.split()))

# join the two tables
print("sqlite3 output")
for row in c.execute("""SELECT
    file1.gene_name,
    file1.a,
    file1.b1,
    file1.b2,
    file2.name,
    file2.a,
    file2.b
FROM file1
JOIN file2 ON file1.a = file2.a AND file2.b >= file1.b1 AND file2.b <= file1.b2
"""):
    print(row)

con.close()

Output:

desired output
Gen1 1 1 10 A 1 4
Gen2 1 2 20 A 1 4
Gen2 1 2 20 B 1 15
sqlite3 output
(u'Gen1', 1, 1, 10, u'A', 1, 4)
(u'Gen2', 1, 2, 20, u'A', 1, 4)
(u'Gen2', 1, 2, 20, u'B', 1, 15)
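
For files around the size mentioned in the comments, two adjustments to the script above may help. This is only a sketch and has not been measured on real data. It inserts rows in bulk with `executemany` and adds a composite index on `file2 (a, b)`, which lets SQLite resolve the equality on `a` and the range on `b` through the index instead of scanning the whole table for every gene. The function name `load_and_join`, the index name `idx_file2_a_b`, and the `db_path` parameter are made up for this illustration.

```python
import sqlite3


def load_and_join(file1_lines, file2_lines, db_path=":memory:"):
    """Load both inputs into SQLite, index file2, and return the joined rows.

    file1_lines / file2_lines can be any iterables of text lines,
    for example open file objects. All names here are illustrative.
    """
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            "CREATE TABLE file1 (gene_name TEXT, a INTEGER, b1 INTEGER, b2 INTEGER)"
        )
        con.execute("CREATE TABLE file2 (name TEXT, a INTEGER, b INTEGER)")

        # executemany accepts a generator, so lines are inserted as they are read
        con.executemany(
            "INSERT INTO file1 (gene_name, a, b1, b2) VALUES (?,?,?,?)",
            (tuple(line.split()) for line in file1_lines if line.strip()),
        )
        con.executemany(
            "INSERT INTO file2 (name, a, b) VALUES (?,?,?)",
            (tuple(line.split()) for line in file2_lines if line.strip()),
        )

        # Composite index: equality lookup on a, then a range scan on b
        con.execute("CREATE INDEX idx_file2_a_b ON file2 (a, b)")
        con.commit()

        query = """
            SELECT file1.gene_name, file1.a, file1.b1, file1.b2,
                   file2.name, file2.a, file2.b
            FROM file1
            JOIN file2
              ON file2.a = file1.a
             AND file2.b BETWEEN file1.b1 AND file1.b2
            ORDER BY file1.rowid, file2.rowid
        """
        return con.execute(query).fetchall()
    finally:
        con.close()


# Same sample data as in the question
file1 = """Gen1 1 1 10
Gen2 1 2 20
Gen3 2 30 40"""

file2 = """A 1 4
B 1 15
C 2 2"""

for row in load_and_join(file1.splitlines(), file2.splitlines()):
    print(" ".join(str(field) for field in row))
```

With real files, you would pass the open file objects instead of `splitlines()`. If the result set is large, it is probably better to iterate over the cursor and write each row out rather than call `fetchall()`.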
