2

I have put together a simple Python script which reads a large list of algebraic expressions from a text file on separate lines, evaluates the mathematics on each line and puts it into a numpy array. The eigenvalues of this matrix are then found. The parameters A,B,C will then be changed and the program run again, hence a function is used to achieve this.

Some of these text files will have millions of lines of equations, so after profiling the code I found that the eval command accounts for approximately 99% of the execution time. I am aware of the dangers of using eval but this code will only ever be used by myself. All other parts of the code are fast, except the call to eval.

Here is the code where mat_size is set to 500 which represents a 500*500 array meaning 250,000 lines of equations are being read in from the file. I cannot provide the file as it is ~ 0.5GB in size, but have provided an example of what it looks like below and it only uses basic mathematical operations.

import numpy as np
from numpy import *
from scipy.linalg import eigvalsh

mat_size = 500

# Read the file line by line
with open("test_file.txt", 'r') as f:
    lines = f.readlines() 

# Function to evaluate the maths and build the numpy array
def my_func(A,B,C):
    lst = []
    for i in lines:   
        # Strip the \n
        new = eval(i.rstrip())
        lst.append(new)
    # Build the numpy array
    AA = np.array(lst,dtype=np.float64)
    # Resize it to mat_size
    matt = np.resize(AA,(mat_size,mat_size))
    return matt 

# Function to find eigenvalues of matrix
def optimise(x):
    A,B,C = x
    test = my_func(A,B,C)
    ev=-1*eigvalsh(test)
    return ev[-(1)]   

# Define what A,B,C are, this can be changed each time the program is run
x0 = [7.65,5.38,4.00]

# Print result
print(optimise(x0))

A few lines of an example input text file: (mat_size can be changed to 2 to run this file)

.5/A**3*B**5+C
35.5/A**3*B**5+3*C
.8/C**3*A**5+C**9
.5/A*3+B**5-C/45

I am aware eval is usually bad practice and slow, so I looked for other means to achieving a speed up. I tried methods outlined here but none of these appeared to work. I also tried applying sympy to the problem but that caused a massive slowdown. What is a better way of going about this problem?

EDIT

From the suggestion to use numexpr instead, I have come across an issue where it grinds to a halt compared to the standard eval. For some instances the matrix elements contain quite a lot of algebraic expressions. Here is an example of just one matrix element, i.e one of the equations in the file (it contains a few more terms not defined in the code above, but can be easily defined at top of the code):

-71*A**3/(A+B)**7-61*B**3/(A+B)**7-3/2/B**2/C**2*A**6/(A+B)**7-7/4/B**3/m3*A**6/(A+B)**7-49/4/B**2/C*A**6/(A+B)**7+363/C*A**3/(A+B)**7*z3+451*B**3/C/(A+B)**7*z3-3/2*B**5/C/A**2/(A+B)**7-3/4*B**7/C/A**3/(A+B)**7-1/B/C**3*A**6/(A+B)**7-3/2/B**2/C*A**5/(A+B)**7-107/2/C/m3*A**4/(A+B)**7-21/2/B/C*A**4/(A+B)**7-25/2*B/C*A**2/(A+B)**7-153/2*B**2/C*A/(A+B)**7-5/2*B**4/C/m3/(A+B)**7-B**6/C**3/A/(A+B)**7-21/2*B**4/C/A/(A+B)**7-7/4/B**3/C*A**7/(A+B)**7+86/C**2*A**4/(A+B)**7*z3+90*B**4/C**2/(A+B)**7*z3-1/4*B**6/m3/A**3/(A+B)**7-149/4/B/C*A**5/(A+B)**7-65*B**2/C**3*A**4/(A+B)**7-241/2*B/C**2*A**4/(A+B)**7-38*B**3/C**3*A**3/(A+B)**7+19*B**2/C**2*A**3/(A+B)**7-181*B/C*A**3/(A+B)**7-47*B**4/C**3*A**2/(A+B)**7+19*B**3/C**2*A**2/(A+B)**7+362*B**2/C*A**2/(A+B)**7-43*B**5/C**3*A/(A+B)**7-241/2*B**4/C**2*A/(A+B)**7-272*B**3/C*A/(A+B)**7-25/4*B**6/C**2/A/(A+B)**7-77/4*B**5/C/A/(A+B)**7-3/4*B**7/C**2/A**2/(A+B)**7-23/4*B**6/C/A**2/(A+B)**7-11/B/C**2*A**5/(A+B)**7-13/B**2/m3*A**5/(A+B)**7-25*B/C**3*A**4/(A+B)**7-169/4/B/m3*A**4/(A+B)**7-27*B**2/C**3*A**3/(A+B)**7-47*B/C**2*A**3/(A+B)**7-27*B**3/C**3*A**2/(A+B)**7-38*B**2/C**2*A**2/(A+B)**7-131/4*B/m3*A**2/(A+B)**7-25*B**4/C**3*A/(A+B)**7-65*B**3/C**2*A/(A+B)**7-303/4*B**2/m3*A/(A+B)**7-5*B**5/C**2/A/(A+B)**7-49/4*B**4/m3/A/(A+B)**7-1/2*B**6/C**2/A**2/(A+B)**7-5/2*B**5/m3/A**2/(A+B)**7-1/2/B/C**3*A**7/(A+B)**7-3/4/B**2/C**2*A**7/(A+B)**7-25/4/B/C**2*A**6/(A+B)**7-45*B/C**3*A**5/(A+B)**7-3/2*B**7/C**3/A/(A+B)**7-123/2/C*A**4/(A+B)**7-37/B*A**4/(A+B)**7-53/2*B*A**2/(A+B)**7-75/2*B**2*A/(A+B)**7-11*B**6/C**3/(A+B)**7-39/2*B**5/C**2/(A+B)**7-53/2*B**4/C/(A+B)**7-7*B**4/A/(A+B)**7-7/4*B**5/A**2/(A+B)**7-1/4*B**6/A**3/(A+B)**7-11/C**3*A**5/(A+B)**7-43/C**2*A**4/(A+B)**7-363/4/m3*A**3/(A+B)**7-11*B**5/C**3/(A+B)**7-45*B**4/C**2/(A+B)**7-451/4*B**3/m3/(A+B)**7-5/C**3*A**6/(A+B)**7-39/2/C**2*A**5/(A+B)**7-49/4/B**2*A**5/(A+B)**7-7/4/B**3*A**6/(A+B)**7-79/2/C*A**3/(A+B)**7-207/2*B**3/C/(A+B)**7+22/B/C**2*A**5/(A+B)**7*z3+94*B/C**2*A**3/(A+B)**7*z3+76*B**2/C**2*A**2/(A+B)**7*z3+130*B**3/C**2*A/(A+B)**7*z3+10*B**5/C**2/A/(A+B)**7*z3+B**6/C**2/A**2/(A+B)**7*z3+3/B**2/C**2*A**6/(A+B)**7*z3+7/B**3/C*A**6/(A+B)**7*z3+52/B**2/C*A**5/(A+B)**7*z3+169/B/C*A**4/(A+B)**7*z3+131*B/C*A**2/(A+B)**7*z3+303*B**2/C*A/(A+B)**7*z3+49*B**4/C/A/(A+B)**7*z3+10*B**5/C/A**2/(A+B)**7*z3+B**6/C/A**3/(A+B)**7*z3-3/4*B**7/C/m3/A**3/(A+B)**7-7/4/B**3/C/m3*A**7/(A+B)**7-49/4/B**2/C/m3*A**6/(A+B)**7-149/4/B/C/m3*A**5/(A+B)**7-293*B/C/m3*A**3/(A+B)**7+778*B**2/C/m3*A**2/(A+B)**7-480*B**3/C/m3*A/(A+B)**7-77/4*B**5/C/m3/A/(A+B)**7-23/4*B**6/C/m3/A**2/(A+B)**7

numexpr completely chokes when the matrix elements are of this form, whereas eval evaluates it instantaneously. For just a 10*10 matrix (100 equations in file) numexpr takes about 78 seconds to process the file, whereas eval takes 0.01 seconds. Profiling the code that uses numexpr reveals that the getExprnames and precompile function of numexpr are the causes of the issue with precompile taking 73.5 seconds of the total time and getExprNames taking 3.5 seconds of the time. Why would the precompile cause such a bottleneck in this particular calculation along with the getExprNames? Is this module just not well suited to long algebraic expressions?

Yeti
  • 425
  • 4
  • 15

1 Answers1

1

I found a way to speed eval() up in this particular instance by making use of the multiprocessing library. I read the file in as usual, but then break the list into equal sized sub-lists which can then be processed separately on different CPU's and the evaluated sub-lists recombined at the end. This offers a nice speedup over the original method. I am sure the code below can be simplified/optimised; but for now it works (for instance what if there is a prime number of list elements? this will mean unequal lists). Some rough benchmarks show it is ~ 3 times faster using the 4 CPU's of my laptop. Here is the code:

from multiprocessing import Process, Queue

with open("test.txt", 'r') as h:
    linesHH = h.readlines()


# Get the number of list elements
size = len(linesHH)

# Break apart the list into the desired number of chunks
chunk_size = size/4
chunks = [linesHH[x:x+chunk_size] for x in xrange(0, len(linesHH), chunk_size)]

# Declare variables
A = 0.1
B = 2
C = 2.1
m3 = 1
z3 = 2

# Declare all the functions that process the substrings
def my_funcHH1(A,B,C,que):   #add a argument to function for assigning a queue to each chunk function
    lstHH1 = []
    for i in chunks[0]:
        HH1 = eval(i)
        lstHH1.append(HH1)
    que.put(lstHH1)

def my_funcHH2(A,B,C,que):    
    lstHH2 = []
    for i in chunks[1]:
        HH2 = eval(i)
        lstHH2.append(HH2)
    que.put(lstHH2)

def my_funcHH3(A,B,C,que):    
    lstHH3 = []
    for i in chunks[2]:
        HH3 = eval(i)
        lstHH3.append(HH3)
    que.put(lstHH3)

def my_funcHH4(A,B,C,que):    
    lstHH4 = []
    for i in chunks[3]:
        HH4 = eval(i)
        lstHH4.append(HH4)
    que.put(lstHH4)

queue1 = Queue()    
queue2 = Queue()  
queue3 = Queue()  
queue4 = Queue()  

# Declare the processes
p1 = Process(target= my_funcHH1, args= (A,B,C,queue1))    
p2 = Process(target= my_funcHH2, args= (A,B,C,queue2))
p3 = Process(target= my_funcHH3, args= (A,B,C,queue3))
p4 = Process(target= my_funcHH4, args= (A,B,C,queue4))

# Start them
p1.start()
p2.start()
p3.start()
p4.start()

HH1 = queue1.get()
HH2 = queue2.get()
HH3 = queue3.get()
HH4 = queue4.get()
p1.join()
p2.join()
p3.join()
p4.join()

# Obtain the final result by combining lists together again.
mergedlist = HH1 + HH2 + HH3 + HH4
Yeti
  • 425
  • 4
  • 15