
I have written a script that reads around 4 million points and 800,000 plots. For each plot, the script clips the points that fall inside it and saves them to a new text file.

After a certain period of time my PC's memory fills up. I have tried to dig into my script, but on each pass of the loop `for i in xrange(len(sr)):` every object is replaced and the clipped points are saved to a new txt file.

Is there some strategy I can use in this case to improve memory usage without reducing performance (the script is already slow)? I am a beginner in Python, so sorry if the question is simple.

Thanks in advance, Gianni

inFile ="C://04-las_clip_inside_area//prova//Ku_115_class_Notground_normalize.las"
poly ="C://04-las_clip_inside_area//prova//ku_115_plot_clip.shp"
chunkSize = None
MinPoints = 1

sf = shapefile.Reader(poly) #open shpfile
sr = sf.shapeRecords()
poly_filename, ext = path.splitext(poly)
inFile_filename = os.path.splitext(os.path.basename(inFile))[0]
pbar = ProgressBar(len(sr)) # set progressbar
if chunkSize == None:
    points = [(p.x,p.y) for p in lasfile.File(inFile,None,'r')]
    for i in xrange(len(sr)):
        pbar.update(i+1) # progressbar
        verts = np.array(sr[i].shape.points,float)
        record = sr[i].record[0]
        index = nonzero(points_inside_poly(points, verts))[0]
        if len(index) >= MinPoints:
            file_out = open("{0}_{1}_{2}.txt".format(poly_filename, inFile_filename, record), "w")
            inside_points = [lasfile.File(inFile,None,'r')[l] for l in index]
            for p in inside_points:
                file_out.write("%s %s %s %s %s %s %s %s %s %s %s" % (p.x, p.y, p.z, p.intensity,p.return_number,p.number_of_returns,p.scan_direction,p.flightline_edge,p.classification,p.scan_angle,record)+ "\n")
            file_out.close()
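
One pattern worth noting in the snippet above: `lasfile.File(inFile, None, 'r')` is opened again for every polygon that passes the `MinPoints` test, and those readers are never closed. Below is a sketch that opens a single reader instead; it assumes, as the original code does, that a liblas file object can be indexed, and additionally that indexing still works after the reader has been iterated:

las = lasfile.File(inFile, None, 'r')   # open once, before the loop
points = [(p.x, p.y) for p in las]
for i in xrange(len(sr)):
    verts = np.array(sr[i].shape.points, float)
    record = sr[i].record[0]
    index = nonzero(points_inside_poly(points, verts))[0]
    if len(index) >= MinPoints:
        out_name = "{0}_{1}_{2}.txt".format(poly_filename, inFile_filename, record)
        with open(out_name, "w") as file_out:
            for l in index:
                p = las[l]  # reuse the open reader instead of reopening the file
                fields = (p.x, p.y, p.z, p.intensity, p.return_number,
                          p.number_of_returns, p.scan_direction,
                          p.flightline_edge, p.classification,
                          p.scan_angle, record)
                file_out.write(' '.join(map(str, fields)) + '\n')
las.close()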

This is the original function:

def LAS2TXTClipSplitbyChunk(inFile,poly,chunkSize=1,MinPoints=1):
    sf = shapefile.Reader(poly) #open shpfile
    sr = sf.shapeRecords()
    poly_filename, ext = path.splitext(poly)
    inFile_filename = os.path.splitext(os.path.basename(inFile))[0]
    pbar = ProgressBar(len(sr)) # set progressbar
    if chunkSize == None:
        points = [(p.x,p.y) for p in lasfile.File(inFile,None,'r')]
        for i in xrange(len(sr)):
            pbar.update(i+1) # progressbar
            verts = np.array(sr[i].shape.points,float)
            record = sr[i].record[0]
            index = nonzero(points_inside_poly(points, verts))[0]
            if len(index) >= MinPoints:
                file_out = open("{0}_{1}_{2}.txt".format(poly_filename, inFile_filename, record), "w")
                inside_points = [lasfile.File(inFile,None,'r')[l] for l in index]
                for p in inside_points:
                    file_out.write("%s %s %s %s %s %s %s %s %s %s %s" % (p.x, p.y, p.z, p.intensity,p.return_number,p.number_of_returns,p.scan_direction,p.flightline_edge,p.classification,p.scan_angle,record)+ "\n")
                file_out.close()
    else:
        for i in xrange(len(sr)):
            pbar.update(i+1) # progressbar
            verts = np.array(sr[i].shape.points,float)
            record = sr[i].record[0]
            f = lasfile.File(inFile,None,'r')
            file_out = open("{0}_{1}_{2}.txt".format(poly_filename, inFile_filename, record), "w")
            TotPoints = 0
            while True:
                chunk = list(islice(f,chunkSize))
                if not chunk:
                    break
                points = [(p.x,p.y) for p in chunk]
                index = nonzero(points_inside_poly(points, verts))[0]
                TotPoints += len(index) #add points to count inside th plot
                chunk = [chunk[l] for l in index]
                for p in chunk:
                    file_out.write("%s %s %s %s %s %s %s %s %s %s %s" % (p.x, p.y, p.z, p.intensity,p.return_number,p.number_of_returns,p.scan_direction,p.flightline_edge,p.classification,p.scan_angle,record)+ "\n")
            if TotPoints >= MinPoints:
                file_out.close()
            else:
                file_out.close()
                os.remove("{0}_{1}_{2}.txt".format(poly_filename, inFile_filename, record))
            f.close()
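
As an aside, the chunking idiom used in the `else` branch (repeatedly slicing `chunkSize` points off the file iterator with `islice`) can be isolated into a small helper; `chunks` below is a name introduced here for illustration:

from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            break
        yield chunk

# e.g.: for chunk in chunks(lasfile.File(inFile, None, 'r'), 100000): ...

This keeps only `size` points in memory at a time, which is the whole point of the chunked branch.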

The script, following unutbu's suggestion, is:

import shapefile
import os
import glob
from os import path
import numpy as np
from numpy import nonzero
from matplotlib.nxutils import points_inside_poly
from itertools import islice
from liblas import file as lasfile
from shapely.geometry import Polygon
from progressbar import ProgressBar
import multiprocessing as mp


inFile ="C://04-las_clip_inside_area//prova//Ku_115_class_Notground_normalize.las"
poly ="C://04-las_clip_inside_area//prova//ku_115_plot_clip.shp"
chunkSize = None
MinPoints = 1

def pointinside(record):
    verts = np.array(record.shape.points, float)
    record = record.record[0]
    index = nonzero(points_inside_poly(points, verts))[0]
    if len(index) >= MinPoints:
        outfile = "{0}_{1}_{2}.txt".format(poly_filename, inFile_filename, record)
        with open(outfile, "w") as file_out:
            inside_points = [lasfile.File(inFile, None, 'r')[l] for l in index]
            for p in inside_points:
                fields = (p.x, p.y, p.z, p.intensity, p.return_number,
                          p.number_of_returns, p.scan_direction, p.flightline_edge,
                          p.classification, p.scan_angle, record)
                file_out.write(' '.join(map(str, fields)) + "\n")

sf = shapefile.Reader(poly) #open shpfile
sr = sf.shapeRecords()
poly_filename, ext = path.splitext(poly)
inFile_filename = os.path.splitext(os.path.basename(inFile))[0]
pbar = ProgressBar(len(sr)) # set progressbar
if chunkSize == None:
    points = [(p.x,p.y) for p in lasfile.File(inFile,None,'r')]
    for i in xrange(len(sr)):
        pbar.update(i+1) # progressbar
        proc = mp.Process(target = pointinside, args = (sr[i], ))
        proc.start()
        proc.join()
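
Note: as unutbu explains in the comments on the answer below, on Windows `mp.Process` spawns a subprocess that re-imports the whole script, so the spawning loop above must be protected by an `if __name__ == '__main__':` guard, or every child will spawn children of its own. A guarded version of the loop (same names as the script above):

if __name__ == '__main__':
    for i in xrange(len(sr)):
        pbar.update(i+1)  # progressbar
        proc = mp.Process(target=pointinside, args=(sr[i],))
        proc.start()
        proc.join()  # run only one subprocess at a time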
  • It would help to know some numbers: how many items will there be in the `points` list? What about `inside_points`? – jpa Oct 13 '12 at 11:28
  • Sorry, I wasted 10 minutes of my life working out what `sr`, `sf`, `verts`, etc. actually mean but still got nowhere. Hence my advice is to make your code so easy that dummies understand it. You never know, you might even solve your own problem that way. – John Mee Oct 13 '12 at 11:42

1 Answer


The only reliable way to free memory used for a temporary computation is to run that computation in a subprocess. When the subprocess ends, the memory will be freed.
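
A minimal illustration of the pattern (`work` here is just a stand-in for the loop body):

import multiprocessing as mp

def work(n):
    # Build a large temporary object. In a long-running process, CPython
    # may keep this memory reserved even after `data` is garbage collected.
    data = [i * i for i in xrange(n)]

if __name__ == '__main__':
    proc = mp.Process(target=work, args=(10 ** 7,))
    proc.start()
    proc.join()  # when the subprocess exits, its memory is returned to the OS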

If you move the code in the outer loop into a function (let's call it `work`), then you can run `work` in a subprocess using the `multiprocessing` module:

import sys
import os
import time
import itertools
import multiprocessing as mp
import numpy as np
import matplotlib.nxutils as nx
import liblas
import shapefile

clock = time.clock if sys.platform == 'win32' else time.time

def LAS2TXTClipSplitbyChunk(inFile, poly, chunkSize = 1, MinPoints = 1):
    sf = shapefile.Reader(poly) #open shpfile
    sr = sf.shapeRecords()
    poly_filename, ext = os.path.splitext(poly)
    for record in sr:
        inFile_filename = os.path.splitext(os.path.basename(inFile))[0]
        record_num = record.record[0]        
        out_filename = '{0}_{1}_{2}.txt'.format(
            poly_filename, inFile_filename, record_num)
        pool.apply_async(pointinside,
                         args = (record, out_filename, inFile, chunkSize,
                                 MinPoints, bufferSize),
                         callback = update)

def pointinside(record, out_filename, inFile, chunkSize, MinPoints, bufferSize):
    # bufferSize is passed in explicitly because, on Windows, the
    # `if __name__ == '__main__'` block below never runs in the worker
    # processes, so the global would not exist there.
    start = clock()
    record_num = record.record[0]   
    verts = np.array(record.shape.points, float)
    f = iter(liblas.file.File(inFile, None, 'rb'))
    result = []
    worth_writing = False
    for chunk in iter(lambda: list(itertools.islice(f, chunkSize)), []):
        points = [(p.x, p.y) for p in chunk]
        index = nx.points_inside_poly(points, verts)
        chunk = [p for inside, p in itertools.izip(index,chunk) if inside]
        for p in chunk:
            fields = (p.x, p.y, p.z, p.intensity, p.return_number,
                      p.number_of_returns, p.scan_direction, p.flightline_edge,
                      p.classification, p.scan_angle, record_num)
            result.append(' '.join(map(str, fields)))
        if len(result) >= bufferSize:
            # Writing to disk is slow. Doing it once for every iteration is
            # inefficient.  So instead build up bufferSize number of lines
            # before writing them all to disk.
            worth_writing = True
            with open(out_filename, 'a') as file_out:
                file_out.write('\n'.join(result)+'\n')
            result = []
    # In case there were some results (less than bufferSize lines), we
    # dump them to disk here.
    if (len(result) >= MinPoints) or worth_writing:
        with open(out_filename, 'a') as file_out:
            file_out.write('\n'.join(result)+'\n')
    f.close()                    
    end = clock()
    return end-start

def update(result):
    with open(debug_filename, 'a') as f:
        f.write('{r}\n'.format(r = result))

if __name__ == '__main__':
    workdir = 'C://04-las_clip_inside_area//prova//'
    # workdir = os.path.expanduser('~/tmp/tmp')
    os.chdir(workdir)
    inFile = 'Ku_115_class_Notground_normalize.las'
    poly = 'ku_115_plot_clip.shp'
    debug_filename = 'debug.dat'
    chunkSize = None
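    # Note: chunkSize = None makes islice consume the entire file as a
    # single chunk; a positive integer bounds per-chunk memory use.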
    MinPoints = 1
    bufferSize = max(MinPoints, 100)

    pool = mp.Pool()
    LAS2TXTClipSplitbyChunk(inFile, poly, chunkSize, MinPoints)
    pool.close()
    pool.join()

Here is a plot of the times each task is taking to complete:

In [129]: import matplotlib.pyplot as plt

In [130]: import numpy as np

In [131]: x = np.genfromtxt('debug.dat')

In [132]: plt.plot(x)
Out[132]: [<matplotlib.lines.Line2D object at 0xe309b4c>]

In [133]: plt.show()

[Plot of per-task completion times: roughly flat, with no upward trend.]

I'm not seeing any progressive slow-down. Perhaps give this code a try.

unutbu
  • Dear unutbu, thanks for your help and suggestion. I don't wish to abuse your time, but I have the following question because I am curious now. If I wish to create a function (`def ClipSlip(inFile, poly, chunkSize=None, MinPoints=1):`), do I need to create two functions (`work` before, and `ClipSlip` after), or can I insert `work` inside `ClipSlip`? – Gianni Spear Oct 13 '12 at 19:08
  • Yes, sorry, maybe I wrote it wrong (it's Saturday night). I made an update in my previous post where I wrote a function `def LAS2TXTClipSplitbyChunk(inFile, poly, chunkSize=1, MinPoints=1)`. If I wish to use your suggestion, do I need to nest `work` inside `LAS2TXTClipSplitbyChunk`? – Gianni Spear Oct 13 '12 at 19:44
  • I am testing your solution, but it's extremely slow. Without `work` I can clip 5 plots in 1 minute; using `work`, after 10 minutes I cannot see a single file. Furthermore, the Windows Task Manager shows 56 Python processes running. – Gianni Spear Oct 13 '12 at 19:49
  • Could you post your code? The `proc.join()` was included so only 1 subprocess would run at a time... (Show how and where you are calling mp.Process.) – unutbu Oct 13 '12 at 19:50
  • Ah Windows. You need to "protect" the call to `mp.Process` inside an `if __name__ == '__main__'` block. I will edit the post to show what I mean. – unutbu Oct 13 '12 at 20:00
  • Really, thanks; without your help I would never have figured this out. From experts we can always learn and improve. – Gianni Spear Oct 13 '12 at 20:02
  • My mistake. On Windows, `mp.Process` spawns a subprocess which imports the entire script. Without the `if __name__ == '__main__'`, importing the entire script calls `mp.Process` again, and again, and again, fork bombing your computer. So sorry. – unutbu Oct 13 '12 at 20:05
  • Please, just one last question out of curiosity; I really like to learn. If I wish to create a function: 1) do I need to nest `def pointinside(record):` inside `def LAS2TXTClipSplitbyChunk(inFile, poly, chunkSize=1, MinPoints=1):`? 2) do I need to insert `if __name__ == '__main__'`? – Gianni Spear Oct 13 '12 at 20:09
  • In your real script, put `if __name__ == '__main__':` after the imports and `def` statements. Inside `LAS2TXTClipSplitbyChunk`, use `proc = mp.Process(target = pointinside, ...); proc.start(); proc.join()` to call `pointinside`. `def pointinside` can come before or after `def LAS2TXTClipSplitbyChunk` (see the consolidated sketch after this thread). – unutbu Oct 13 '12 at 20:13
  • OK, I am starting to understand. Please correct me if I am wrong: 1) create `def pointinside` (before or after) and `def LAS2TXTClipSplitbyChunk`; 2) use `proc = mp.Process(target = pointinside, ...)` following your example; 3) write `if __name__ == '__main__':` before the script. – Gianni Spear Oct 13 '12 at 20:18
  • Yes, that is right. There is one more issue, however. `pointinside` will only have access to globals and the arguments passed to the function. Any local variable inside `LAS2TXTClipSplitbyChunk` that `pointinside` needs must be passed along to it using the `args` parameter. For example, `args = (sr[i], poly_filename, inFile_filename, ...)` – unutbu Oct 13 '12 at 20:28
  • I am testing your function. It looks faster than my old one, but maybe I need to fix something, because there is no text file output. Not a great Saturday night, but I learned new stuff tonight!!! – Gianni Spear Oct 13 '12 at 20:29
  • The function runs (fast), but there is some problem in the output. – Gianni Spear Oct 13 '12 at 20:31
  • Since we are only using 1 subprocess, I don't expect this code to be much faster than your old code. Its main purpose was just to ensure that memory would be freed. If you have more than one processor it is possible to run multiple calls to `pointinside` concurrently. However, if the main bottleneck is disk IO, then there will be little speed gained since the processes will be waiting on each other to use the hard drive. – unutbu Oct 13 '12 at 20:47
  • I think `LAS2TXTClipSplitbyChunk` may be refactored to make the code look simpler. I'll edit my post. – unutbu Oct 13 '12 at 20:49
  • Thanks, tonight your help really helped me (a real beginner in Python, just 3 weeks) to understand what goes on inside. Really, thanks. I will save your code and use it as teaching material. – Gianni Spear Oct 13 '12 at 20:53
  • There might be an error in how I am handling points. I'll think about it and try to fix... – unutbu Oct 13 '12 at 20:56
  • Yes, here it is: Traceback (most recent call last): File "C:\PythonScript\LAS2TXTClipSplitbyChunk.py", line 91, in LAS2TXTClipSplitbyChunk(inFile ="C://04-las_clip_inside_area//prova//Ku_115_class_Notground_normalize.las",poly ="C://04-las_clip_inside_area//prova//ku_115_plot_clip.shp",chunkSize=None,MinPoints=1) File "C:\PythonScript\LAS2TXTClipSplitbyChunk.py", line 58, in LAS2TXTClipSplitbyChunk args = (sr[i], points, poly_filename, inFile_filename, NameError: global name 'inFile_filename' is not defined – Gianni Spear Oct 13 '12 at 20:56
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/17983/discussion-between-unutbu-and-gianni) – unutbu Oct 13 '12 at 21:05
  • I've made a change to hopefully prevent empty files, and maybe even get the progressbar working. Let me know how it goes. – unutbu Oct 14 '12 at 00:40
  • The code produces the error message below, but it's close to perfect. Compliments to unutbu. – Gianni Spear Oct 14 '12 at 13:52
  • Exception in thread Thread-3: ........ updater.pbar.update(update.i) NameError: global name 'update' is not defined – Gianni Spear Oct 14 '12 at 13:53
  • Oops, change that to `updater.pbar.update(updater.i)` (The last "update" should be "updater" with an r). – unutbu Oct 14 '12 at 13:55
  • Yep, I am testing it right now. The report: 1) no more files with zero points; 2) after ~400 plots the process became really slow. – Gianni Spear Oct 14 '12 at 14:03
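
Putting the advice from this thread together, the overall structure unutbu describes looks roughly like the sketch below (the output filename scheme and the stubbed `pointinside` body are illustrative, not the actual implementation):

import multiprocessing as mp
import shapefile

def pointinside(record, out_filename, inFile, chunkSize, MinPoints):
    # Clip one shapefile record. This runs in a fresh process, so it may
    # use only its arguments and module-level globals.
    pass  # ... clipping and writing, as in the answer above ...

def LAS2TXTClipSplitbyChunk(inFile, poly, chunkSize=1, MinPoints=1):
    sr = shapefile.Reader(poly).shapeRecords()
    for i in xrange(len(sr)):
        # every local that pointinside needs is passed through args
        proc = mp.Process(target=pointinside,
                          args=(sr[i], 'plot_%d.txt' % i, inFile,
                                chunkSize, MinPoints))
        proc.start()
        proc.join()  # one subprocess at a time; its memory is freed on exit

if __name__ == '__main__':  # required on Windows to avoid endless re-spawning
    LAS2TXTClipSplitbyChunk('Ku_115_class_Notground_normalize.las',
                            'ku_115_plot_clip.shp',
                            chunkSize=None, MinPoints=1)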