
I am writing a function to randomly select elements stored in a dictionary:

import random
from liblas import file as lasfile
from collections import defaultdict

def point_random_selection(points, k):
    # Take a random sample of k points; if the group holds fewer than
    # k points, random.sample raises ValueError, so keep them all.
    try:
        sample_point = random.sample(points, k)
    except ValueError:
        sample_point = points
    return sample_point

def world2Pixel_Id(x, y, X_Min, Y_Max, xDist, yDist):
    # Map a point to the "col_row" id of the grid cell containing it.
    col = int((x - X_Min) / xDist)
    row = int((Y_Max - y) / yDist)
    return "{0}_{1}".format(col, row)

def point_GridGroups(inFile, X_Min, Y_Max, xDist, yDist):
    # Read the whole LAS file and group the points by grid cell id.
    Groups = defaultdict(list)
    for p in lasfile.File(inFile, None, 'r'):
        id = world2Pixel_Id(p.x, p.y, X_Min, Y_Max, xDist, yDist)
        Groups[id].append(p)
    return Groups

where `k` is the number of elements to select and `Groups` is the dictionary:

file_out = lasfile.File("outPut", mode='w', header=h)
for m in Groups.iteritems():
    # select k points for each dictionary key
    point_selected = point_random_selection(m[1], k)
    for l in xrange(len(point_selected)):
        # save the data
        file_out.write(point_selected[l])
file_out.close()

My problem is that this approach is extremely slow (around 4 days for a ~800 MB file).
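
Before rewriting anything, it may be worth profiling to confirm whether the sampling or the liblas I/O is the bottleneck. A minimal sketch with the standard `cProfile` module; `run_selection` is a hypothetical wrapper name, not an existing function:

import cProfile

def run_selection():
    # Hypothetical wrapper: build Groups, sample each cell and write
    # the output file, exactly as in the loop above.
    pass

# Sort by cumulative time to see which calls dominate the runtime.
cProfile.run('run_selection()', sort='cumulative')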

Gianni Spear
  • `random.sample()` speed is already optimized but if you throw extremely large inputs at it, you have a different problem. What is in `Groups`? Is `Groups` filled with data points from the 800 MB file? – Martijn Pieters Feb 11 '13 at 11:02
  • Dear Martijn, yes, Groups is filled with data points from the 800 MB file. Probably the bottleneck is `file_out.write(point_selected[l])`, even though liblas is in C++ – Gianni Spear Feb 11 '13 at 11:06
  • How much data do you generate, what are you writing *to*, etc. Did you profile your code and determine that it's `random.sample()` that is slow here or are you just guessing? – Martijn Pieters Feb 11 '13 at 11:08
  • There are ways to take a random sample of lines from a file without reading the whole file into memory. See [Python random N lines from large file (no duplicate lines)](http://stackoverflow.com/q/12279017) and [Python random lines from subfolders](http://stackoverflow.com/q/12128948) – Martijn Pieters Feb 11 '13 at 11:08
  • The problem is I need to read the point file (x,y), assign an ID as a function of the spatial position inside a grid (e.g. 1 m x 1 m), and extract one random point (or more) for each grid cell. For this reason I need to read the whole point file first. – Gianni Spear Feb 11 '13 at 11:13
  • The random sample on read trick can be expanded to cover multiple categories quite easily. It depends on the input data; if you don't need to process the input data based on other data in the same file, you don't need to retain anything in memory other than the sample picked so far. – Martijn Pieters Feb 11 '13 at 11:14
  • Dear Martijn, I have other functions I wrote. With point_GridGroups I create the dictionary. About your suggestion, do you have an easy example that I can study? – Gianni Spear Feb 11 '13 at 11:16

1 Answer


You could try updating your samples as you read the coordinates. That at least saves you from having to store everything in memory before running your sample, although it is not guaranteed to make things faster.
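
The trick here is essentially reservoir sampling. For intuition, here is a minimal single-sample version; `reservoir_choice` is just an illustrative name, not a library function:

import random

def reservoir_choice(iterable):
    # Pick one item uniformly at random from a stream of unknown length.
    choice = None
    for i, item in enumerate(iterable):
        # Keep the new item with probability 1/(i+1); every item seen
        # so far then has an equal chance of being the final pick.
        if random.randint(0, i) == 0:
            choice = item
    return choice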

The following is based off of BlkKnght's excellent answer on building a random sample from file input without retaining all the lines; it just expands that idea to keep multiple samples, one per group.

import random
from liblas import file as lasfile
from collections import defaultdict


def world2Pixel_Id(x, y, X_Min, Y_Max, xDist, yDist):
    col = int((x - X_Min) / xDist)
    row = int((Y_Max - y) / yDist)
    return (col, row)

def random_grouped_samples(infile, n, X_Min, Y_Max, xDist, yDist):
    """Select up to n points *per group* from infile"""

    groupcounts = defaultdict(int)
    samples = defaultdict(list)

    for p in lasfile.File(infile, None, 'r'):
        id = world2Pixel_Id(p.x, p.y, X_Min, Y_Max, xDist, yDist)
        i = groupcounts[id]
        r = random.randint(0, i)

        if r < n:
            if i < n:
                samples[id].insert(r, p)  # add first n items in random order
            else:
                samples[id][r] = p  # at a decreasing rate, replace random items

        groupcounts[id] += 1

    return samples

The above function takes `infile` and your boundary coordinates, as well as the sample size `n`, and returns grouped samples that have at most `n` items in each group, picked uniformly.

Because the id is only used as a group key, I reduced it to the `(col, row)` tuple; there is no need to format it as a string.
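
A call could look like this; the file name, grid extent, and 1 m cell size below are assumed placeholders for your own values:

# Hypothetical extent and a 1 m x 1 m grid; n=1 keeps one point per cell.
samples = random_grouped_samples("inPut.las", n=1,
                                 X_Min=0.0, Y_Max=1000.0,
                                 xDist=1.0, yDist=1.0)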

You can write these out to a file with:

file_out = lasfile.File("outPut", mode='w', header=h)

for group in samples.itervalues():
    for p in group:
        file_out.write(p)

file_out.close()
Martijn Pieters
  • Thanks Martijn, you are always great! I am also looking for a strategy to pick a random value and replace it inside a Python dictionary. I hope that can speed up the system. What do you think? – Gianni Spear Feb 11 '13 at 12:06
  • I have no idea what you mean there. The above code reduces the problem to picking a random number for each read point; if things are still slow then the `liblas` reading/writing code is the bottleneck. Not sure what you could do about that. – Martijn Pieters Feb 11 '13 at 12:09
  • Example: running `for m in Groups.iteritems():`, m is `('3565_179', [, ])`. I wish to select a random element and replace it inside the dictionary. In the end I have a new dictionary where for each key there is only one point. From this new dictionary I wish to save the file – Gianni Spear Feb 11 '13 at 12:17
  • But why did you start with a sample then? Or are you writing a *second* file with *one* random pick from each group? In that case just loop over all the groups, and write out the result of `random.choice()` (sketched below, after these comments). No need to replace the lists in the `dict`. – Martijn Pieters Feb 11 '13 at 12:20
  • I am testing your solution: I always get this error message: Traceback (most recent call last): File "", line 1, in File "", line 22, in random_grouped_samples IndexError: list assignment index out of range – Gianni Spear Feb 11 '13 at 19:36
  • @Gianni: Ah, I think there was an off-by-one error in there; try again with the update. – Martijn Pieters Feb 11 '13 at 20:08
  • It works. Just make the file argument consistent: the signature used `infile` while the loop used `inFile`. – Gianni Spear Feb 11 '13 at 21:50
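
Following up on the `random.choice()` suggestion in the comments, a minimal sketch that writes exactly one random point per grid cell; `samples` is the dict returned by `random_grouped_samples` above, and `h` is assumed to be the LAS header from your own context:

import random
from liblas import file as lasfile

file_out = lasfile.File("outPut_one_per_cell", mode='w', header=h)  # h: assumed header
for group in samples.itervalues():
    # One uniform pick per grid cell; no need to rebuild the dictionary.
    file_out.write(random.choice(group))
file_out.close()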