3

I am having issues figuring out how to sort a large data set into more useful data.

The original file in CSV format is shown below- the data indicates x,y,z positions and finally energy. The x,y,z coordinates spread for quite a ways this is a small snippet below- basically it was an energy search over a volume.

-2.800000,-1.000000,5.470000,-0.26488315
-3.000000,1.000000,4.070000,-0.81185718
-2.800000,-1.000000,3.270000,1.29303723
-2.800000,-0.400000,4.870000,-0.51165026

Unfortunately its very difficult to plot in the requisite four dimensions so I need to trim this data. I would like to do this in such a way that I will turn the volume into a surface on the lowest energy z axis. On smaller data sets this was simple, in excel sort by X then Y and then energy, then delete all energies above the lowest. This was easy enough for small sets of data but has quickly become problematic.

I have tried various ways of doing this such as splitting the csv and using the sort command, but I am having little luck. Any advice on how to approach this would be much appreciated.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Daniel
  • 19,179
  • 7
  • 60
  • 74

3 Answers3

4

This does what you ask in your comment to Raymond's answer -- returns just the row with the lowest z for each x, y pair:

from operator import itemgetter
from itertools import groupby
from csv import reader


def min_z(iterable):
    # the data converted from strings to numbers
    floats = [[float(n) for n in row] for row in iterable]
    # the data sorted by x, y, z
    floats.sort(key=lambda (x, y, z, e): (x, y, z))
    # group the data by x, y
    grouped_floats = groupby(floats, key=itemgetter(slice(0, 2)))
    # return the first item from each group
    # because the data is sorted
    # the first item is the smallest z for the x, y group
    return [next(rowgroup) for xy, rowgroup in grouped_floats]


data = """-2.800000,-1.000000,5.470000,-0.26488315
-3.000000,1.000000,4.070000,-0.81185718
-2.800000,-1.000000,3.270000,1.29303723
-2.800000,-0.400000,4.870000,-0.51165026""".splitlines()


print min_z(reader(data))

Prints:

[[-3.0, 1.0, 4.07, -0.81185718], 
 [-2.8, -1.0, 3.27, 1.29303723], 
 [-2.8, -0.4, 4.87, -0.51165026]]
agf
  • 171,228
  • 44
  • 289
  • 238
  • This seems to work great the two problem I am having implementing is that the z needs to be the minimum energy and currently you are getting the maximum. – Daniel Nov 07 '11 at 00:16
  • Not sure what you mean.. it groups by column 1 and 2 and takes the minumum of column 3 for each group. Do you want to group on column 4 rather than 3? – agf Nov 07 '11 at 00:27
  • I've modified it slightly. If you want to sort by different columns just change `x, y, z` to `x, y, e` or whatever. – agf Nov 07 '11 at 00:32
  • Ah the results make more sense now. Yes the fourth column is the one that we need to take the minimum of. – Daniel Nov 07 '11 at 00:35
  • Works perfectly just need to get it to open a csv file! Thank you for the help. – Daniel Nov 07 '11 at 00:40
  • @Ophion Where I do `min_z(reader(data))` just do `min_z(reader(open('path/file.name', 'r')))`. – agf Nov 07 '11 at 04:36
2

After reading the data into a list of tuples with the csv.reader, sort the data by (x, y) values. For clarity, use named tuples to identify the fields.

Then use itertools.groupby to cluster the related (x, y) data points. For each group, use min to isolate the one with the lowest energy:

>>> import csv, collections, itertools

>>> raw_data = '''\
-2.800000,-1.000000,5.470000,-0.26488315
-3.000000,1.000000,4.070000,-0.81185718
-2.800000,-1.000000,3.270000,1.29303723
-2.800000,-0.400000,4.870000,-0.51165026
'''.splitlines()

>>> Sample = collections.namedtuple('Sample', ['x', 'y', 'z', 'energy'])
>>> data = [Sample(*row) for row in csv.reader(raw_data)]
>>> data.sort(key=lambda s: (s.x, s.y))
>>> for xy, group in itertools.groupby(data, key=lambda s: (s.x, s.y)):
        print min(group, key=lambda s: s.energy)


Sample(x='-2.800000', y='-0.400000', z='4.870000', energy='-0.51165026')
Sample(x='-2.800000', y='-1.000000', z='5.470000', energy='-0.26488315')
Sample(x='-3.000000', y='1.000000', z='4.070000', energy='-0.81185718')
Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
  • When I did this method I was unable to figure out how to snip out the Samples the other energies. So in the output you have I would want to keep like 1, 2, and 4- but delete line 3. – Daniel Nov 06 '11 at 21:40
  • Just edited the answer to show the use of a list comprehension to filter-out unwanted samples. – Raymond Hettinger Nov 06 '11 at 22:45
  • Right but that is not always the case each specific x,y coordinate needs the lowest energy picked from the possible z coordinates and that is not always positive or there is other negative values. Currently I'm trying to separate the lines into x,y groupings that are the same and then remove all but the lowest energy one. – Daniel Nov 06 '11 at 23:10
  • Okay, I've editted to show the groupings by (x, y) coordinate and finding the minimum of each group. Use caution when grouping by a floating point value, even small differences will create distinct groups. – Raymond Hettinger Nov 07 '11 at 00:07
0

I think numpy's lexsort will address your sorting needs.

In general I think your steps are:

  1. Read csv into numpy array -- have you tried python's csv package or numpy's genfromtext() function?

  2. Sort using lexsort

  3. Trim off unnecessary rows

EDIT: See this related SO question.

Community
  • 1
  • 1
jedwards
  • 29,432
  • 3
  • 65
  • 92
  • That looks very promising it seems to be a simpler method then what I was trying. There may be a creative way to sort it so that all unwanted rows are on the bottom. – Daniel Nov 06 '11 at 21:41