1

How can I delete the rows that have '0' as the value in the 5th column? Or, even better, can I choose a range (i.e. remove the rows that have values between -50 and 30 in the 5th column)?

The data looks like this:

 0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12
 0  4028.50  3455014.50    -5.86  0        0.0003   0.39
 0  7028.56  4523434.50    -4.95  -137.00  0.0005   0.25
 0  8828.62  4543414.50    -3.05  0        0.0021   0.61
 0  4028.44  4544434.50    -6.76  -107.00  0.0002   0.12
 0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39
 0  7028.56  4523434.50    -4.95  -127.00  0.0005   0.25
 0  8828.62  4543414.50    -3.05  0        0.0021   0.61
Mat
Chad

3 Answers

4
goodrows = [row for row in data if row.split()[4] != '0']

or

goodrows = [row for row in data if not (-50 <= float(row.split()[4]) <= 30)]
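A minimal, self-contained sketch of the two comprehensions above, assuming `data` is a list of whitespace-separated text lines. The sample rows are copied from the question; the names `nonzero_rows` and `rows_outside_range` are only illustrative. Comparing as floats rather than as the string '0' is a small variation, so that '0', '0.0' and '0.00' are all treated as zero:

```python
# Assumption: data is a list of text lines, as in the question's sample.
data = [
    "0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12",
    "0  4028.50  3455014.50    -5.86  0        0.0003   0.39",
    "0  7028.56  4523434.50    -4.95  -137.00  0.0005   0.25",
    "0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39",
]

# Keep rows whose fifth field (index 4) is numerically non-zero.
nonzero_rows = [row for row in data if float(row.split()[4]) != 0]

# Keep rows whose fifth field lies outside the closed range [-50, 30].
rows_outside_range = [
    row for row in data if not (-50 <= float(row.split()[4]) <= 30)
]

print(len(nonzero_rows))        # 3
print(len(rows_outside_range))  # 2
```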

Edit:

If your data is actually in a NumPy array, which your comment seems to indicate even if your post didn't:

goodrows = [row for row in data if row[4] != 0]

or

goodrows = [row for row in data if not (-50 <= row[4] <= 30)]

should work. NumPy can also do this without a Python-level loop, by indexing the array with a boolean mask built from the fifth column.
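As a rough sketch of what the NumPy-native approach might look like, assuming `data` is already a 2-D float array (the small array below is a stand-in built from the question's sample, and the variable names are illustrative):

```python
import numpy as np

# Assumption: data is already a 2-D float array (rows x columns).
data = np.array([
    [0, 4028.44, 4544434.50, -6.76, -117.00, 0.0002, 0.12],
    [0, 4028.50, 3455014.50, -5.86,    0.00, 0.0003, 0.39],
    [0, 7028.56, 4523434.50, -4.95, -137.00, 0.0005, 0.25],
    [0, 4028.50, 3455014.50, -5.86,  -11.00, 0.0003, 0.39],
])

fifth = data[:, 4]                         # fifth column (index 4)
nonzero = data[fifth != 0]                 # drop rows where it equals 0
in_range = (fifth >= -50) & (fifth <= 30)  # True for rows to remove
outside = data[~in_range]                  # keep only rows outside [-50, 30]

print(nonzero.shape)  # (3, 7)
print(outside.shape)  # (2, 7)
```

Boolean indexing returns a new array, so the original `data` is left untouched.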

agf
  • I've just tested this to see if they are identical: they're not. `int(row.split()[4])` `raise`s when it encounters `-117.00`. That may explain the -1... – johnsyweb Aug 09 '11 at 01:27
  • @Johnsyweb absolutely right, good catch. +1 to your answer. Note: I was not one of the downvoters. – agf Aug 09 '11 at 01:32
  • I get 'AttributeError: 'numpy.ndarray' object has no attribute 'split'' error with this one too. – Chad Aug 09 '11 at 15:33
  • OK, if it's already in an array, not in a list of strings in a file, just do `row[4]`. See my edit. Next time, make sure to say in your question if the data is in a NumPy array. We all assumed it was in a file in the format you posted. – agf Aug 09 '11 at 15:44
2

You can use NumPy to do this quickly:

data="""
0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12
0  4028.50  3455014.50    -5.86  0        0.0003   0.39
0  7028.56  4523434.50    -4.95  -137.00  0.0005   0.25
0  8828.62  4543414.50    -3.05  0        0.0021   0.61
0  4028.44  4544434.50    -6.76  -107.00  0.0002   0.12
0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39
0  7028.56  4523434.50    -4.95  -127.00  0.0005   0.25
0  8828.62  4543414.50    -3.05  0        0.0021   0.61
"""
from io import StringIO  # Python 3; on Python 2 use `from StringIO import StringIO`
import numpy as np
d = np.loadtxt(StringIO(data))  # load the text into a 2D NumPy array

print(d[d[:, 4] != 0])  # keep rows whose column 5 is not 0
print(d[(d[:, 4] < -50) | (d[:, 4] > 30)])  # keep rows whose column 5 is outside [-50, 30]
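If you prefer to think of this as deleting rows rather than selecting the ones to keep, `np.delete` with row indices from `np.where` is one possible alternative. This is only a sketch on a small stand-in array (the values and the names `unwanted` and `kept` are illustrative, not from the question):

```python
import numpy as np

# Assumption: a small stand-in array; only the fifth column matters here.
d = np.array([
    [0.0, 1.0, 2.0, 3.0, -117.0],
    [0.0, 1.0, 2.0, 3.0,    0.0],
    [0.0, 1.0, 2.0, 3.0,  -11.0],
    [0.0, 1.0, 2.0, 3.0,   45.0],
])

# Row indices whose fifth column falls inside the closed range [-50, 30].
unwanted = np.where((d[:, 4] >= -50) & (d[:, 4] <= 30))[0]

# np.delete returns a new array; the original d is left unchanged.
kept = np.delete(d, unwanted, axis=0)

print(kept[:, 4].tolist())  # [-117.0, 45.0]
```

For a simple filter like this, the boolean mask is usually the shorter form; `np.delete` is mostly useful when you already have the row indices.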
HYRY
  • I don't know if NumPy is the right tool, as it's not in the standard library... A list comprehension seems better – JBernardo Aug 09 '11 at 01:33
  • I got this error: File "", line 1, in File "/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/numpy/lib/npyio.py", line 796, in loadtxt items = [conv(val) for (conv, val) in zip(converters, vals)] ValueError: could not convert string to float: [[ – Chad Aug 09 '11 at 15:34
  • the program above can only convert numbers split by space. From the error message, it seems that you are trying some other data format. – HYRY Aug 09 '11 at 21:33
1

Assuming your data is in a plain text file like this:

$ cat data.txt 
0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12
0  4028.50  3455014.50    -5.86  0        0.0003   0.39
0  7028.56  4523434.50    -4.95  -137.00  0.0005   0.25
0  8828.62  4543414.50    -3.05  0        0.0021   0.61
0  4028.44  4544434.50    -6.76  -107.00  0.0002   0.12
0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39
0  7028.56  4523434.50    -4.95  -127.00  0.0005   0.25
0  8828.62  4543414.50    -3.05  0        0.0021   0.61

And assuming you are not using any external libraries, the following will read the data into a list of strings, omitting the undesirable lines. You can feed these lines into any other function you choose; I call print merely to demonstrate. N.B.: the fifth column has index 4, since list indices are zero-based.

$ cat data.py 
#!/usr/bin/env python

print("1. Delete the rows which have '0' as a value on 5th column:")

def zero_in_fifth(row):
    # Compares the fifth field (index 4) as text, so only a literal '0' matches.
    return row.split()[4] == '0'

with open('./data.txt') as f:
    required_rows = [row for row in f if not zero_in_fifth(row)]
print(''.join(required_rows))

print('2. Choose the range (i.e. remove the rows which have values between -50 and 30 on 5th column):')

def should_ignore(row):
    # True when the fifth field lies inside the closed range [-50, 30].
    return -50 <= float(row.split()[4]) <= 30

with open('./data.txt') as f:
    required_rows = [row for row in f if not should_ignore(row)]
print(''.join(required_rows))

When you run this you will get:

$ python data.py 
1. Delete the rows which have '0' as a value on 5th column:
0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12
0  7028.56  4523434.50    -4.95  -137.00  0.0005   0.25
0  4028.44  4544434.50    -6.76  -107.00  0.0002   0.12
0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39
0  7028.56  4523434.50    -4.95  -127.00  0.0005   0.25

2. Choose the range (i.e. remove the rows which have values between -50 and 30 on 5th column):
0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12
0  7028.56  4523434.50    -4.95  -137.00  0.0005   0.25
0  4028.44  4544434.50    -6.76  -107.00  0.0002   0.12
0  7028.56  4523434.50    -4.95  -127.00  0.0005   0.25
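If you need this for other columns or ranges, the same idea can be wrapped in a small generator. The following is only a sketch: the function name `rows_outside` and its parameters are hypothetical, and it assumes whitespace-separated numeric fields as in `data.txt`:

```python
def rows_outside(lines, column, low, high):
    """Yield lines whose value in `column` is NOT within [low, high]."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if not (low <= float(fields[column]) <= high):
            yield line

# Assumption: a few sample lines standing in for the contents of data.txt.
sample = [
    "0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12\n",
    "0  4028.50  3455014.50    -5.86  0        0.0003   0.39\n",
    "\n",
    "0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39\n",
    "0  7028.56  4523434.50    -4.95  -127.00  0.0005   0.25\n",
]

kept = list(rows_outside(sample, 4, -50, 30))
print(''.join(kept))
```

Because it accepts any iterable of lines, an open file object can be passed in place of `sample`, and passing `0, 0` as the bounds removes only the rows whose value is exactly zero.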
johnsyweb
  • Don't you think `lambda`s are overkill for this? – agf Aug 09 '11 at 00:51
  • What's the point of naming a lambda function? That's just wrong. Just use the `def` keyword. – JBernardo Aug 09 '11 at 01:18
  • @JBernardo: A named function would probably be better, you're right. I just extracted the `lambda` from the generator expression to reduce the line-length. – johnsyweb Aug 09 '11 at 01:29
  • As said above, that's not the place to use `lambda`s. Wrong on many levels. Try reading [that](http://stackoverflow.com/questions/1892324/why-program-functionally-in-python/1892614#1892614)... –  Aug 09 '11 at 01:29
  • @Franklin: Thanks to you and JBernardo, that is a very interesting read. I've updated my answer accordingly. Perhaps the voters will follow suit? – johnsyweb Aug 09 '11 at 01:40
  • I get this error: AttributeError: 'numpy.ndarray' object has no attribute 'split', line 10, in required_rows = (row for row in data.split('\n') if not should_ignore(row)) – Chad Aug 09 '11 at 02:10
  • @Chad: Erm... My answer does not use `numpy.ndarray`, or even `numpy`. Is that how you're storing `data` (I had taken it to be a `string`)? – johnsyweb Aug 09 '11 at 03:28
  • Actually, the data comes from a netCDF file, so I load it with the netCDF4 module – Chad Aug 09 '11 at 06:14
  • @Chad: Then you probably want to remove the calls to `.split()` and `.split('\n')`. – johnsyweb Aug 09 '11 at 06:18
  • @Chad: Or use the `numpy` answer ;-) – johnsyweb Aug 09 '11 at 06:48
  • @Johnsyweb: I loaded the data from a text file via pylab.loadtxt and tried your code, but I got the same error on the same line. What am I missing here? – Chad Aug 09 '11 at 15:29
  • You are missing information about the shape of your data and the third-party libraries you are using from your question. I have re-done my answer to show code that will work without any such libraries. – johnsyweb Aug 09 '11 at 22:18