Eliminating rows with a specific value in a column using Python

Question

How can I delete the rows that have '0' as the value in the 5th column? Or, even better, can I choose a range (i.e. remove the rows whose 5th-column value is between -50 and 30)?

The data looks like this:

 0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12
 0  4028.50  3455014.50    -5.86  0        0.0003   0.39
 0  7028.56  4523434.50    -4.95  -137.00  0.0005   0.25
 0  8828.62  4543414.50    -3.05  0        0.0021   0.61
 0  4028.44  4544434.50    -6.76  -107.00  0.0002   0.12
 0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39
 0  7028.56  4523434.50    -4.95  -127.00  0.0005   0.25
 0  8828.62  4543414.50    -3.05  0        0.0021   0.61

`operator.itemgetter(4)`... then compare it. – JBernardo Aug 09 '11 at 01:15
@Chad: Did you get this working yet? – johnsyweb Aug 11 '11 at 22:41
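
A minimal sketch of the `operator.itemgetter(4)` suggestion from the comment above, assuming each row is a whitespace-separated string as in the question. The names `lines`, `fifth`, `nonzero` and `outside` are illustrative only and do not come from the thread.

```python
from operator import itemgetter

# Three sample rows in the question's format (whitespace-separated text).
lines = [
    "0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12",
    "0  4028.50  3455014.50    -5.86  0        0.0003   0.39",
    "0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39",
]

fifth = itemgetter(4)  # fetches index 4, i.e. the fifth column

# Keep rows whose fifth column is not zero.
nonzero = [line for line in lines if float(fifth(line.split())) != 0]

# Keep rows whose fifth column lies outside the closed range [-50, 30].
outside = [line for line in lines if not -50 <= float(fifth(line.split())) <= 30]

print(len(nonzero), len(outside))
```

Converting with `float` before comparing means `0`, `0.0` and `0.00` are all treated as zero, which a plain string comparison against `'0'` would not do.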

agf · Answer 1 · 2011-08-09T15:45:33.547

4

goodrows = [row for row in data if row.split()[4] != '0']

or

goodrows = [row for row in data if not (-50 <= float(row.split()[4]) <= 30)]

Edit:

If your data is actually in a NumPy array, which your comment seems to indicate even if your post didn't:

goodrows = [row for row in data if row[4] != 0]

or

goodrows = [row for row in data if not (-50 <= row[4] <= 30)]

should work. NumPy almost certainly has a built-in, vectorized way to do this, though.
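
One NumPy-native possibility is boolean-mask indexing. The following is only a sketch, assuming `data` is a 2-D float array with the question's column layout; the three sample rows and the names `col`, `nonzero_rows` and `outside_rows` are made up for illustration.

```python
import numpy as np

# Three sample rows in the question's column layout.
data = np.array([
    [0, 4028.44, 4544434.50, -6.76, -117.00, 0.0002, 0.12],
    [0, 4028.50, 3455014.50, -5.86,    0.00, 0.0003, 0.39],
    [0, 4028.50, 3455014.50, -5.86,  -11.00, 0.0003, 0.39],
])

col = data[:, 4]  # fifth column as a 1-D array

# A boolean mask selects whole rows: keep rows whose fifth column is non-zero.
nonzero_rows = data[col != 0]

# Keep rows whose fifth column lies outside the closed range [-50, 30].
outside_rows = data[(col < -50) | (col > 30)]

print(nonzero_rows.shape, outside_rows.shape)
```

Unlike the list comprehensions, the result here is still a NumPy array rather than a list of rows.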

edited Aug 09 '11 at 15:45

answered Aug 09 '11 at 00:30

agf


I've just tested this to see if they are identical: they're not. `int(row.split()[4])` `raise`s when it encounters `-117.00`. That may explain the -1... – johnsyweb Aug 09 '11 at 01:27
@Johnsyweb absolutely right, good catch. +1 to your answer. Note: I was not one of the downvoters. – agf Aug 09 '11 at 01:32
I get 'AttributeError: 'numpy.ndarray' object has no attribute 'split'' error with this one too. – Chad Aug 09 '11 at 15:33
OK, if it's already in an array, not in a list of strings in a file, just do `row[4]`. See my edit. Next time, make sure to say in your question if the data is in a NumPy array. We all assumed it was in a file in the format you posted. – agf Aug 09 '11 at 15:44

score 2 · Answer 2 · answered Aug 09 '11 at 01:12


You can use NumPy to do this quickly:

data="""
0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12
0  4028.50  3455014.50    -5.86  0        0.0003   0.39
0  7028.56  4523434.50    -4.95  -137.00  0.0005   0.25
0  8828.62  4543414.50    -3.05  0        0.0021   0.61
0  4028.44  4544434.50    -6.76  -107.00  0.0002   0.12
0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39
0  7028.56  4523434.50    -4.95  -127.00  0.0005   0.25
0  8828.62  4543414.50    -3.05  0        0.0021   0.61
"""
from io import StringIO
import numpy as np
d = np.loadtxt(StringIO(data))  # load the text into a 2-D NumPy array

print(d[d[:, 4] != 0])  # keep rows where column 5 != 0
print(d[(d[:, 4] < -50) | (d[:, 4] > 30)])  # keep rows where column 5 is outside [-50, 30]

answered Aug 09 '11 at 01:12

HYRY


I don't know if numpy is the right tool as it's not on std library... A list comprehension seems better – JBernardo Aug 09 '11 at 01:33
I got this error: File "", line 1, in File "/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/numpy/lib/npyio.py", line 796, in loadtxt items = [conv(val) for (conv, val) in zip(converters, vals)] ValueError: could not convert string to float: [[ – Chad Aug 09 '11 at 15:34
The program above can only parse numbers separated by whitespace. From the error message, it seems that you are using some other data format. – HYRY Aug 09 '11 at 21:33

johnsyweb · Answer 3 · 2011-08-09T22:16:07.810

1

Assuming your data is in a plain text file like this:

$ cat data.txt 
0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12
0  4028.50  3455014.50    -5.86  0        0.0003   0.39
0  7028.56  4523434.50    -4.95  -137.00  0.0005   0.25
0  8828.62  4543414.50    -3.05  0        0.0021   0.61
0  4028.44  4544434.50    -6.76  -107.00  0.0002   0.12
0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39
0  7028.56  4523434.50    -4.95  -127.00  0.0005   0.25
0  8828.62  4543414.50    -3.05  0        0.0021   0.61

And assuming you are not using any external libraries, the following will read the data into a list of strings, omitting the undesired lines. You can feed these lines into any other function you choose; I call `print` merely to demonstrate. N.B.: The fifth column has index 4, since list indices are zero-based.

$ cat data.py 
#!/usr/bin/env python

print("1. Delete the rows which have '0' as a value on 5th column:")

def zero_in_fifth(row):
    # True when the fifth field is exactly the string '0'
    return row.split()[4] == '0'

required_rows = [row for row in open('./data.txt') if not zero_in_fifth(row)]
print(''.join(required_rows))

print('2. Choose the range (i.e. remove the rows which have values between -50 and 30 on 5th column):')

def should_ignore(row):
    # True when the fifth field, as a number, lies in [-50, 30]
    return -50 <= float(row.split()[4]) <= 30

required_rows = [row for row in open('./data.txt') if not should_ignore(row)]
print(''.join(required_rows))

When you run this you will get:

$ python data.py 
1. Delete the rows which have '0' as a value on 5th column:
0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12
0  7028.56  4523434.50    -4.95  -137.00  0.0005   0.25
0  4028.44  4544434.50    -6.76  -107.00  0.0002   0.12
0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39
0  7028.56  4523434.50    -4.95  -127.00  0.0005   0.25

2. Choose the range (i.e. remove the rows which have values between -50 and 30 on 5th column):
0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12
0  7028.56  4523434.50    -4.95  -137.00  0.0005   0.25
0  4028.44  4544434.50    -6.76  -107.00  0.0002   0.12
0  7028.56  4523434.50    -4.95  -127.00  0.0005   0.25
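
If the same filter is needed for other columns or ranges, the two helpers could be folded into one function. This is only a sketch: `rows_outside` is a hypothetical name, and the temporary file stands in for `data.txt` so that the example is self-contained.

```python
import os
import tempfile

def rows_outside(path, column, low, high):
    """Return the lines of `path` whose `column` value is not in [low, high]."""
    with open(path) as handle:  # the file is closed when the block exits
        return [
            line for line in handle
            if line.strip() and not low <= float(line.split()[column]) <= high
        ]

# Three sample rows written to a temporary file (a stand-in for data.txt).
sample = (
    "0  4028.44  4544434.50    -6.76  -117.00  0.0002   0.12\n"
    "0  4028.50  3455014.50    -5.86  0        0.0003   0.39\n"
    "0  4028.50  3455014.50    -5.86  -11.00   0.0003   0.39\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

kept = rows_outside(tmp.name, 4, -50, 30)   # drop values between -50 and 30
zero_free = rows_outside(tmp.name, 4, 0, 0)  # drop values equal to zero
os.remove(tmp.name)

print("".join(kept), end="")
```

Passing `0, 0` as the range covers the first case as well, since the only value inside `[0, 0]` is zero.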

edited Aug 09 '11 at 22:16

answered Aug 09 '11 at 00:46

johnsyweb


Don't you think `lambda`s are overkill for this? – agf Aug 09 '11 at 00:51
What's the point of naming a lambda function? That's just wrong. Just use the `def` keyword. – JBernardo Aug 09 '11 at 01:18
@JBernardo: A named function would probably be better, you're right. I just extracted the `lambda` from the generator expression to reduce the line-length. – johnsyweb Aug 09 '11 at 01:29

As said above, that's not the place to use `lambda`s. Wrong on many levels. Try reading [that](http://stackoverflow.com/questions/1892324/why-program-functionally-in-python/1892614#1892614)... – Aug 09 '11 at 01:29
@Franklin: Thanks to you and JBernardo, that is a very interesting read. I've updated my answer accordingly. Perhaps the voters will follow suit? – johnsyweb Aug 09 '11 at 01:40
I get this error: AttributeError: 'numpy.ndarray' object has no attribute 'split', line 10, in required_rows = (row for row in data.split('\n') if not should_ignore(row)) – Chad Aug 09 '11 at 02:10
@Chad: Erm... My answer does not use `numpy.ndarray`, or even `numpy`. Is that how you're storing `data` (I had taken it to be a `string`)? – johnsyweb Aug 09 '11 at 03:28
Actually, the data comes from a NetCDF file, so I load it with the netCDF4 module. – Chad Aug 09 '11 at 06:14
@Chad: Then you probably want to remove the calls to `.split()` and `.split('\n')`. – johnsyweb Aug 09 '11 at 06:18
@Chad: Or use the `numpy` answer ;-) – johnsyweb Aug 09 '11 at 06:48

@Johnsyweb: I loaded the data from a text file via pylab.loadtxt and tried your code, but I got the same error on the same line. What am I missing here? – Chad Aug 09 '11 at 15:29
You are missing information about the shape of your data and the third-party libraries you are using from your question. I have re-done my answer to show code that will work without any such libraries. – johnsyweb Aug 09 '11 at 22:18
