Use Python to select rows with a particular range of values in one column

Question

I know this is simple, but I'm a new user to Python so I'm having a bit of trouble here. I'm using Python 3 by the way.

I have multiple files that look something like this:

Name Date Age Sex Color
Ray  May  25.1 M  Gray
Alex Apr  22.3 F  Green
Ann  Jun  15.7 F  Blue

(Pretend this is tab delimited. I should add that the real file will have about 3,000 rows and 17-18 columns)

What I want to do is select all the rows which have a value in the age column which is less than 23.

In this example, the output would be:

Name Date Age Sex Color
Alex Apr  22.3 F  Green
Ann  Jun  15.7 F  Blue

Here's what I tried to do:

f = open("addressbook1.txt",'r')
line = f.readlines()
file_data =[line.split("\t")]
f.close()

for name, date, age, sex, color in file_data:
    if age in line_data < 23:
        g = open("college_age.txt",'a')
        g.write(line)
    else:
        h = open("adult_age.txt",'a')
        h.write(line)

Ideally, since I have 20-30 of these "addressbook" input files, I want this script to loop through them all and add every entry with an age under 23 to the same output file ("college_age.txt"). I really don't need to keep the other lines, but I didn't know what else to do with them.

This script, when I run it, generates an error.

AttributeError: 'list' object has no attribute 'split'

Then I change the third line to:

file_data=[line.split("\t") for line in f.readlines()]

And it no longer gives me an error, but simply does nothing at all. It just starts and then stops.

Any help? :) Remember I'm dumb with Python.

I should have added that my actual data contains decimals, not integers. I have edited the data above to reflect that.

You might want to check out [the `with` statement](http://docs.python.org/reference/compound_stmts.html#the-with-statement) for [opening files](http://docs.python.org/whatsnew/2.5.html#pep-343-the-with-statement). It's not only more pythonic and readable but handles closing for you, even when exceptions occur. — Gareth Latty, Apr 27 '12 at 21:58

Gareth Latty · Accepted Answer · 2012-04-27T22:40:52.077

5

The issue here is that you are calling readlines() twice: the first call reads the whole file, so nothing is left for the second call to read.

You can iterate directly over the file without using readlines() - in fact, this is the better way, as it doesn't read the whole file in at once.
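A minimal sketch of why the second readlines() call comes back empty. It assumes an in-memory io.StringIO as a stand-in for the real file, with sample rows borrowed from the question; the variable names are only illustrative.

```python
import io

# Stand-in for open("addressbook1.txt"); an assumption so the sketch runs anywhere.
f = io.StringIO("Name\tDate\tAge\nRay\tMay\t25.1\nAlex\tApr\t22.3\n")

first = f.readlines()   # reads every line and leaves the position at end of file
second = f.readlines()  # nothing is left to read, so this is an empty list

f.seek(0)  # rewind so the file can be read again
# Iterating over the file object yields one line at a time.
ages = [line.split("\t")[2].strip() for line in f][1:]  # [1:] drops the header row
```

Here `second` is empty, which is why a loop over it never runs its body.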

While you could do what you are trying to do by using str.split() as you have, the better option is to use the csv module, which is designed for the task.

import csv

# newline="" lets the csv module control line endings (avoids blank rows on Windows).
with open("addressbook1.txt", newline="") as infile, \
     open("college_age.txt", "w", newline="") as college, \
     open("adult_age.txt", "w", newline="") as adult:
    reader = csv.DictReader(infile, dialect="excel-tab")
    fieldnames = reader.fieldnames
    writer_college = csv.DictWriter(college, fieldnames, dialect="excel-tab")
    writer_adult = csv.DictWriter(adult, fieldnames, dialect="excel-tab")
    writer_college.writeheader()
    writer_adult.writeheader()
    for row in reader:
        # float() rather than int(): the ages in the sample data have decimals.
        if float(row["Age"]) < 23:
            writer_college.writerow(row)
        else:
            writer_adult.writerow(row)

So what are we doing here? First of all we use the with statement for opening files. It's not only more pythonic and readable but handles closing for you, even when exceptions occur.

Next we create a DictReader that reads rows from the file as dictionaries, automatically using the first row as the field names. We then make writers to write back to our split files, and write the headers in. Using the DictReader is a matter of preference. It's generally used more where you access the data a lot (and when you don't know the order of the columns), but it makes the code nice and readable here. You could, however, just use a standard csv.reader().
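For comparison, here is a rough sketch of the same filtering with a plain csv.reader(). It assumes an in-memory io.StringIO in place of the real input file (rows copied from the question), and `age_index` and `under_23` are illustrative names, not part of the answer's code.

```python
import csv
import io

# In-memory stand-in for the tab-delimited input file (sample rows from the question).
source = io.StringIO(
    "Name\tDate\tAge\tSex\tColor\n"
    "Ray\tMay\t25.1\tM\tGray\n"
    "Alex\tApr\t22.3\tF\tGreen\n"
    "Ann\tJun\t15.7\tF\tBlue\n"
)

reader = csv.reader(source, dialect="excel-tab")
header = next(reader)            # the first row holds the column names
age_index = header.index("Age")  # look the column up once instead of hard-coding 2

# Rows are plain lists here, so the age is fetched by position.
under_23 = [row for row in reader if float(row[age_index]) < 23]
```

The trade-off is that rows are addressed by position rather than by column name, so the header lookup has to be done by hand.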

Next we loop through the rows in the file, checking the age (which we convert to a number so we can do a numerical comparison) to know which file to write to. The with statement closes our files for us.

For multiple input files:

import csv

fieldnames = ["Name", "Date", "Age", "Sex", "Color"]
filenames = ["addressbook1.txt", "addressbook2.txt", ...]

# newline="" lets the csv module control line endings (avoids blank rows on Windows).
with open("college_age.txt", "w", newline="") as college, \
     open("adult_age.txt", "w", newline="") as adult:
    writer_college = csv.DictWriter(college, fieldnames, dialect="excel-tab")
    writer_adult = csv.DictWriter(adult, fieldnames, dialect="excel-tab")
    writer_college.writeheader()
    writer_adult.writeheader()
    for filename in filenames:
        with open(filename, newline="") as infile:
            reader = csv.DictReader(infile, dialect="excel-tab")
            for row in reader:
                # float() rather than int(): the ages in the sample data have decimals.
                if float(row["Age"]) < 23:
                    writer_college.writerow(row)
                else:
                    writer_adult.writerow(row)

We just add a loop in to work over multiple files. Please note that I also added a list of field names. Before I just used the fields and order from the file, but as we have multiple files, I figured it would be more sensible to do that here. An alternative would be to use the first file to get the field names.
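The alternative mentioned last, taking the field names from the first file, could be sketched roughly as follows. The function name `split_by_age` and its parameters are hypothetical, and the sketch assumes every input file shares the header of the first one.

```python
import csv

def split_by_age(filenames, college_path, adult_path, cutoff=23):
    """Split rows from several tab-delimited files by the Age column."""
    # Read only the header of the first file to learn the column names.
    with open(filenames[0], newline="") as first:
        fieldnames = csv.DictReader(first, dialect="excel-tab").fieldnames

    with open(college_path, "w", newline="") as college, \
         open(adult_path, "w", newline="") as adult:
        writer_college = csv.DictWriter(college, fieldnames, dialect="excel-tab")
        writer_adult = csv.DictWriter(adult, fieldnames, dialect="excel-tab")
        writer_college.writeheader()
        writer_adult.writeheader()
        for filename in filenames:
            with open(filename, newline="") as infile:
                for row in csv.DictReader(infile, dialect="excel-tab"):
                    # Pick the output by age; float() copes with decimal ages.
                    target = writer_college if float(row["Age"]) < cutoff else writer_adult
                    target.writerow(row)
```

With 20-30 inputs, the list of names could be built with something like `sorted(glob.glob("addressbook*.txt"))` after `import glob`; that pattern is a guess at how the files are named.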

edited Apr 27 '12 at 22:40

answered Apr 27 '12 at 22:07

Gareth Latty


Ok, this csv looks like a natural tool. But how would the `if int(row["Age"]) < 23:` line change since my data actually has decimals? I've updated my original post to reflect that, sorry for not adding that in earlier. – Brandon Apr 27 '12 at 22:15
@Brandon Just change it to `float(row["Age"])`. It's a floating point number, so that's what you need to store. – Gareth Latty Apr 27 '12 at 22:16
Ok this is fantastic. 2 things that would be a little helpful: For some reason, it adds an extra line break between each line in the outputs. Secondly, is there some way that I could easily adjust this to loop this through a bunch of files identical in structure to this but add each successive output to the file (so only one college_age file and one adult_age file but that has data from multiple inputs all concatenated there). Does my question make sense or is this too hard to do? – Brandon Apr 27 '12 at 22:27
@Brandon I added an example for multiple inputs. I do not, however, get your behaviour of extra lines between rows. As I can't reproduce this behaviour, I can't really say why it's happening. – Gareth Latty Apr 27 '12 at 22:44
@Brandon: At a guess, the newlines are because you're opening the file in mode `w`. The `csv` module has a quirk where you need to open the file in `wb` mode on Windows. (Lattyware may not have had this issue if he's on Linux.) – Li-aung Yip Apr 28 '12 at 11:26
@Li-aungYip Excellent catch, I am indeed under Linux. – Gareth Latty Apr 28 '12 at 11:26
It's bitten me before and the symptoms sounded familiar. (why on earth does the `csv` module consider a csv file to be a binary format?!) Also, congrats on 5k rep. ;) – Li-aung Yip Apr 28 '12 at 11:29
@Li-aungYip: Don't blame the csv module. It's a quirk of the CSV format that the line terminator is specified to be CR LF. This means that if you don't open the output file in binary mode, the Windows runtime will translate that CR LF into CR CR LF. – John Machin Apr 28 '12 at 11:33
Ahh. That makes sense, I will open it in wb mode then. – Brandon Apr 29 '12 at 01:31
I tried to use this one but came up with this error: TypeError: 'str' does not support the buffer interface – Brandon Apr 30 '12 at 18:41
@Brandon You need to [add `newline=''` to `open()` under Python 3.x](http://stackoverflow.com/questions/7200606/python3-writing-csv-files). – Gareth Latty Apr 30 '12 at 18:43
@Lattyware I tried that but now got ValueError: binary mode doesn't take a newline argument – Brandon Apr 30 '12 at 18:50
@Brandon Try not making it binary, but adding in the newline option. – Gareth Latty Apr 30 '12 at 19:11
@Lattyware Ah hah, I figured it out. I changed the open from: open('blahblah', 'wb', newline='') to open('blahblah', 'w', newline=''). Thanks again for all your help! Ah...for whatever reason I didn't see your comment until now :O – Brandon Apr 30 '12 at 19:16
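To make the outcome of this thread concrete, here is a small sketch assuming Python 3. An io.StringIO created with newline="" stands in for a file opened with open(path, "w", newline=""), so no newline translation is applied to what the csv writer emits. The names `buffer` and `written` are illustrative.

```python
import csv
import io

# io.StringIO(newline="") mimics open(path, "w", newline=""):
# whatever the csv writer emits is stored without newline translation.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, dialect="excel-tab")
writer.writerow(["Alex", "Apr", "22.3", "F", "Green"])

written = buffer.getvalue()  # the excel-tab dialect ends each row with "\r\n"
```

Because the writer already ends each row with "\r\n", a text-mode file on Windows opened without newline='' turns the "\n" into "\r\n" again, and the result shows up as a blank line after every row.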

score 0 · Answer 2 · answered Apr 27 '12 at 22:04

0

I think it is better to use the csv module for reading such files: http://docs.python.org/library/csv.html

answered Apr 27 '12 at 22:04

marwinXXII


glglgl · Answer 3 · 2012-04-28T13:12:30.950

-3

I think you mean:

with open("addressbook1.txt", 'r') as f:
    # with automatically closes
    header = next(f)  # keep the header row out of the age comparison
    file_data = ((line, line.split("\t")) for line in f)
    with open("college_age.txt", 'w') as g, open("adult_age.txt", 'w') as h:
        g.write(header)
        h.write(header)
        for line, (name, date, age, sex, color) in file_data:
            if float(age) < 23:  # float(), since the ages have decimals
                g.write(line)
            else:
                h.write(line)

It might look as if the file data is traversed several times, but file_data is a generator expression: it hands out the next line of the file only when asked, and the for loop is what asks. Each item the loop retrieves is produced on request, as a tuple holding the complete line (for copying) and its split components (for testing).
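One way to check the single-pass claim is to count how many lines have been pulled from the file at each stage. This is only a sketch: `counting_lines` is a hypothetical helper written for the demonstration, and io.StringIO stands in for the real file.

```python
import io

def counting_lines(lines, counter):
    """Hypothetical helper: yield lines unchanged while counting how many were read."""
    for line in lines:
        counter["read"] += 1
        yield line

f = io.StringIO("Ray\tMay\t25.1\tM\tGray\nAlex\tApr\t22.3\tF\tGreen\n")
counter = {"read": 0}

# Building the generator expression reads nothing from the file yet.
file_data = ((line, line.split("\t")) for line in counting_lines(f, counter))
read_before = counter["read"]

# Consuming the generator pulls each line through exactly once.
ages = [float(fields[2]) for line, fields in file_data]
read_after = counter["read"]
```

`read_before` stays at zero and `read_after` equals the number of lines, so the file is walked once, at the point where the loop consumes the generator.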

An alternative could be

file_data = ((line, line.split("\t")) for line in iter(f.readline, ''))

This is closer to readlines() than iterating over the file. Because readline() behaves slightly differently behind the scenes from iteration over the file, it might be necessary in some cases.

(If you don't like functional programming, you could just as well write a generator function that calls readline() manually until an empty string is returned.

And if you don't like nested generators at all, you can do

with open("addressbook1.txt", 'r') as f, open("college_age.txt", 'w') as g, open("adult_age.txt", 'w') as h:
    header = next(f)  # keep the header row out of the age comparison
    g.write(header)
    h.write(header)
    for line in f:
        name, date, age, sex, color = line.split("\t")
        if float(age) < 23:  # float(), since the ages have decimals
            g.write(line)
        else:
            h.write(line)

which does exactly the same.)
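The generator function mentioned in the parenthetical might look something like this. The name `read_lines` is made up for the sketch, and io.StringIO again stands in for the opened file.

```python
import io

def read_lines(f):
    """Yield lines by calling readline() until it returns an empty string."""
    while True:
        line = f.readline()
        if line == "":  # readline() returns "" only at end of file
            return
        yield line

f = io.StringIO("Ray\tMay\t25.1\tM\tGray\nAnn\tJun\t15.7\tF\tBlue\n")
file_data = [(line, line.split("\t")) for line in read_lines(f)]
```

A blank line in the middle of the file comes back as "\n" rather than "", so it does not end the loop early.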

edited Apr 28 '12 at 13:12

answered Apr 27 '12 at 22:08

glglgl


``if age in line_data < 23:``? What? – Gareth Latty Apr 27 '12 at 22:15
But everything else was correct. Was that really worth a downvote?! – glglgl Apr 28 '12 at 06:02
-1 You traverse the data THREE times instead of ONE. That is worth two downvotes. – John Machin Apr 28 '12 at 07:52
@JohnMachin But it is by far better than the original one. My intention was to help at that particular problem, not to provide a perfect solution... – glglgl Apr 28 '12 at 10:54
@glglgl: I was seeking avoidance of gross waste, not perfection. Your revised version traverses the file TWICE instead of ONCE. That is better but still wasteful and certainly not "perfect". Your `for line in iter(f.readline, '')` is not a reasonable alternative to the common and simple idiom `for line in f`; it's an obfuscation. And you are missing a `)` – John Machin Apr 28 '12 at 11:59
@JohnMachin I do not see the "twice". I use a generator expression, not a list comprehension. – glglgl Apr 28 '12 at 12:40
`readline()` does, behind the scenes, behave differently from iteration over the file. That's why I provided the alternative. – glglgl Apr 28 '12 at 12:43
