2

i have this code:

import csv
import collections

def do_work():
      (data,counter)=get_file('thefile.csv')
      b=samples_subset1(data, counter,'/pythonwork/samples_subset3.csv',500)
      return

def get_file(start_file):

        with open(start_file, 'rb') as f:
            data = list(csv.reader(f))
            counter = collections.defaultdict(int)

            for row in data:
              counter[row[10]] += 1
            return (data,counter)

def samples_subset1(data,counter,output_file,sample_cutoff):

      with open(output_file, 'wb') as outfile:
          writer = csv.writer(outfile)
          b_counter=0
          b=[]
          for row in data:
              if counter[row[10]] >= sample_cutoff:
                 b.append(row) 
                 writer.writerow(row)
                 b_counter+=1
      return (b)

i recently started learning python, and would like to start off with good habits. therefore, i was wondering if you can help me get started to turn this code into classes. i dont know where to start.

Alex Gordon
  • 57,446
  • 287
  • 670
  • 1,062
  • this guy suggested it: http://stackoverflow.com/questions/3354218/python-proper-usage-of-global-variable/3354537#3354537 – Alex Gordon Jul 28 '10 at 15:42
  • 1
    While there are a lot of nitpicky style suggestions I could make, I agree that a class isn't really necessary for a program this small. – Kenan Banks Jul 28 '10 at 15:45
  • tiptych, can you please make those suggestions, i am very happy to hear them – Alex Gordon Jul 28 '10 at 15:46
  • 1
    Don't bother. Your code looks fine. That guy is right that classes are better than global variables. But there is nothing wrong with this code. – Jason Orendorff Jul 28 '10 at 15:50
  • 1
    @Jason: of course there is nothing wrong with this code it was written by SO! have a peak at the OP's history. – SilentGhost Jul 28 '10 at 15:59

2 Answers2

4

Per my comment on the original post, I don't think a class is necessary here. Still, if other Python programmers will ever read this, I'd suggest getting it inline with PEP8, the Python style guide. Here's a quick rewrite:

import csv
import collections

def do_work():
    data, counter = get_file('thefile.csv')
    b = samples_subset1(data, counter, '/pythonwork/samples_subset3.csv', 500)

def get_file(start_file):
    with open(start_file, 'rb') as f:
        counter = collections.defaultdict(int)
        data = list(csv.reader(f))

        for row in data:
            counter[row[10]] += 1

    return (data, counter)

def samples_subset1(data, counter, output_file, sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b = []
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row) 
                writer.writerow(row)

    return b

Notes:

  1. No one uses more than 4 spaces to indent ever. Use 2 - 4. And all your levels of indentation should match.
  2. Use a single space after the commas between arguments to functions ("F(a, b, c)" not "F(a,b,c)")
  3. Naked return statements at the end of a function are meaningless. Functions without return statements implicitly return None
  4. Single space around all operators (a = 1, not a=1)
  5. Do not wrap single values in parentheses. It looks like a tuple, but it isn't.
  6. b_counter wasn't used at all, so I removed it.
  7. csv.reader returns an iterator, which you are casting to a list. That's usually a bad idea because it forces Python to load the entire file into memory at once, whereas the iterator will just return each line as needed. Understanding iterators is absolutely essential to writing efficient Python code. I've left data in for now, but you could rewrite to use an iterator everywhere you're using data, which is a list.
Kenan Banks
  • 207,056
  • 34
  • 155
  • 173
  • this part does not work for me: for row in csv.reader(f):, it returns nothing – Alex Gordon Jul 28 '10 at 16:18
  • My fault - didn't notice you were returning the `data` variable. That changes things. – Kenan Banks Jul 28 '10 at 16:26
  • 1
    @l__: to get rid of the list cast, just do `data = csv.reader(f)`. Then, as well as passing `data` to `samples_subset1`, pass `f` and call `f.seek(0)` before iterating through it again. That alone should be a huge performance increase. – Katriel Jul 28 '10 at 16:43
  • katriealex, can u give me an example of how i would pass f.seek(0) ?? – Alex Gordon Jul 28 '10 at 16:46
  • File "C:\pythonwork\readthefile072810.py", line 6, in do_work b=samples_subset1(data,counter,'/pythonwork/samples_subset4.csv',500,f) File "C:\pythonwork\readthefile072810.py", line 23, in samples_subset1 f.seek(0) ValueError: I/O operation on closed file – Alex Gordon Jul 28 '10 at 16:49
  • what do i do with f in samples_subset1 – Alex Gordon Jul 28 '10 at 16:52
  • Oh, I see. Basically, what you did before was to read the file into a list in memory and then close the file. You don't have to read it all in at once -- but then, you can't close the file. The `with` statement automagically closes f when it terminates, which in this case is bad. I'll add a new post with a rewrite. – Katriel Jul 28 '10 at 18:06
2

Well, I'm not sure what you want to turn into a class. Do you know what a class is? You want to make a class to represent some type of thing. If I understand your code correctly, you want to filter a CSV to show only those rows whose row[ 10 ] is shared by at least sample_cutoff other rows. Surely you could do that with an Excel filter much more easily than by reading through the file in Python?

What the guy in the other thread suggested is true, but not really applicable to your situation. You used a lot of global variables unnecessarily: if they'd been necessary to the code you should have put everything into a class and made them attributes, but as you didn't need them in the first place, there's no point in making a class.

Some tips on your code:

  • Don't cast the file to a list. That makes Python read the whole thing into memory at once, which is bad if you have a big file. Instead, simply iterate through the file itself: for row in csv.reader(f): Then, when you want to go through the file a second time, just do f.seek(0) to return to the top and start again.

  • Don't put return at the end of every function; that's just unnecessary. You don't need parentheses, either: return spam is fine.

Rewrite

import csv
import collections

def do_work():
    with open( 'thefile.csv' ) as f:
        # Open the file and count the rows.
        data, counter = get_file(f)
        
        # Go back to the start of the file.
        f.seek(0)

        # Filter to only common rows.
        b = samples_subset1(data, counter, 
            '/pythonwork/samples_subset3.csv', 500)
   
     return b

def get_file(f):
    counter = collections.defaultdict(int)
    data = csv.reader(f)
    
    for row in data:
        counter[row[10]] += 1

    return data, counter

def samples_subset1(data, counter, output_file, sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b = []
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row) 
                writer.writerow(row)

    return b
Community
  • 1
  • 1
Katriel
  • 120,462
  • 19
  • 136
  • 170
  • thank you. excel is not the right tool for this job. vba is a piece of garbage compared to python :) – Alex Gordon Jul 28 '10 at 15:51
  • 3
    Agree 100%. What you've done so far coming from your original code is to create functions that nicely react only to what's given to them as inputs, and have an effect only to what they're going to return as values. This, in itself, is a Good Thing ;) Making this a class reintroduces the danger of manipulating variables in a less obvious fashion than you're doing now, while the complexity of the state isn't high enough that you'd benefit at the same time. – Nicolas78 Jul 28 '10 at 15:51
  • Are you sure Excel isn't the right tool for the job? I don't mean that you should write a macro in VBA (ugh!!!)! You're doing data filtering: you want to show row n only if `=COUNTIF(K1:K.,Kn)>=sample_cutoff`. That's built into Excel. – Katriel Jul 28 '10 at 16:05
  • im sure it is, as ive had lots of experience with it – Alex Gordon Jul 28 '10 at 16:07
  • Did you actually read what I wrote? I'll tell you: Excel does this, and it does it very simply. Step 1: make a new column in your CSV, with formula `=COUNTIF(K1:K.,Kn)>=sample_cutoff`. Step 2: Data->AutoFilter. Step 3: In the top of the new column, click the drop-down menu button and select TRUE. You're done, no Python involved. – Katriel Jul 28 '10 at 16:12
  • yes you are right if it were just that function then excel would be my choice, but there are a bunch of other steps i need ot do after this – Alex Gordon Jul 28 '10 at 16:13
  • btw my file is 100megs, it would be difficult to do anything with in excel with a file that big without crashing my computer – Alex Gordon Jul 28 '10 at 16:14
  • OK, sorry. Just wanted to make sure. – Katriel Jul 28 '10 at 16:40
  • please see comments to your suggestions above this questino – Alex Gordon Jul 28 '10 at 16:53