1

I am working with datasets stored in large text files. For the analysis I am carrying out, I open the files, extract parts of the dataset and compare the extracted subsets. My code works like so:

from math import ceil

with open("seqs.txt","rb") as f:
    f = f.readlines()

assert type(f) == list, "ERROR: file object not converted to list"

fives = int( ceil(0.05*len(f)) ) 
thirds = int( ceil(len(f)/3) )

## top/bottom 5% of dataset
low_5=f[0:fives]
top_5=f[-fives:]

## top/bottom 1/3 of dataset
low_33=f[0:thirds]
top_33=f[-thirds:]

## Write lists to file
# top-5
with open("high-5.out","w") as outfile1:
   for i in top_5:
       outfile1.write("%s" %i)
# low-5
with open("low-5.out","w") as outfile2:
    for i in low_5:
        outfile2.write("%s" %i)
# top-33
with open("high-33.out","w") as outfile3:
    for i in top_33:
        outfile3.write("%s" %i)
# low-33        
with open("low-33.out","w") as outfile4:
    for i in low_33:
        outfile4.write("%s" %i)

I am trying to find a cleverer way of automating the process of writing the lists out to files. In this case there are only four, but in future cases I may end up with as many as 15-25 lists, so I would like a function to take care of this. I wrote the following:

def write_to_file(*args):
    for i in args:
        with open(".out", "w") as outfile:
            outfile.write("%s" %i)

but the resulting file only contains the final list when I call the function like so:

write_to_file(low_33,low_5,top_33,top_5)

I understand that I have to define an output file for each list (which I am not doing in the function above), I'm just not sure how to implement this. Any ideas?

mdml
Spyros
  • 1
    Don't forget to mark an answer as correct so that people looking at this question in the future will know what worked best! – NDevox May 28 '15 at 13:59
  • Going through all answers to come up with one, so many of them work out – Spyros May 28 '15 at 14:19
  • @PM 2Ring: Agreed, very nice to have so much input - informative for a python novice like me. Will not forget to mark an answer as accepted so that the query registers as "resolved". – Spyros May 28 '15 at 16:47

5 Answers

1

You could have one output file per argument by incrementing a counter for each argument. For example:

def write_to_file(*args):
    for index, lines in enumerate(args):
        with open("{}.out".format(index+1), "w") as outfile:
            # Write item by item; "%s" % lines would write the list's repr
            for line in lines:
                outfile.write("%s" % line)

The example above will create output files "1.out", "2.out", "3.out", and "4.out".

Alternatively, if you had specific names you wanted to use (as in your original code), you could do something like the following:

def write_to_file(args):
    for name, data in args:
        with open("{}.out".format(name), "w") as outfile:
            # Write item by item; "%s" % data would write the list's repr
            for line in data:
                outfile.write("%s" % line)

args = [('low-33', low_33), ('low-5', low_5), ('high-33', top_33), ('high-5', top_5)]
write_to_file(args)

which would create output files "low-33.out", "low-5.out", "high-33.out", and "high-5.out".
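As a rough sketch of how the named variant might scale to many lists, the function below also takes an output directory and returns the paths it wrote. The name `write_named_lists`, the `directory` and `suffix` parameters, and the sample data are all hypothetical, and the lists are assumed to hold text strings that still carry their trailing newlines:

```python
import os
import tempfile

def write_named_lists(named_lists, directory=".", suffix=".out"):
    """Write each (name, lines) pair to <directory>/<name><suffix>."""
    written = []
    for name, lines in named_lists:
        path = os.path.join(directory, name + suffix)
        with open(path, "w") as outfile:
            # writelines() emits each string as-is, so lines that still
            # end in a newline are reproduced unchanged.
            outfile.writelines(lines)
        written.append(path)
    return written

# Hypothetical stand-in data; in the question these would be slices of f.
low_5 = ["a\n", "b\n"]
top_5 = ["y\n", "z\n"]

out_dir = tempfile.mkdtemp()
paths = write_named_lists([("low-5", low_5), ("high-5", top_5)], directory=out_dir)
```

Returning the paths is only a convenience here; it makes it easy to check or log what was written when the number of lists grows.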

mdml
1

Make your variable names match your filenames and then use a dictionary to hold them instead of keeping them in the global namespace:

# f, fives and thirds are the names defined in the question
data = {'high_5': f[-fives:],
        'low_5': f[:fives],
        'high_33': f[-thirds:],
        'low_33': f[:thirds]}

for key in data:
    with open('{}.out'.format(key), 'w') as output:
        for i in data[key]:
            output.write(i)

This keeps your data in a single, easy-to-use place, and as long as you want to apply the same actions to every list you can keep using the same pattern.

As mentioned by PM 2Ring below, it is advisable to use underscores (as you do in the variable names) instead of dashes (as you do in the filenames), because you can then pass the dictionary keys as keyword arguments to a writing function:

write_to_file(**data)

This would equate to:

write_to_file(low_5=f[:fives], high_5=f[-fives:],...) # and the rest of the data

From this you could use one of the functions defined by the other answers.
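A minimal sketch of such a keyword-argument version, assuming the lists hold text strings. The leading `directory` parameter and the sample data are hypothetical additions for illustration; only the `**kwargs` idea comes from the discussion here:

```python
import os
import tempfile

def write_to_file(directory, **named_lists):
    """Write each keyword argument to <directory>/<keyword>.out."""
    for name, lines in sorted(named_lists.items()):
        with open(os.path.join(directory, name + ".out"), "w") as output:
            output.writelines(lines)

# Hypothetical stand-in for the list read from the input file.
f = ["l1\n", "l2\n", "l3\n", "l4\n"]
data = {"low_5": f[:1], "high_5": f[-1:]}

out_dir = tempfile.mkdtemp()
write_to_file(out_dir, **data)                        # unpack an existing dict
write_to_file(out_dir, low_33=f[:2], high_33=f[-2:])  # or spell the names out
```

Both calls go through the same code path, which is why the keys need to be valid identifiers: `high-5` could be a dictionary key but not a keyword.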

NDevox
  • 2
    I'd be tempted to use the underscore versions for the file names / dict keys. That way, you can put the processing loop into a function like `write_to_file(**kwargs)`, which can be called like `write_to_file(**data)` but also like `write_to_file(low_5=f[:fives], high_5=f[-fives:])`. You can't do that with key names like `high-5`, since only valid identifier names can be used for the keywords in keyword args. – PM 2Ring May 28 '15 at 13:19
  • That's definitely a good point, and in retrospect I would do the same. – NDevox May 28 '15 at 13:23
  • Haha, I should update it for knowledge sake... Done. – NDevox May 28 '15 at 13:42
  • Thanks all for the answers, they all did more or less what I was hoping to accomplish! – Spyros Jun 01 '15 at 14:57
1

Don't try to be clever. Instead, aim for code that is readable and easy to understand. You can group repeated code into a function, for example:

from math import ceil

def save_to_file(data, filename):
    # Text mode, to match how the input file is opened below
    with open(filename, 'w') as f:
        for item in data:
            f.write('{}'.format(item))

with open('data.txt') as f:
    numbers = list(f)

five_percent = int(len(numbers) * 0.05)
thirty_three_percent = int(ceil(len(numbers) / 3.0))
# Why not: thirty_three_percent = int(len(numbers) * 0.33)
save_to_file(numbers[:five_percent], 'low-5.out')
save_to_file(numbers[-five_percent:], 'high-5.out')
save_to_file(numbers[:thirty_three_percent], 'low-33.out')
save_to_file(numbers[-thirty_three_percent:], 'high-33.out')

Update

If you have quite a number of lists to write, then it makes sense to use a loop. I suggest having two functions, save_top_n_percent and save_low_n_percent, to help with the job. They contain a little duplicated code, but separating them into two functions keeps each one clear and easy to understand.

def save_to_file(data, filename):
    # Text mode, to match how the input file is opened below
    with open(filename, 'w') as f:
        for item in data:
            f.write(item)

def save_top_n_percent(n, data):
    n_percent = int(len(data) * n / 100.0)
    save_to_file(data[-n_percent:], 'top-{}.out'.format(n))

def save_low_n_percent(n, data):
    n_percent = int(len(data) * n / 100.0)
    save_to_file(data[:n_percent], 'low-{}.out'.format(n))

with open('data.txt') as f:
    numbers = list(f)

for n_percent in [5, 33]:
    save_top_n_percent(n_percent, numbers)
    save_low_n_percent(n_percent, numbers)
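One detail that may be worth guarding against in the helpers above: when the computed count truncates to 0 (a small file and a small percentage), `data[:0]` is empty but `data[-0:]` is the same as `data[0:]`, so the "top" file would receive the whole list. A possible guard, with `count_for_percent` and `top_slice` as hypothetical helper names:

```python
def count_for_percent(n, size):
    """Number of items that make up n percent of size, rounded down."""
    return int(size * n / 100.0)

def top_slice(data, count):
    """Last `count` items of data; an empty list when count is 0."""
    # data[-0:] is the same as data[0:], i.e. the whole list, so the
    # zero case needs its own branch.
    return data[-count:] if count > 0 else []

# Hypothetical ten-line dataset.
numbers = ["1\n", "2\n", "3\n", "4\n", "5\n", "6\n", "7\n", "8\n", "9\n", "10\n"]
count = count_for_percent(5, len(numbers))  # 10 * 5 / 100.0 = 0.5, truncated to 0
```

Rounding up with `ceil`, as the question does, is another way to avoid the zero case for any non-empty input.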
Hai Vu
  • Sure, it's _not_ a good idea to write code that's so clever that you can't read it or debug it. OTOH, it _is_ a good idea to write your code in such a way that you don't need to be [overly repetitive](http://en.wikipedia.org/wiki/Don%27t_repeat_yourself). Of course, what counts as _overly_ repetitive can be a matter of personal taste, but I think that in this case processing a list (or a dict) using a `for` loop would be preferable to multiple explicit function calls. True, the code in the question only requires 4 calls, but Spyros does mention that he may end up with 15-25 lists to write. – PM 2Ring May 28 '15 at 13:34
  • @HaiVu right, when I said clever I meant a way that would be less repetitive and more generic for any amount of I/O operations I needed to execute. Thank you for the answer! – Spyros May 28 '15 at 14:00
  • 1
    @PM2Ring - You are right. I missed part of the original code. For that, I updated my solution. – Hai Vu May 28 '15 at 14:14
0

On this line you are opening up a file called .out each time and writing to it.

with open(".out", "w") as outfile:

You need to make the ".out" name unique for each i in args. You can achieve this by passing each argument as a list that contains the file name and the data.

def write_to_file(*args):
    for i in args:
        with open("%s.out" % i[0], "w") as outfile:
            # i[1] is the list; write it item by item rather than as a repr
            for line in i[1]:
                outfile.write("%s" % line)

And pass in arguments like so...

write_to_file(["low_33",low_33],["low_5",low_5],["top_33",top_33],["top_5",top_5])
Songy
  • OP is writing to an original file each time, none of them are called `'.out'`. – NDevox May 28 '15 at 14:29
  • @Scironic: The OP does open & write a file named '.out' multiple times in the 2nd code block, but in the final paragraph of the question he acknowledges that that's not the right thing to do. – PM 2Ring May 28 '15 at 14:34
0

You are creating a file called '.out' and overwriting it each time.

def write_to_file(*args):
    for i in args:
        filename = i + ".out"
        contents = globals()[i]
        # Open the per-variable filename, not the fixed ".out"
        with open(filename, "w") as outfile:
            outfile.write("%s" %contents)


write_to_file("low_33", "low_5", "top_33", "top_5")

https://stackoverflow.com/a/6504497/3583980 (variable name from a string)

This will create low_33.out, low_5.out, top_33.out, top_5.out and their contents will be the lists stored in these variables.

Community
riddler
  • 1
    Clever, but using `globals()` is rarely a good idea, since it impacts the modularity of the program. – PM 2Ring May 28 '15 at 13:37
  • @PM2Ring why would that be so? The variables are called within the scope of the module. Isn't defining a dictionary for the filenames achieving the same purpose? – riddler May 28 '15 at 13:43
  • I'm with @PM2Ring. There are many reasons not to use global variables. I have been coding for more than 20 years and have been bitten by this quite a few times. If you are asking why, you can google for **global variables bad** to find out. – Hai Vu May 28 '15 at 13:55
  • @riddler: It's almost always better to use your own dictionary, and avoid directly using `globals()`, if you can. Your approach works ok here because the variables are in the global scope, but what if the OP wants to make the program more modular and tries to move those variables into a function? Then they'll no longer be in the `globals()` dict. – PM 2Ring May 28 '15 at 14:51
  • 1
    @riddler: Also, although many code snippets on SO & elsewhere on the Net put variables & code into the global scope it's not really Best Practice: a proper Python program should avoid cluttering the global scope with variables & code - ideally, the global scope should only contain references to imported objects, function definitions, and constants; any global code should only be there to load values into those constants. And to call the `main()` function. :) – PM 2Ring May 28 '15 at 14:51