0

I have a CSV of items and values, and a representation of it would look something like this:

foo, 569
bar, 9842
asdasd, 98
poiqweu, 7840
oiasd, 4
poeri, 145
sacodiw, 55
aosdwr, 855
9523, 60
a52sd, 5500
sdcw, 415
0932, 317

I want to export to three CSVs such that they receive items from the master CSV in the order: highest, lowest, next highest, next lowest, etc.

CSV1 should be:

bar, 9842
oiasd, 4
poiqweu, 7840
sacodiw, 55

And so on for the other two CSVs.

For bonus, what I really want to do is create three CSVs of 90 items each from a master of 270, such that each of the three are as close to the same sum of values as each other as possible. I assume there's a better way than my simplistic (and highly assumptive) method.

How would I go about this in my python script that I'm already using (which includes both CSV and pandas, if the latter is any help)?

Will
  • 11,276
  • 9
  • 68
  • 76
Xodarap777
  • 1,358
  • 4
  • 19
  • 42
  • 1) Make a list of tuples. 2) Sort by the second entry in the tuple (the number) 3) Write out i, -i, i+1,-i-1, i+2, -i-2, etc from your sorted tuple list – Cory Kramer Mar 07 '14 at 00:44
  • 2
    This sounds like the partition problem (but 3-way instead of 2-way). This is NP-complete (the only way to get the guaranteed best solution is to try all combinations), but there are various heuristics which should give a good approximation. Edit: see http://en.wikipedia.org/wiki/Partition_problem – Hugh Bothwell Mar 07 '14 at 00:45
  • Is it required that the three subsets are of the same size? – Steinar Lima Mar 07 '14 at 01:02
  • 2
    You might find [this page](http://en.wikipedia.org/wiki/Partition_problem#Approximation_algorithm_approaches) on the partition problem to be enlightening. – jme Mar 07 '14 at 01:33
  • Thank you, jme and Hugh: this led me to my own answer in only a few minutes using the greedy algorithm. – Xodarap777 Mar 07 '14 at 01:52

4 Answers4

3

You can use the following building blocks to solve the problem (it shouldn't be hard to take it from here):

Use pandas to load and sort:

import pandas as pd
original = pd.read_csv('test.csv', names=['name','count'])
df_highest_first  = df.sort(columns=['count'])
df_smallest_first = df.sort(columns=['count'], ascending=False)

largest_1 = df_largest['count'][0:-1:2].values
largest_2 = df_largest['count'][1:-2:2].values

smallest_1 = df_smallest['count'][0:-1:2].values
smallest_2 = df_smallest['count'][1:-2:2].values

and then izip to interleave elements between pairs of lists:

result = list(chain.from_iterable(izip(list_a, list_b)))
Community
  • 1
  • 1
user30802
  • 53
  • 4
2

Here is a partial solution;

reorder is functional, but as I am not very familiar with pandas I have just used Python's built-in data structures.

Edit: I've replaced partition_by_sum with a greedy implementation; it tries to find equal sums, but pays no attention to the number of items per bin. Suggestions for a better algorithm?

This should give you a pretty good head-start.

from collections import defaultdict
import csv

VALUE_COL = 1
NUM_BINS = 3

inp = [
    ["foo",      569],
    ["bar",     9842],
    ["asdasd",    98],
    ["poiqweu", 7840],
    ["oiasd",      4],
    ["poeri",    145],
    ["sacodiw",   55],
    ["aosdwr",   855],
    ["9523",      60],
    ["a52sd",   5500],
    ["sdcw",     415],
    ["0932",     317]
]

def load_csv(fname, **kwargs):
    with open(fname, "rb") as inf:
        for row in csv.reader(inf, **kwargs):
            yield row

def save_csv(fname, rows, **kwargs):
    with open(fname, "wb") as outf:
        csv.writer(outf, **kwargs).writerows(rows)

def make_index(lst, col):
    """
    Index a table by column;
    return list of column-values and dict of lists of rows having that value
    """
    values, index = [], defaultdict(list)
    for row in lst:
        val = row[col]
        values.append(val)
        index[val].append(row)
    return values, index

def min_index(lst):
    """
    Return index of min item in lst
    """
    return lst.index(min(lst))

def partition_by_sum(values, num_bins, key=None):
    """
    Try to partition values into lists having equal sum

    Greedy algorithm, per http://en.wikipedia.org/wiki/Partition_problem#Approximation_algorithm_approaches
    """
    values.sort(key=key, reverse=True)   # sort descending
    bins = [[] for i in xrange(num_bins)]
    sums = [0] * num_bins
    for value in values:
        index = min_index(sums)
        bins[index].append(value)
        sums[index] += value
    return bins

def reorder(lst, key=None):
    """
    Return [highest, lowest, second-highest, second-lowest, ...]
    """
    lst.sort(key=key, reverse=True)    # sort in descending order
    halflen = (len(lst) + 1) // 2      # find midpoint
    highs, lows = lst[:halflen], lst[halflen:][::-1]   # grab [high half descending], [low half ascending]
    lst[0::2], lst[1::2] = highs, lows                 # reassemble
    return lst

def main():
    # load data
    data = inp    # load_csv("input_file.csv")

    # solve partitioning    
    values, index = make_index(data, VALUE_COL)
    bins = partition_by_sum(values, NUM_BINS)

    # rearrange for output
    bins = [[index[val].pop() for val in reorder(bin)] for bin in bins]

    # write output
    for i,bin in enumerate(bins, 1):
        save_csv("output_file_{}.csv".format(i), bin)

if __name__=="__main__":
    main()
Hugh Bothwell
  • 55,315
  • 8
  • 84
  • 99
1

I would have taken this approach, given data with N rows:

  • Sort the input data descending.
  • Create 3 empty lists
  • Iterate over the sorted data, and add the current row to the list with lowest sum, unless this list already has N/3 or more entries

After reading the page regarding the partition problem on wikipedia, I see that this algorithm is an adaption of the greedy algorithm, the only exception is that I require all subsets to have the same length (if N % 3 == 0).

I wrote a simple code snippet that demonstrates it for you. I think this is better way of solving you problem than your proposed solution. As you can see from the output below, the first dataset contains the highest value, and the 3 lowest values. The solution you proposed would have given way bigger differences of the total sums.

import csv

class DataSet:
    def __init__(self, filename):
        self.total = 0
        self.data = []
        self.filename = filename

    def add(self, row):
        self.total += int(row[1])
        self.data.append(row)

    def write(self):
        with open(self.filename, 'wb') as ofile:
            writer = csv.writer(ofile)
            writer.writerows(self.data)

with open('my_data.csv') as ifile:
    data = sorted(csv.reader(ifile), key=lambda l: -int(l[1]))

subsets = DataSet('data_1.csv'), DataSet('data_2.csv'), DataSet('data_3.csv')

for row in data:
    sets = [k for k in subsets if len(k.data) < 4]
    min(sets, key=lambda x: x.total).add(row)

for k in subsets:
    print k.data, k.total
    k.write()

Output:

[['bar', ' 9842'], ['9523', ' 60'], ['sacodiw', ' 55'], ['oiasd', ' 4']] 9961
[['poiqweu', ' 7840'], ['0932', ' 317'], ['poeri', ' 145'], ['asdasd', ' 98']] 8400
[['a52sd', ' 5500'], ['aosdwr', ' 855'], ['foo', ' 569'], ['sdcw', ' 415']] 7339
Steinar Lima
  • 7,644
  • 2
  • 39
  • 40
  • I love seeing the same basic idea of what I pieced together with no coding knowledge - but written correctly. This is the kind of thing you just couldn't get in a class! The best part of this is that I can adapt this code much more easily, and I could replace the "4" in sets with a variable that splits whatever I need. – Xodarap777 Mar 07 '14 at 02:07
  • 1
    @Xodarap777 Good to hear, your coding practice will only get better with practice (and by reading the docs!). Please consider accepting this answer if it answers your question :) – Steinar Lima Mar 07 '14 at 02:09
0

jme and Hugh Bothwell linked me to the Partition Problem, where I could find the Greedy Algorithm, which I quickly adapted in my Python-2.7 for CS101 style of code:

import csv

inf = csv.reader(open('ACslist.csv', 'r'))
out1 = csv.writer(open('ACs1.csv', 'wb'))
out2 = csv.writer(open('ACs2.csv', 'wb'))
out3 = csv.writer(open('ACs3.csv', 'wb'))

firstrow = inf.next()
out1.writerow(firstrow)
out2.writerow(firstrow)
out3.writerow(firstrow)

sum1 = 0
sum2 = 0
sum3 = 0

count1 = 0
count2 = 0
count3 = 0


for row in inf:
    row[1] = int(row[1])
    if sum1 == 0:
        out1.writerow(row)
        count1 += 1
        sum1 += row[1]
    elif sum2 == 0:
        out2.writerow(row)
        count2 += 1
        sum2 += row[1]
    elif sum1 < sum2 and sum1 < sum3 and count1 < 90:
        out1.writerow(row)
        count1 += 1
        sum1 += row[1]
    elif sum2 < sum1 and sum2 < sum3 and count2 < 90:
        out2.writerow(row)
        count2 += 1
        sum2 += row[1]
    elif sum3 < sum2 and sum3 < sum1 and count3 < 90:
        out3.writerow(row)
        count3 += 1
        sum3 += row[1]
    elif count1 < 90:
        out1.writerow(row)
        count1 += 1
        sum1 += row[1]
    elif count2 < 90:
        out2.writerow(row)
        count2 += 1
        sum2 += row[1]

print sum1
print sum2
print sum3

My printouts came to this:

122413
122397
122399

Pretty darn close if I do say so myself!

To my very amateur eyes, it seems like a simpler solution. I'm sure I could have written it much more efficiently; if anyone wants to point out my flaws in style, I'd be happy for the help.

Xodarap777
  • 1,358
  • 4
  • 19
  • 42
  • 1
    I've adapted the same algorithm, please have a look at my answer as well :) – Steinar Lima Mar 07 '14 at 01:53
  • 1
    *"if anyone wants to point out my flaws in style"* -- there is a site: http://codereview.stackexchange.com/ where you could post the code that works (as far as you know) but that you would like to get a feedback. – jfs Mar 07 '14 at 01:54
  • Hey, this is how they teach you to use python in intro to programming classes. I have to just adapt from there... :) It may make you cry to see, but the results were perfect - 3 lists of 90 items almost exactly equal in sum. – Xodarap777 Mar 07 '14 at 02:04
  • Oh come on, really? Downvotes for trying to do it myself as a python newbie? It works for its purposes - considering that I know *nothing* of formal python, that's kinda neat, at least... – Xodarap777 Mar 07 '14 at 02:09
  • @Xodarap777 It's of course up to you, but as you state in your comment below my post, *"I love seeing the same basic idea of what I pieced together with no coding knowledge - but written correctly."*, you do prefer my version of yours. Why would you change the accepted answer from mine to yours? – Steinar Lima Mar 28 '14 at 14:36