0

I am very much a novice at Python, but learning. I have been tasked at work to take a CSV of data (2500 rows) in the following format (as we cant figure out how to do it in Excel):

 RefNumber      Reviewer 1  Reviewer 2  Reviewer 3  Reviewer 4  Reviewer 5
  9676/2            6           3           2
  0526/4            6           3           1           5           1
  1842/1            5           3           1           5   
  2693/3            5           5           1           2   
  2515/1            6           3           1           5           3
  2987/1            4           1           3
  3841/1            4           3           1 
  3402/1            4           3           1           5   

And produce a CSV with each average of all the permutations of numbers that you could get from each row (minimum of 3).

i.e.

3841/1 above would produce the tuple of {4,3,1}, and an average of 2.7

3402/1 above would produce the tuples of {4,3,1}, {4,3,1,5}, {3,1,5},{4,1,5} etc with an average of 2.7, 3.3, 3, 3.3 etc.

I am wracking my brain trying to think of the best way of doing this, as I also need to know of each average, how many numbers in the tuple did it contain i.e. {4,3,1} would produce an average of 2.7 and the count of numbers of that tuple is 3.

Essentially what I want to produce is this:

RefNumber      Avg 1     Avg 2       Avg 3       Avg 4   Avg 5
  3841/1        2.7         
  3402/1        2.7       3.3           3         3.5   

But I guess to show the count of the numbers in the tuple, I could run it 9 times (there is a maximum of 12 reviews) and just have each datasheet on its own tab.

I technically also need the standard deviation of each tuple and the range of scores, but this is already going wayyyyy past my expertise so I guess I can maybe drop that or do it manually somehow.

Any idea on where to start with this?

Nicholas
  • 3,517
  • 13
  • 47
  • 86
  • I think `subset` is more appropriate than `permuatation`. The average of a permutated set is constant. – handle May 08 '17 at 13:40
  • Start with reading the data, e.g. with https://docs.python.org/3/library/csv.html – handle May 08 '17 at 13:43
  • Would subsets work? I thought a subset would just produce a set of all the numbers on that row i.e. if the row contained 8 numbers, it would be a subset of 8 numbers whereas I would want all the combinations between 3 and 8? – Nicholas May 08 '17 at 13:52
  • Ahh, yes, I dont want duplications i.e. {4,3,1} is the same as {1,3,4} so I only want one of them. Hmmmm, back to the drawing board :/ – Nicholas May 08 '17 at 14:01
  • Exactly, that is one combination (set) but two permutations (different order of elements). – handle May 08 '17 at 14:02

3 Answers3

3

You can use csv module to read through csv and extract the data and the itertools module to get all the combinations. see if its doing the job. Also I left the average values as it is but I see you are just working with 1 decimal point which you can easily get by rounding off the results. Guess you can save the result now.

from itertools import combinations as cb 
import csv
with open("test.csv") as f:
    reader=csv.reader(f)
    next(reader, None)  # skip header
    data=[filter(None,i) for i in reader]

def avgg(x):
    ll=[float(i) for i in x[1:]] #take review no and convert to float
    n=len(ll)
    avg_list=[x[0]]  #start result list with ref no.
    for i in range(3,n+1):
        for j in cb(ll,i):
            # print(j)  #see the combination
            avg_list.append(sum(j)/i)
    return avg_list

for x in data:
    print(avgg(x))
Eular
  • 1,707
  • 4
  • 26
  • 50
  • Thank you ever so much, that works perfectly! I need to think of a way of knowing which numbers apply to how many numbers in the tuple, but I guess I can add some text to the for loop! :)..... honestly, cant thank you enough. I had planned to spend a week or two on this! – Nicholas May 08 '17 at 14:38
  • 1
    You're welcome, now I guess you can easily get the standard deviation and other result from the combinations. – Eular May 08 '17 at 14:48
1

I upvoted the last answer, but I'd thought I show you an example that keeps everything in the DataFrame

data = """RefNumber, Reviewer 1, Reviewer 2,Reviewer 3,Reviewer 4,Reviewer 5
9676/2,6,3,2,,
0526/4,6,3,1,5,1
1842/1,5,3,1,5,
2693/3,5,5,1,2,
2515/1,6,3,1,5,3
2987/1,4,1,3,,
3841/1,4,3,1,,
3402/1,4,3,1,5,
"""

import pandas
import itertools
import StringIO
import numpy

buffer = StringIO.StringIO(data)
df = pandas.read_csv(buffer, index_col=0)

# EVERYTHING ABOVE IS MOSTLY SETUP CODE FOR THE EXAMPLE
def get_combos(items, lower_bound=3):
    """
    Return all combinations of values of size lower_bound
    for items
    """
    usable = items.dropna()
    combos = list()
    n_combos = range(lower_bound, len(usable) + 1)
    for r in n_combos:
        combos += list(itertools.combinations(usable, r))
    return combos

df['combos'] = df.apply(get_combos, axis=1)
df['means'] = df['combos'].map(lambda items: [numpy.mean(x) for x in items])
aquil.abdullah
  • 3,059
  • 3
  • 21
  • 40
1

Related Q&A regarding subset generation:

Community
  • 1
  • 1
handle
  • 5,859
  • 3
  • 54
  • 82