
I have N large lists of different lengths, where each value in a list represents the signal over a fixed window of length 25. That is, I take the average value of the signal every 25 seconds/bases/etc. and store that value in a list.

I do this for different experiments/devices that run for different lengths of time (all multiples of 25, btw).

E.g., list1 comes from a 1000-unit run, so it holds 1000/25 = 40 values; list2 comes from a 1025-unit run, with 1025/25 = 41 values; list3 comes from a 2525-unit run, with 2525/25 = 101 values; etc.

Now, for the sake of comparison, I'd like to rescale each list to the same number of bins, say 40.

In fact, list1resized will have length 40 and its values will not change, since 1000/40 = 25 exactly; list2resized will go from 41 values to 40; and list3resized will go from 101 values to 40 (i.e., all lists end up the same size).

And here comes the question: how would I resize each list to a fixed length of 40 by taking weighted averages over the appropriate bins?

An example will clarify the question.

list1 = [4.8, 6.9, ...]  #40 values for the 1000 run
list2 = [5.6, 7.8, 8.9, 13.4, ...] #41 values for the 1025 run
list3 = [4.1, 5.6, 10.3, 9.8, 40, 30, 21.4, 3, 2,...] #101 values for the 2525 run

Now, the resized lists should look like:

list1resized = [4.8*25/25, 6.9*25/25,...] #40 values for the 1000 run
list2resized = [(5.6*25+7.8*0.625)/25.625, (7.8*24.375+8.9*1.25)/25.625, (8.9*23.75+13.4*1.875)/25.625, ...] # 40 values, averaged accordingly, for the 1025 run
list3resized = [(4.1*25+5.6*25+10.3*13.125)/(63.125), (10.3*11.875+9.8*25+40*25+30*1.25)/(63.125),...] # 40 values, averaged accordingly, for the 2525 run

To obtain these average values for each element of a resized list, we take the weighted average over the new, resized bins (i.e., bins of width 1000/40 = 25 for list1, 1025/40 = 25.625 for list2, 2525/40 = 63.125 for list3, etc.). The same lists again, but with the formulas I used for the weighted averages written out:

list1resized = [4.8*25/25, 6.9*25/25,...] #40 values for the 1000 run
list2resized = [(5.6*25+7.8*0.625)/25.625, (7.8*24.375+8.9*(25.625-24.375))/25.625, (23.75*8.9+(25.625-23.75)*13.4)/25.625, ...] # 40 values, averaged accordingly, for the 1025 run
list3resized = [(4.1*25+5.6*25+10.3*13.125)/63.125, (10.3*(25-13.125)+9.8*25+40*25+30*(63.125-25*3+13.125))/63.125, ...] # 40 values, averaged accordingly, for the 2525 run
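Edit: to make the rule concrete, the most direct translation of it I can come up with is the double loop below, which weights each old value by its overlap with the new bin. It is far too slow and clunky for my data, which is exactly why I am asking:

# direct (slow) restatement of the rule above: every old window contributes
# to a new bin in proportion to its overlap with that bin
def rebin(old, nbins=40, window=25):
    width = window * len(old) / float(nbins)   # new bin width, e.g. 25.625
    new = []
    for i in range(nbins):
        lo, hi = i * width, (i + 1) * width
        total = 0.0
        for j, v in enumerate(old):
            # overlap of old window [j*window, (j+1)*window) with [lo, hi)
            overlap = min(hi, (j + 1) * window) - max(lo, j * window)
            if overlap > 0:
                total += v * overlap
        new.append(total / width)
    return new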

As you can see, it can get messy and hard to deal with, but I am looking for a pythonic, elegant, and fast solution to the problem.

I have to do this for many lists, many times, so it would be nice to take run time into consideration.

Not sure if you have any ideas, but help would be greatly appreciated.

Thanks.

Dnaiel
  • I am actually quite clueless on how to code such a framework in Python. I am struggling with the fact that I'd have to take different sublists each time, and their lengths would change... not sure how to do it, actually. – Dnaiel Oct 16 '12 at 17:56
  • If you want your question to focus on the best mathematical algorithm to do this, [scicomp.SE] might actually be the place to ask. Then you could ask here about how to implement that algorithm in Python. – David Z Oct 16 '12 at 19:49
  • I am glad other people seemed to understand, and that you apparently got the help you needed. To me, this is clear as mud. I can see how you arrived at 25, 25.625, and 63.125; but beyond that I'm completely lost. – John Y Oct 17 '12 at 04:44

3 Answers


How about this funky [maybe] solution?

First the list of measurements...

l = [5.6, 7.8, 8.9, 13.4]

Copy each measurement 25 times (once for each second...)

import itertools

l1 = [item for sublist in [list(itertools.repeat(k, 25)) for k in l] for item in sublist]

Normalize over each second:

l2 = map(lambda x: x / 25., l1)

Refer to this SO post for a function (copied below) that slices a list into n almost equal partitions:

Python: Slicing a list into n nearly-equal-length partitions

def partition(lst, n):
    division = len(lst) / float(n)
    return [ lst[int(round(division * i)): int(round(division * (i + 1)))] for i in xrange(n) ]

Define the new length of your list

new_len = 2

Chop up your per-second list into the number of sections you want:

l3 = partition(l2, new_len)

Sum the values for each second in each partition

l4 = map(sum, l3)

Normalize for the difference in the size of the lists

l5 = map(lambda x: x * new_len / float(len(l)), l4)

Admire the results:

print l5
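Putting the steps together into one runnable snippet (Python 2), with the expected output for this example:

import itertools

def partition(lst, n):
    division = len(lst) / float(n)
    return [ lst[int(round(division * i)): int(round(division * (i + 1)))] for i in xrange(n) ]

l = [5.6, 7.8, 8.9, 13.4]
new_len = 2
l1 = [item for sublist in [list(itertools.repeat(k, 25)) for k in l] for item in sublist]
l2 = map(lambda x: x / 25., l1)
l3 = partition(l2, new_len)
l4 = map(sum, l3)
l5 = map(lambda x: x * new_len / float(len(l)), l4)
print l5   # -> [6.7, 11.15] (up to float rounding)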
juniper-
    Or in short, `[sum(t) * new_len / float(len(l)) for t in partition(itertools.chain(itertools.repeat(k/25., 25) for k in l), new_len)]` (well, something like that) – David Z Oct 16 '12 at 20:30
  • I had the thought of chopping things up into seconds, but quite frankly, using seconds as a unit of time is entirely unnecessary imo. A ratio based solution that completely ignores the 'seconds' parameter (as seconds are a function of the length of the list in the first place) makes more mathematical sense. – kreativitea Oct 16 '12 at 20:47
  • @kreativitea: You're probably right, but it might be tougher to stuff into a one-liner like David Zaslavsky proposed. – juniper- Oct 16 '12 at 20:58
  • @juniper I think the key to finding an good solution is to find a way to generate the coefficients of the sublists. The sum of the coefficients of each sublist is equal to the weighted average ratio--`[(4.1*25+5.6*25+10.3*13.125)/(63.125),`: in this, `25+25+13.125 = 63.125`. I'm not very good with modulo arithmetic to generate these kinds of numbers on the fly, but I'm sure there's a good way to derive the series `1,1,2.525` from those numbers. – kreativitea Oct 16 '12 at 21:14
  • @DavidZaslavsky Thanks. I tried your solution but gave me an error: TypeError: object of type 'itertools.chain' has no len(). Not sure what it means. juniper, thanks, awesome solution, it's a little bit slow for very long datasets but it works. – Dnaiel Oct 17 '12 at 00:34
  • @Dnaiel oh, yeah, I just wrote that off the top of my head without testing it (hence the "something like that"). That error comes because `itertools.chain` returns a generator but `partition` needs a list, so you could fix it like `partition(list(itertools.chain(...)))`. But it's just meant to be a condensed version of juniper-'s solution, it doesn't add anything new. – David Z Oct 17 '12 at 00:40
  • @DavidZaslavsky aha thanks for your explanation david, really nice condensed version though – Dnaiel Oct 17 '12 at 00:41

This is a pretty difficult problem, but I think you're making it more complicated than it actually is. I'll start off with a few observations.

Observation 1. You can factor a lot of things out and defer them until the end to reduce the coding involved. Instead of dividing and multiplying by 25 in every term (which gets real complex real quick), save that operation until the end.

list2resized = [i/25.625 for i in [(5.6*25+7.8*0.625), 
                                   (7.8*24.375+8.9*(25.625-24.375)), 
                                   (23.75*8.9+(25.625-23.75)*13.4), ...]]

# consider using ratios, rather than dividing by the raw bin width
list2resized = [i / 1.025 for i in [(5.6 * 1 + 7.8 * .025), 
                                    (7.8 * .975 + 8.9 * .050), 
                                    (8.9 * .95 + 13.4 * .075), ...]]

Observation 2. The coefficient offsets of the successive terms therefore step up by 25 each time. Save the division until afterwards -- you can multiply the whole equation by 1000 and use a modulus operator, if you choose to...

list2resized = [i / 1025. for i in [(5.6 * 1000 + 7.8 * 25), # 1025 steps in
                                     (7.8 * 975 + 8.9 * 50),  # 2050 steps in
                                     (8.9 * 950 + 13.4 * 75), # 3075 steps in
                                     ...]]

Observation 3.

Each 'bin' in the final resize needs to be 1.025 long (given 41 starting bins; in general it depends on the length of the list to be adjusted): 1.0 * list[0] + .025 * list[1]. Considering observation 2, you can rewrite the entire equation as a series--

# the sum of the coefficients is always equal to the resize ratio
(1 * n1) + (.025 * n2)
(.975 * n2) + (.050 * n3) 
(.950 * n3) + (.075 * n4)
...

etc.

Now, you can generate these coefficients--

a = [i/40.0 for i in range(0, 40)][1:]
b = [1 - i/40.0 for i in range(0, 40)]

But these are easy because the 'rotation' never catches up with itself. All you have to do is iterate over the coefficients in each bin for each respective part of the equation, zip them together, and sum them. Note that this only works when the list is compressed to, at most, half its original size. When that is the case, this is the algorithm you should use: it is just the creation of a list of numbers followed by a multiplication in a list comprehension, and it will be significantly faster than anything else you can throw at it.
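In code, that simple case looks something like this (a sketch for a 41-value input, with the coefficient lists adjusted so the very last bin also gets its full final term):

# 41 -> 40 bins: new bin i takes b[i] of data[i] and a[i] of data[i+1];
# each pair of coefficients sums to the resize ratio 41/40 = 1.025
data = [5.6, 7.8, 8.9, 13.4] + [0.0] * 37    # stand-in 41-value list
a = [(i + 1) / 40.0 for i in range(40)]      # .025, .050, ..., 1.0
b = [1 - i / 40.0 for i in range(40)]        # 1.0, .975, ..., .025
resized = [(b[i] * data[i] + a[i] * data[i + 1]) / 1.025 for i in range(40)]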

But the complex case is the example where you have 101 numbers, where a single bin can span more than two terms (and sometimes four!)...

101/40.0 = 2.525 
# your bins need to be 2.525 units long.  

data = [4.1, 5.6, 10.3, 9.8, 40, 30, 21.4, 3, 2,...]

# calculated by hand
(1 * n1) + (1 * n2) + (.525 * n3) 
(.475 * n3) + (1 * n4) + (1 * n5) + (.05 * n6)
(.95 * n6) + (1 * n7) + (.575 * n8)
(.425 * n8) + (1 * n9) + (1 * n10) + (.100 * n11)

So, we need a better way to generate coefficients. As noted in observation 3, the sum of the coefficients in each of the final terms is the ratio of old items to new items.

101:40 = 2.525:1
41:40 = 1.025:1

Generating coefficients comes next. We're going to use a list-in-list data structure that iterates through the sublists until nothing is left.

[(1, 1, .525), (.475, 1, 1, .05) ...]

The first sublist maps to item 1 in your new list, the second sublist to item 2, and so forth, all the way to the end. The sum of all the coefficients in all the sublists should be equal to the number of items n (in this case 101) in the original list.

I'm going to go ahead and post this now, since I have to actually do work. I'll try to come back and work on it later.

/edit

Here is a function to generate the coefficients.

# example parameters for the 101 -> 40 case: scale everything by 1000, so each
# original item weighs n = 1000 and each new bin is d = 2525 wide (2.525:1)
n = 1000
d = 2525
items = 101
def coefficients(n, d, items):
    start = [n for i in xrange(items)]  # every original item starts with full weight n
    result = []
    p = []                              # coefficients of the current new bin
    while True:
        while sum(p) < d:               # fill the bin up to width d
            try:
                p.append(start.pop())
            except IndexError:
                return result           # ran out of items: done
        extra = sum(p) % d              # how far the last item overshoots the bin
        p[-1] = n - extra               # keep only the part that falls inside this bin
        result.append(p)
        p = [extra]                     # the overshoot seeds the next bin

Iterate over the coefficients to return your final list of 40. Let me know if you need any more help.
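To do that final iteration, something like the following untested sketch should work (resize is just an illustrative name); the one subtlety is that each sublist starts on the item where the previous one left off:

def resize(data, coefs, d):
    out = []
    idx = 0
    for c in coefs:
        # weight the run of items this sublist covers, then normalize by the bin width d
        out.append(sum(w * x for w, x in zip(c, data[idx:idx + len(c)])) / float(d))
        idx += len(c) - 1   # the next sublist re-uses the last item
    return out

# e.g. for the 101 -> 40 case:
# list3resized = resize(list3, coefficients(1000, 2525, 101), 2525)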

kreativitea
  • thanks a lot, great ideas here. If you have time i'd still love to hear more from you, but so far this is a very good thought. It is definitely the fastest implementation and since my lists are quite long it'd be nice to have a fast option. – Dnaiel Oct 17 '12 at 00:37
  • I was thinking along these lines myself, but I didn't have time to work out the details. Nice :-) – David Z Oct 17 '12 at 00:40
  • @Dnaiel I added a function to generate coefficients, this should now be a complete answer. – kreativitea Oct 17 '12 at 18:01

I'm still fairly new to Python, so you'll need someone else to rate this for pythonicity, elegance, and speed.

class StretchableList(list):
    def stretch(self, newlen):
        # place old and new sample points on a common integer grid of length
        # (len(self)-1)*(newlen-1), so both sets of positions are exact
        old = [ (i * (newlen-1), self[i]) for i in range(len(self)) ]
        new = [ i * (len(self)-1) for i in range(newlen) ]
        self[:] = []
        for n in new:
            # drop old points until n lies between the first two remaining ones
            while len(old) > 1 and n >= old[1][0]:
                old.pop(0)
            if old[0][0] == n:
                # n coincides with an old sample point: copy it straight over
                self.append(old[0][1])
            else:
                # otherwise interpolate linearly between the two old points
                self.append( old[0][1] + \
                             float((n-old[0][0]))/(old[1][0]-old[0][0]) * \
                             (old[1][1]-old[0][1]) )
        return self

Basically, this defines a subclass of list that just adds a method called stretch. Call it with the desired new length, and it will stretch or compress the list to that length. I performed the averaging a little differently than you did: this interpolates linearly between the old sample points rather than taking area-weighted bin averages, so it may not be exactly equivalent, but I'm assuming you can modify the math part as needed.
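For example (values picked just to show the interpolation):

sl = StretchableList([1.0, 2.0, 4.0, 8.0, 16.0])
print sl.stretch(9)   # [1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 8.0, 12.0, 16.0]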

glibdud
  • @glibdud. Thanks for your answer, I haven't tested it yet but it seems to be a good one. I'll compare it to the other ones for speed purposes. – Dnaiel Oct 17 '12 at 00:39