I have some messy data that I'm passing through a function. The function below tries to take an average. Sometimes items in the list aren't numbers, and converting them throws an error.

I tried to use a regex to strip non-numeric characters, but some bad values are still getting through. Any time a bad value shows up (due to messy data) I just want a 0 recorded for that item in the list.

def mean(vals):
    if len(vals) == 0:
        return 0.0

    for val in vals:
        val = re.sub("[^0-9.]", "", str(val))
    print vals
    vals = [float(val) for val in vals]
    return sum(vals) / len(vals)

I'm printing the list of vals just to see where I'm throwing an error. The last vals list is:

['</a>']

How is this possible, given I've regexed everything that isn't a number or a period?

Chris J. Vargo

3 Answers

This line does not change the contents of vals (see "Modification of the list items in the loop (python)"):

val = re.sub("[^0-9.]", "", str(val))

Instead, you could loop through the index of the list and change its content directly.
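A minimal sketch of that index-based approach, assuming the same regex as in the question. The helper name strip_non_numeric and the sample list are made up for illustration:

```python
import re

def strip_non_numeric(vals):
    """Overwrite each list item with only its digits and periods."""
    for i, val in enumerate(vals):
        # Assigning through the index changes the list itself,
        # unlike rebinding the loop variable.
        vals[i] = re.sub(r"[^0-9.]", "", str(val))
    return vals

vals = ["12.5", "</a>", "$3"]
strip_non_numeric(vals)
# vals is now ["12.5", "", "3"]
```

Note that an item like '</a>' becomes an empty string, and float('') still raises ValueError, so some fallback for empty values would still be needed before averaging.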

zw324

Instead of an re.sub, use try/except...

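def mean(vals):
    total = 0.0
    length = 0
    for val in vals:
        try:
            total += float(val)
        except (ValueError, TypeError):
            # A bad value adds nothing to the total, so it counts as 0.
            pass
        # Every item counts toward the length, including bad ones.
        length += 1
    return total / length if length else 0.0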
Jon Clements

You are not changing your list in the for loop. You are only rebinding the loop variable val, and that has no effect on the list.

To change your list you should do something like this:

>>> vals = [re.sub("[^0-9.]", "", str(val)) for val in vals]
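One caveat: after the substitution, a value such as '</a>' becomes an empty string, which float() still rejects. A rough sketch of one way to combine the cleanup with the "record a 0" requirement from the question follows. The helper name to_float_or_zero is hypothetical, and the regex is the one from the question:

```python
import re

def to_float_or_zero(val):
    """Strip non-numeric characters, then fall back to 0.0 if nothing usable is left."""
    # Caveat: this pattern also removes minus signs, so "-5" becomes 5.0.
    cleaned = re.sub(r"[^0-9.]", "", str(val))
    try:
        return float(cleaned)
    except ValueError:
        # Covers "", ".", "1.2.3" and similar leftovers.
        return 0.0

def mean(vals):
    if not vals:
        return 0.0
    nums = [to_float_or_zero(v) for v in vals]
    return sum(nums) / len(nums)

result = mean(["4", "</a>", "2"])  # the bad item counts as 0, so this is 6 / 3
```

Because the bad item is kept as 0.0 rather than dropped, it still counts toward the length of the list, which is what the question asks for.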
Rodrigo López