1

I am having a csv file , which contains a range of numbers in a column. I also have list of valid range. I need identify the possible range for the matching numbers in csv.

valid_range = ["0-1", "1-2", "0-5", "5-10", "10-15", "15-20", ">20","mixed"]

csv data :

What I have:

enter image description here

What I need:

enter image description here

Anything greater than 20 will have ">20" and mixed will have the same.

df = {'Gender': {0: 'male',
  1: 'male',
  2: 'male',
  3: 'female',
  4: 'male',
  5: 'male',
  6: 'male',
  7: 'male',
  8: 'male',
  9: 'mixed',
  10: 'female'},
 'Range': {0: 'mixed',
  1: 'mixed',
  2: '5-6',
  3: '4-5',
  4: '5-6',
  5: '12-14',
  6: '9-10',
  7: 'mixed',
  8: '17-18',
  9: 'mixed',
  10: '>20'}}
Devang Sanghani
  • 731
  • 5
  • 14
milas86534
  • 39
  • 4

1 Answers1

0

If you don't have overlap, you can rely on the upper bound.

Then use of a bit or regex magic and pandas.cut:

# non overlapping ranges
valid_range = ["0-5", "5-10", "10-15", "15-20", ">20", "mixed"]

# define bins (could also be coded as list directly!)
bins = (pd.Series(valid_range[:-2]).str.extract('(\d+)$', expand=False)
          .astype(int).to_list()
        )
bins = [0]+bins+[float('inf')]
# [0, 5, 10, 15, 20, inf]

df['Valid_Range'] = (           # replace ">x" with upper bound + 1
 pd.cut(df['Range'].str.replace('>.*', f'{bins[-1]+1}', regex=True)
                  # extract trailing digits 
                  .str.extract('(\d+)$', expand=False).astype(float),
        bins=bins, labels=valid_range[:-1])
    .values.add_categories(valid_range[-1])  # cut output is Categorical
    .fillna(valid_range[-1])  # fill nans that correspond to initial "mixed"
)

output:

    Gender  Range Valid_Range
0     male  mixed       mixed
1     male  mixed       mixed
2     male    5-6        5-10
3   female    4-5         0-5
4     male    5-6        5-10
5     male  12-14       10-15
6     male   9-10        5-10
7     male  mixed       mixed
8     male  17-18       15-20
9    mixed  mixed       mixed
10  female    >20         >20
mozway
  • 194,879
  • 13
  • 39
  • 75