I need to apply a function that splits multiple numbers from the fields of a dataframe.
In this dataframe there a all the kids' measurements that are needed for a school: Name, Height, Weight, and Unique Code, and their dream career.
- The name is only formed of alpha-characters. But some kids might have both first name and middle name. (e.g. Vivien Ester)
- The height is known to be >= 100 cm for every child.
- The weight is known to be < 70 kg for every child.
- The unique code is known to be any number, but it is associated with the letters 'AX', for every child. But the AX may not always be stick to the number (e.g. 7771AX), it might be a space next to it. (e.g. 100 AX)
- Every kid has its dream career
They could appear in any order, but they always follow the rules from above. However, for some kids some measurements could not appear (e.g.: height or unique code or both are missing or all are missing).
So the dataframe is this:
data = { 'Dream Career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor', 'Fashion Designer', 'Teacher', 'Architect'],
'Measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1', 'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20', 'Oliver 40.5', 'Julien 35.1 678 AX 111.1', 'Bee 20.0 100.80 88AX']
}
df = pd.DataFrame (data, columns = ['Dream Career','Measurements'])
And it looks like this:
Dream Career Measurements
0 Scientist Rachel 24.3 100.25 100 AX
1 Astronaut 100.5 Tuuli 30.1
2 Software Engineer Michael 28.0 7771AX 113.75
3 Doctor Vivien Ester 40AX 115.20
4 Fashion Designer Oliver 40.5
5 Teacher Julien 35.1 678 AX 111.1
6 Architect Bee 20.0 100.80 88AX
I try to split all of these measurements into different columns, based on the specified rules.
So the final dataframe should look like this:
Dream Career Names Weight Height Unique Code
0 Scientist Rachael 24.3 100.25 100AX
1 Astronaut Tuuli 30.1 100.50 NaN
2 Software Engineer Michael 28.0 113.75 7771AX
3 Doctor Vivien Ester NaN 115.20 40AX
4 Fashion Designer Oliver 40.5 NaN NaN
5 Teacher Julien 35.1 111.10 678AX
6 Architect Bee 10.0 100.80 88AX
I tried this code and it works very well, but only on single strings. And I need to do this while in the dataframe and still keep every's kid's associate dream career (so the order is not lost).
num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
def get_weight_height(s):
nums = re.findall(num_rx, s)
height = np.nan
weight = np.nan
if (len(nums) == 0):
height = np.nan
weight = np.nan
elif (len(nums) == 1):
if float(nums[0]) >= 100:
height = nums[0]
weight = np.nan
else:
weight = nums[0]
height = np.nan
elif (len(nums) == 2):
if float(nums[0]) >= 100:
height = nums[0]
weight = nums[1]
else:
height = nums[1]
weight = nums[0]
return height, weight
class_code = {'Small': 'AX', 'Mid': 'BX', 'High': 'CX'}
def hasNumbers(inputString):
return any(char.isdigit() for char in inputString)
def extract_measurements(string, substring_name):
height = np.nan
weight = np.nan
unique_code = np.nan
name = ''
if hasNumbers(string):
num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
nums = re.findall(num_rx, string)
if (substring_name in string):
special_match = re.search(rf'{num_rx}(?=\s*{substring_name}\b)', string)
if special_match:
unique_code = special_match.group()
string = string.replace(unique_code, '')
unique_code = unique_code + substring_name
if len(nums) >= 2 & len(nums) <= 3:
height, weight = get_weight_height(string)
else:
height, weight = get_weight_height(string)
name = " ".join(re.findall("[a-zA-Z]+", string))
name = name.replace(substring_name,'')
return format(float(height), '.2f'), float(weight), unique_code, name
And I apply it like this:
string = 'Anya 101.30 23 4546AX'
height, weight, unique_code, name = extract_measurements(string, class_code['Small'])
print( 'name is: ', name, '\nh is: ', height, '\nw is: ', weight, '\nunique code is: ', unique_code)
The results are very good.
I tried to apply the function on the dataframe, but I don't know how, I tried this as I got inspired from this and this and this... but they are all different than my problem:
df['height'], df['weight'], df['unique_code'], df['name'] = extract_measurements(df['Measurements'], class_code['Small'])
I cannot figure out how to apply it on my dataframe. Please help me.
I am at the very beginning, I highly appreciate all the help if you could possibly help me!