3

I need to apply a function that splits multiple numbers from the fields of a dataframe.

In this dataframe there a all the kids' measurements that are needed for a school: Name, Height, Weight, and Unique Code, and their dream career.

  • The name is only formed of alpha-characters. But some kids might have both first name and middle name. (e.g. Vivien Ester)
  • The height is known to be >= 100 cm for every child.
  • The weight is known to be < 70 kg for every child.
  • The unique code is known to be any number, but it is associated with the letters 'AX', for every child. But the AX may not always be stick to the number (e.g. 7771AX), it might be a space next to it. (e.g. 100 AX)
  • Every kid has its dream career

They could appear in any order, but they always follow the rules from above. However, for some kids some measurements could not appear (e.g.: height or unique code or both are missing or all are missing).

So the dataframe is this:

data = { 'Dream Career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor', 'Fashion Designer', 'Teacher', 'Architect'],
    'Measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1', 'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20', 'Oliver 40.5', 'Julien 35.1 678 AX 111.1', 'Bee 20.0 100.80 88AX']
       }

df = pd.DataFrame (data, columns = ['Dream Career','Measurements'])

And it looks like this:

        Dream Career                Measurements
0          Scientist   Rachel 24.3 100.25 100 AX
1          Astronaut            100.5 Tuuli 30.1
2  Software Engineer  Michael 28.0 7771AX 113.75
3             Doctor    Vivien Ester 40AX 115.20
4   Fashion Designer                 Oliver 40.5
5            Teacher    Julien 35.1 678 AX 111.1
6          Architect        Bee 20.0 100.80 88AX

I try to split all of these measurements into different columns, based on the specified rules.

So the final dataframe should look like this:

       Dream Career         Names  Weight  Height Unique Code
0          Scientist       Rachael    24.3  100.25       100AX
1          Astronaut         Tuuli    30.1  100.50         NaN
2  Software Engineer       Michael    28.0  113.75      7771AX
3             Doctor  Vivien Ester     NaN  115.20        40AX
4   Fashion Designer        Oliver    40.5     NaN         NaN
5            Teacher        Julien    35.1  111.10       678AX
6          Architect           Bee    10.0  100.80        88AX

I tried this code and it works very well, but only on single strings. And I need to do this while in the dataframe and still keep every's kid's associate dream career (so the order is not lost).

num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
def get_weight_height(s):
    nums = re.findall(num_rx, s)
    height = np.nan
    weight = np.nan
    if (len(nums) == 0):
        height = np.nan
        weight = np.nan
    elif (len(nums) == 1):
        if float(nums[0]) >= 100:
            height = nums[0]
            weight = np.nan
        else:
            weight = nums[0]
            height = np.nan
    elif (len(nums) == 2):
        if float(nums[0]) >= 100:
            height = nums[0]
            weight = nums[1]
        else:
            height = nums[1]
            weight = nums[0]
    return height, weight

class_code = {'Small': 'AX', 'Mid': 'BX', 'High': 'CX'}

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

def extract_measurements(string, substring_name):
    height = np.nan
    weight = np.nan
    unique_code = np.nan
    name = ''
    if hasNumbers(string):
        num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
        nums = re.findall(num_rx, string)
        if (substring_name in string):
            special_match = re.search(rf'{num_rx}(?=\s*{substring_name}\b)', string)
            if special_match:
                unique_code = special_match.group()
                string = string.replace(unique_code, '')
                unique_code = unique_code + substring_name
            if len(nums) >= 2 & len(nums) <= 3:
                height, weight = get_weight_height(string)
        else:
            height, weight = get_weight_height(string)
    name = " ".join(re.findall("[a-zA-Z]+", string))
    name = name.replace(substring_name,'')
    return format(float(height), '.2f'), float(weight), unique_code, name

And I apply it like this:

string = 'Anya 101.30 23 4546AX'
height, weight, unique_code, name = extract_measurements(string, class_code['Small'])        
print( 'name is: ', name, '\nh is: ', height, '\nw is: ', weight, '\nunique code is: ', unique_code)

The results are very good.

I tried to apply the function on the dataframe, but I don't know how, I tried this as I got inspired from this and this and this... but they are all different than my problem:

df['height'], df['weight'], df['unique_code'], df['name'] = extract_measurements(df['Measurements'], class_code['Small']) 

I cannot figure out how to apply it on my dataframe. Please help me.

I am at the very beginning, I highly appreciate all the help if you could possibly help me!

Elisa L.
  • 267
  • 1
  • 8
  • Why not split up the data *before* making the DataFrame? – Karl Knechtel Apr 01 '21 at 21:59
  • Because the DataFrame is given initially to me and I have to work on it :( @KarlKnechtel – Elisa L. Apr 01 '21 at 22:01
  • So I just have to take each field of the Measurements column and work on it because it already represents a string, but I don't know how. I tried with *loc* from pandas, but I still didn't manage to – Elisa L. Apr 01 '21 at 22:03

2 Answers2

1

Use apply for rows (axis=1) and choose 'expand' option. Then rename columns and concat to the original df:

pd.concat([df,(df.apply(lambda row : extract_measurements(row['Measurements'], class_code['Small']), axis = 1, result_type='expand')
   .rename(columns = {0:'height', 1:'weight', 2:'unique_code', 3:'name'})
)], axis = 1)

output:

    Dream Career       Measurements                  height    weight  unique_code    name
--  -----------------  --------------------------  --------  --------  -------------  ------------
 0  Scientist          Rachel 24.3 100.25 100 AX        100       100  100AX          Rachel
 1  Astronaut          100.5 Tuuli 30.1                 100       100  nan            Tuuli
 2  Software Engineer  Michael 28.0 7771AX 113.75       100       100  7771AX         Michael
 3  Doctor             Vivien Ester 40AX 115.20         100       100  40AX           Vivien Ester
 4  Fashion Designer   Oliver 40.5                      100       100  nan            Oliver
 5  Teacher            Julien 35.1 678 AX 111.1         100       100  678AX          Julien
 6  Architect          Bee 20.0 100.80 88AX             100       100  88AX           Bee

(note I stubbed def get_weight_height(string) function because your coded did not include it, to always return 100,100)

piterbarg
  • 8,089
  • 2
  • 6
  • 22
0

@piterbarg's answer seems efficient given the original functions, but the functions seem verbose to me. I'm sure there's a simpler solution here that what I'm doing, but what I have below replaces the functions in OP with I think the same results.

First changing the column names to snake case for ease:

df = pd.DataFrame({
     'dream_career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor',
                      'Fashion Designer', 'Teacher', 'Architect'],
     'measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1',
                      'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20',
                      'Oliver 40.5', 'Julien 35.1 678 AX 111.1',
                      'Bee 20.0 100.80 88AX']
})

First the strings in .measurements are turned into lists. From here on list comphrehensions will be applied to each list to filter values.

df.measurements = df.measurements.str.split()

0    [Rachel, 24.3, 100.25, 100, AX]
1               [100.5, Tuuli, 30.1]
2    [Michael, 28.0, 7771AX, 113.75]
3      [Vivien, Ester, 40AX, 115.20]
4                     [Oliver, 40.5]
5     [Julien, 35.1, 678, AX, 111.1]
6          [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object

The second step is filtering out the 'AX' from .measurements and appending 'AX' to all integers. This assumes this example is totally reproducible and all the height/weight measurements are floats, but a different differentiator could be used if this isn't the case.

df.measurements = df.measurements.apply(
     lambda val_list: [val for val in val_list if val!='AX']
).apply(
      lambda val_list: [str(val)+'AX' if val.isnumeric() else val
                        for val in val_list]
)

0      [Rachel, 24.3, 100.25, 100AX]
1               [100.5, Tuuli, 30.1]
2    [Michael, 28.0, 7771AX, 113.75]
3      [Vivien, Ester, 40AX, 115.20]
4                     [Oliver, 40.5]
5       [Julien, 35.1, 678AX, 111.1]
6          [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object

.name and .unique_code are pretty easy to grab. With .unique_code I had to apply a second lambda function to insert NaNs. If there are missing values for .name in the original df the same thing will need to be done there. For cases of multiple names, these are joined together separated with a space.

df['name'] = df.measurements.apply(
    lambda val_list: ' '.join([val for val in val_list if val.isalpha()])
)

df['unique_code'] = df.measurements.apply(
    lambda val_list: [val for val in val_list if 'AX' in val]
).apply(
    lambda x: np.nan if len(x)==0 else x[0]
)

For height and weight I needed to create a column of numerics first and work off that. In cases where there are missing values I'm having to come back around to deal with those.

import re

df['numerics'] = df.measurements.apply(
    lambda val_list: [float(val) for val in val_list
                      if not re.search('[a-zA-Z]', val)]
)

df['height'] = df.numerics.apply(
    lambda val_list: [val for val in val_list if val < 70]
).apply(
    lambda x: np.nan if len(x)==0 else x[0]
)

df['weight'] = df.numerics.apply(
    lambda val_list: [val for val in val_list if val >= 100]
).apply(
    lambda x: np.nan if len(x)==0 else x[0]
)

Finally, .measurements and .numerics are dropped, and the df should be ready to go.

df = df.drop(columns=['measurements', 'numerics'])

        dream_career          name unique_code  height  weight
0          Scientist        Rachel       100AX    24.3  100.25
1          Astronaut         Tuuli         NaN    30.1  100.50
2  Software Engineer       Michael      7771AX    28.0  113.75
3             Doctor  Vivien Ester        40AX     NaN  115.20
4   Fashion Designer        Oliver         NaN    40.5     NaN
5            Teacher        Julien       678AX    35.1  111.10
6          Architect           Bee        88AX    20.0  100.80
semblable
  • 773
  • 1
  • 8
  • 26
  • 1
    Thank you very much! This is a much clear code that resolve the issues than mine. It just has a small problem, for example if in measurements it comes 'Michell 180', then it will automatically put 180AX in the unique_code column, instead of putting it as the height. How this issue could be resolved? @k_n_c – Elisa L. Apr 02 '21 at 12:59
  • So this is what I was talking about with assuming height & weight are always floats. Since height is sometimes an integer, you need some other differentiator to know which one to treat as height and which one to treat part of a unique code. I'm guessing order might be the differentiator, so an integer immediately preceeding `"AX"` is always part of a unique code — is this right? (Also, ideally you'd want to include this case in the example df so that it's minimally reproducible.) – semblable Apr 02 '21 at 13:56
  • Also if this answer is helpful, please upvote & checkmark! Thanks. – semblable Apr 02 '21 at 13:57