
I have a DataFrame with a column of integers that I would like to combine with a column of string values. Both columns are of object dtype. The problem is that these columns can also contain NaN.

The solutions I have been able to find result in different errors or undesirable outcomes.

My dataframe looks like the one below:

  dosagedurationunit dosagequantityvalue dosagequantityunit quantityvalue
0                day                 NaN                NaN           NaN
1                day                 NaN          tablet(s)           NaN
2                day                   2                NaN           NaN
3                day                   1          tablet(s)           NaN
4                day                   2          tablet(s)           NaN

Code to create the dataframe:

df = pd.DataFrame([["day",None,None,None],["day",None,"tablet(s)",None],["day",2,"tablet(s)",None],["day",1,"tablet(s)",None],["day",2,"tablet(s)",None]], columns=["dosagedurationunit","dosagequantityvalue","dosagequantityunit","quantityvalue"])

The answer below works when both columns are already strings: Combine pandas string columns with missing values

  • Converting the columns to str dtype prior to concatenation results in 'nan' strings such as "nan tablet(s)" (see the snippet after this list).
  • Using the code below raises a TypeError when one of the columns to be 'concatenated' contains integers:

df['DOSE'] = df[['dosagequantityvalue', 'dosagequantityunit']].apply(
    lambda x: None if x.isnull().all() else ' '.join(x.dropna()), axis=1)

  • TypeError: sequence item 0: expected str instance, int found
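
For reference, a minimal reproduction of the 'nan' artifact from the first bullet, run against the sample df above:

# naive str conversion: str() renders NaN as the literal string 'nan'
s = df['dosagequantityvalue'].astype(str) + ' ' + df['dosagequantityunit'].astype(str)
print(s[1])  # 'nan tablet(s)'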

Desired output dataframe:

  dosagedurationunit dosagequantityvalue dosagequantityunit quantityvalue NORMALIZED_DOSE
0                day                 NaN                NaN           NaN             NaN
1                day                 NaN          tablet(s)           NaN       tablet(s)
2                day                   2                NaN           NaN               2
3                day                   1          tablet(s)           NaN     1 tablet(s)
4                day                   2          tablet(s)           NaN     2 tablet(s)

Realistically, a NORMALIZED_DOSE of NaN or "tablet(s)" provides zero information. I could just drop all rows where dosagequantityvalue is NaN, but I don't know whether that will hold on a production/non-sample dataset. Besides, I still need a function that handles this operation gracefully.

How can I concatenate two columns (dosagequantityvalue & dosagequantityunit) into a new column (NORMALIZED_DOSE) while handling cases where there may be integers and NaN values in one or both columns?


2 Answers


Update

If I put my answer in a function replace_concat_replace():

def replace_concat_replace(df):
    df = df.replace(np.nan, '').astype(str)  # NaN -> '' so the string concat works
    s = df.dosagequantityvalue + ' ' + df.dosagequantityunit
    s = s.str.strip().replace('', np.nan)  # trim stray separators, then '' -> NaN
    return s

Then its %timeit is ~2.5x faster than concat_df_cols_new():

>>> %timeit df['NORMALIZED_DOSAGE'] = replace_concat_replace(df[['dosagequantityvalue', 'dosagequantityunit']])
41.7 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['NORMALIZED_DOSAGE'] = concat_df_cols_new(df[['dosagequantityvalue', 'dosagequantityunit']])
105 ms ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

However, concat_df_cols_new() can handle arbitrary columns.
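
For what it's worth, the replace-concat-replace idea can also be generalized to arbitrary columns without apply(). Below is a minimal sketch (replace_concat_replace_any() is a hypothetical name, not from the discussion above, and it assumes cell values never contain the separator themselves):

import re
import numpy as np
import pandas as pd

def replace_concat_replace_any(df, sep=' '):
    # NaN -> '' so the vectorized string concat works on mixed dtypes
    df = df.replace(np.nan, '').astype(str)
    # row-wise concat of the first column with all remaining columns
    s = df.iloc[:, 0].str.cat(df.iloc[:, 1:], sep=sep)
    # collapse separator runs left by empty cells, then trim the edges
    s = s.str.replace(f'{re.escape(sep)}+', sep, regex=True).str.strip(sep)
    return s.replace('', np.nan)

On the sample df, df['NORMALIZED_DOSAGE'] = replace_concat_replace_any(df[['dosagequantityvalue', 'dosagequantityunit']]) should give the same result as replace_concat_replace().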


Original answer

You can replace() every nan with an empty string '' before concatenating, then replace() them back after the concat.

Note that I used nan instead of None to match your tables.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['day', np.nan, np.nan, np.nan],
     ['day', np.nan, 'tablet(s)', np.nan],
     ['day', 2, np.nan, np.nan],
     ['day', 1, 'tablet(s)', np.nan],
     ['day', 2, 'tablet(s)', np.nan]],
    columns=['dosagedurationunit', 'dosagequantityvalue', 'dosagequantityunit', 'quantityvalue'])

# replace nans with ''
df = df.replace(np.nan, '')

# concat value + unit
df['NORMALIZED_DOSAGE'] = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)

# put nans back
df = df.replace(r'^\s*$', np.nan, regex=True)

Output:

  dosagedurationunit dosagequantityvalue dosagequantityunit quantityvalue NORMALIZED_DOSAGE
0                day                 NaN                NaN           NaN               NaN
1                day                 NaN          tablet(s)           NaN         tablet(s)
2                day                 2.0                NaN           NaN               2.0
3                day                 1.0          tablet(s)           NaN     1.0 tablet(s)
4                day                 2.0          tablet(s)           NaN     2.0 tablet(s)
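
Note the 2.0 in the last three rows: since dosagequantityvalue contains NaN, pandas stores it as float64, so astype(str) renders 2 as '2.0'. If you need the exact desired output ('2 tablet(s)'), one possible tweak, assuming the values are always whole numbers when present, is to format them as ints during the concat (a sketch, starting from the original df):

# hypothetical variant: format whole-number floats as ints so '2.0' becomes '2'
val = df['dosagequantityvalue'].map(lambda v: '' if pd.isna(v) else str(int(v)))
unit = df['dosagequantityunit'].fillna('')
df['NORMALIZED_DOSAGE'] = (val + ' ' + unit).str.strip().replace('', np.nan)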
  • Thank you tdy! This approach was very helpful in getting me where I needed to be. I ended up doing a modified approach which I will share below. – bekfen Apr 06 '21 at 00:45

While looking for an optimized solution, I ended up with a modified version of the approach in tdy's answer and the one here: Combine pandas string columns with missing values

I needed to use this code repeatedly, so I turned it into a function. Hope this helps someone else who comes across the same problem:

# functions
def concat_df_cols(df, source_cols, target_col, sep=" ", na_rep=""):
    """Concatenate columns of mixed dtypes into one column, handling NaNs.

    Args:
        df (DataFrame): The dataframe to be modified.
        source_cols (list): The columns to concatenate.
        target_col (str): The destination column for the concatenated source columns.
        sep (str): The separator with which to concatenate the columns.
        na_rep (str): The replacement value for NaN values. Note: anything other
            than the default empty string will persist in the result after the
            concatenation.

    Returns:
        DataFrame: The modified dataframe.
    """
    df = df.replace(np.nan, na_rep)  # Replace NaNs with na_rep ('' by default)
    df[source_cols] = df[source_cols].astype(str)  # Convert cols to str to permit concatenation
    df = df.replace(r'^\s*$', np.nan, regex=True)  # Put NaNs back so dropna() can skip them
    # Concat source_cols into target_col, skipping NaNs per row
    df[target_col] = df[source_cols].apply(
        lambda x: None if x.isnull().all() else sep.join(x.dropna()), axis=1)
    return df


def concat_df_cols_fast(df, sep=" ", na_rep=""):
    """Concatenate columns of mixed dtypes into one Series, handling NaNs.

    Args:
        df (DataFrame): The dataframe containing only the source columns to concatenate.
        sep (str): The separator with which to concatenate the columns.
        na_rep (str): The replacement value for NaN values. Note: anything other
            than the default empty string will persist in the result after the
            concatenation.

    Returns:
        Series: The concatenated column.
    """
    # Replace NaNs first, then convert to str (converting first would turn
    # NaN into the literal string 'nan')
    df = df.fillna(na_rep).applymap(str)
    # Convert rows to lists, join them with the separator, trim stray
    # separators, and turn empty strings back into NaN
    arr = df.values.tolist()
    s = pd.Series([sep.join(x).strip(sep) for x in arr]).replace('^$', np.nan, regex=True)
    # Replace NaN with None; note the result has a fresh RangeIndex, so it
    # aligns by position only when assigned back to a dataframe
    s = s.where(s.notnull(), None)
    return s

# setup
df = pd.DataFrame(
    [['day', np.nan, np.nan, np.nan],
     ['day', np.nan, 'tablet(s)', np.nan],
     ['day', 2, np.nan, np.nan],
     ['day', 1, 'tablet(s)', np.nan],
     ['day', 2, 'tablet(s)', np.nan]],
    columns=['dosagedurationunit', 'dosagequantityvalue', 'dosagequantityunit', 'quantityvalue'])
# Make the df 50,000 rows
df = pd.concat([df] * 10000).reset_index(drop=True)

##### Approach 1 #####
# This approach took on average 0.27553908449 seconds
df['NORMALIZED_DOSAGE'] = concat_df_cols_fast(df[['dosagequantityvalue', 'dosagequantityunit']])

##### Approach 2 #####
# This approach took on average 5.92792463605 seconds
# replace nans with ''
df = df.replace(np.nan, '')
# concat value + unit
df['NORMALIZED_DOSAGE'] = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)
# put nans back
df = df.replace(r'^\s*$', np.nan, regex=True)

##### Approach 3 #####
# This approach took on average 27.7539046249 seconds
df = concat_df_cols(df, source_cols=['dosagequantityvalue', 'dosagequantityunit'],
                    target_col='NORMALIZED_DOSAGE')

UPDATE: Refactored functions:


def concat_df_cols_new(df, sep=" ", na_rep=""):
    """Concatenate columns of mixed dtypes into one Series, handling NaNs.

    Args:
        df (DataFrame): The dataframe containing only the source columns to concatenate.
        sep (str): The separator with which to concatenate the columns.
        na_rep (str): The replacement value for NaN values. Note: anything other
            than the default empty string will persist in the result after the
            concatenation.

    Returns:
        Series: The concatenated column.
    """
    df = df.replace(np.nan, na_rep)  # Replace NaNs with na_rep ('' by default)
    df = df.applymap(str)  # Convert cols to str to permit concatenation
    # Convert rows to lists, join them with the separator, trim stray
    # separators, and turn empty strings back into NaN
    arr = df.values.tolist()
    s = pd.Series([sep.join(x).strip(sep) for x in arr]).replace('^$', np.nan, regex=True)
    return s

def replace_concat_replace_new(df):
    df = df.replace(np.nan, '')
    s = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)
    s = s.replace(r'^\s*$', np.nan, regex=True)
    s = s.replace(r'\s*$', '', regex=True)  # Trim trailing whitespace
    s = s.replace(r'^\s*', '', regex=True)  # Trim leading whitespace
    return s

df['NORMALIZED_DOSAGE_CONCAT'] = concat_df_cols_new(df[['dosagequantityvalue', 'dosagequantityunit']])
# 131.98 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['NORMALIZED_DOSAGE'] = replace_concat_replace_new(df[['dosagequantityvalue', 'dosagequantityunit']])
# 395.97 ms ± 28.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Ultimately, I'll go with concat_df_cols_new simply because I can use this function on dataframes with different column names and its runtime is currently ~3x better. Unless there's a solution for those too..

  • Hmm interesting, my version is actually faster when I `%timeit`. I updated my answer with the `%timeit` results for my code vs `concat_df_cols_fast()`. – tdy Apr 06 '21 at 22:10
  • That's interesting. I got a similar result to yours when I ran it again just now (different times but replace_concat_replace is ~1.7x as fast). I believe the difference compared to the initial runtime I recorded was due to the code running faster as a function. https://stackoverflow.com/questions/11241523/why-does-python-code-run-faster-in-a-function – bekfen Apr 07 '21 at 03:06
  • Did some more refactoring and had them more or less in lockstep on small dfs. I've added the new function as concat_df_cols_new. I also had to add some logic to replace_concat_replace to strip the whitespace. Regex is very costly and results in the runtime for replace_concat_replace_new increasing dramatically on large dataframes. Ultimately, I'll go with concat_df_cols_new simply because I can use this function on dataframes with different column names and the runtime is currently ~3x better. Unless there's a solution for those too.. Appreciate the help! – bekfen Apr 07 '21 at 04:17
  • Oh nice link about function vs non-function speed. Re:stripping whitespace, I did another refactoring and got `replace_concat_replace()` to be faster again by replacing those 3 regexes with a single: `s = s.str.strip().replace('', np.nan)`. However, `concat_df_cols()` still seems like the best way to handle arbitrary columns. – tdy Apr 07 '21 at 16:29
  • Yeah, didn't know about the speed difference of code in functions! Nice update, I thought there might be a better solution to the whitespace issue but agreed, handling arbitrary columns is a nice benefit of the concat_df_cols approach. Wish there was a way to add two generic cols together using col + ' ' + col. Thanks a lot for this iterative back and forth! – bekfen Apr 07 '21 at 18:34