
I have a DataFrame with a column of integers that I would like to combine with a column of string values. Both columns are of object dtype. The problem is that these columns can also contain NaN.

The solutions I have been able to find result in different errors or undesirable outcomes.

My dataframe looks like the one below:

  dosagedurationunit dosagequantityvalue dosagequantityunit quantityvalue
0                day                 NaN                NaN           NaN
1                day                 NaN          tablet(s)           NaN
2                day                   2                NaN           NaN
3                day                   1          tablet(s)           NaN
4                day                   2          tablet(s)           NaN

Code to create the dataframe:

df = pd.DataFrame([["day",None,None,None],["day",None,"tablet(s)",None],["day",2,"tablet(s)",None],["day",1,"tablet(s)",None],["day",2,"tablet(s)",None]], columns=["dosagedurationunit","dosagequantityvalue","dosagequantityunit","quantityvalue"])

The answer below works when both columns are already strings: Combine pandas string columns with missing values

  • Converting the columns to str dtype prior to concatenation results in 'nan' strings such as "nan tablet(s)" (see the snippet after this list).
  • Using the code below raises a TypeError when one of the columns to be 'concatenated' contains integers:

df['DOSE'] = df[['dosagequantityvalue', 'dosagequantityunit']].apply(
    lambda x: None if x.isnull().all() else ' '.join(x.dropna()), axis=1)

  • TypeError: sequence item 0: expected str instance, int found
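
For reference, a minimal reproduction of the 'nan' artifact from the first bullet, run against the sample df above:

# naive str conversion: str() renders NaN as the literal string 'nan'
s = df['dosagequantityvalue'].astype(str) + ' ' + df['dosagequantityunit'].astype(str)
print(s[1])  # 'nan tablet(s)'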

Desired output dataframe:

  dosagedurationunit dosagequantityvalue dosagequantityunit quantityvalue NORMALIZED_DOSE
0                day                 NaN                NaN           NaN             NaN
1                day                 NaN          tablet(s)           NaN       tablet(s)
2                day                   2                NaN           NaN               2
3                day                   1          tablet(s)           NaN     1 tablet(s)
4                day                   2          tablet(s)           NaN     2 tablet(s)

Realistically, a NORMALIZED_DOSE of NaN or "tablet(s)" provides zero information. I could just drop all rows where dosagequantityvalue is NaN, but I don't know whether that will hold on a production/non-sample dataset. Besides, I still need a function that handles this operation gracefully.

How can I concatenate two columns (dosagequantityvalue & dosagequantityunit) into a new column (NORMALIZED_DOSE) while handling cases where there may be integers and NaN values in one or both columns?


2 Answers


Update

If I put my answer in a function replace_concat_replace():

def replace_concat_replace(df):
    df = df.replace(np.nan, '').astype(str)  # NaN -> '' so the string concat works
    s = df.dosagequantityvalue + ' ' + df.dosagequantityunit
    s = s.str.strip().replace('', np.nan)  # trim stray separators, then '' -> NaN
    return s

Then its %timeit is ~2.5x faster than concat_df_cols_new():

>>> %timeit df['NORMALIZED_DOSAGE'] = replace_concat_replace(df[['dosagequantityvalue', 'dosagequantityunit']])
41.7 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['NORMALIZED_DOSAGE'] = concat_df_cols_new(df[['dosagequantityvalue', 'dosagequantityunit']])
105 ms ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

However, concat_df_cols_new() can handle arbitrary columns.
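
For what it's worth, the replace-concat-replace idea can also be generalized to arbitrary columns without apply(). Below is a minimal sketch (replace_concat_replace_any() is a hypothetical name, not from the discussion above, and it assumes cell values never contain the separator themselves):

import re
import numpy as np
import pandas as pd

def replace_concat_replace_any(df, sep=' '):
    # NaN -> '' so the vectorized string concat works on mixed dtypes
    df = df.replace(np.nan, '').astype(str)
    # row-wise concat of the first column with all remaining columns
    s = df.iloc[:, 0].str.cat(df.iloc[:, 1:], sep=sep)
    # collapse separator runs left by empty cells, then trim the edges
    s = s.str.replace(f'{re.escape(sep)}+', sep, regex=True).str.strip(sep)
    return s.replace('', np.nan)

On the sample df, df['NORMALIZED_DOSAGE'] = replace_concat_replace_any(df[['dosagequantityvalue', 'dosagequantityunit']]) should give the same result as replace_concat_replace().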


Original answer

You can replace() every nan with an empty string '' before concatenating, then replace() them back after the concat.

Note that I used nan instead of None to match your tables.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['day', np.nan, np.nan, np.nan],
     ['day', np.nan, 'tablet(s)', np.nan],
     ['day', 2, np.nan, np.nan],
     ['day', 1, 'tablet(s)', np.nan],
     ['day', 2, 'tablet(s)', np.nan]],
    columns=['dosagedurationunit', 'dosagequantityvalue', 'dosagequantityunit', 'quantityvalue'])

# replace nans with ''
df = df.replace(np.nan, '')

# concat value + unit
df['NORMALIZED_DOSAGE'] = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)

# put nans back
df = df.replace(r'^\s*$', np.nan, regex=True)

Output:

  dosagedurationunit dosagequantityvalue dosagequantityunit quantityvalue NORMALIZED_DOSAGE
0                day                 NaN                NaN           NaN               NaN
1                day                 NaN          tablet(s)           NaN         tablet(s)
2                day                 2.0                NaN           NaN               2.0
3                day                 1.0          tablet(s)           NaN     1.0 tablet(s)
4                day                 2.0          tablet(s)           NaN     2.0 tablet(s)
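
Note the 2.0 in the last three rows: since dosagequantityvalue contains NaN, pandas stores it as float64, so astype(str) renders 2 as '2.0'. If you need the exact desired output ('2 tablet(s)'), one possible tweak, assuming the values are always whole numbers when present, is to format them as ints during the concat (a sketch, starting from the original df):

# hypothetical variant: format whole-number floats as ints so '2.0' becomes '2'
val = df['dosagequantityvalue'].map(lambda v: '' if pd.isna(v) else str(int(v)))
unit = df['dosagequantityunit'].fillna('')
df['NORMALIZED_DOSAGE'] = (val + ' ' + unit).str.strip().replace('', np.nan)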
  • Thank you tdy! This approach was very helpful in getting me where I needed to be. I ended up doing a modified approach which I will share below. – bekfen Apr 06 '21 at 00:45

While looking for an optimized solution, I ended up with a modified version of the approach in tdy's answer and the one here: Combine pandas string columns with missing values

I needed to use this code repeatedly, so I turned it into a function. Hope this helps someone else who comes across the same problem:

# functions
def concat_df_cols(df, source_cols, target_col, sep=" ", na_rep=""):
    """Concatenate columns of mixed dtypes into one column, handling NaNs.

    Args:
        df (DataFrame): The dataframe to be modified.
        source_cols (list): The columns to concatenate.
        target_col (str): The destination column for the concatenated source columns.
        sep (str): The separator with which to concatenate the columns.
        na_rep (str): The replacement value for NaN values. Note: anything other
            than the default empty string will persist in the result after the
            concatenation.

    Returns:
        DataFrame: The modified dataframe.
    """
    df = df.replace(np.nan, na_rep)  # Replace NaNs with na_rep ('' by default)
    df[source_cols] = df[source_cols].astype(str)  # Convert cols to str to permit concatenation
    df = df.replace(r'^\s*$', np.nan, regex=True)  # Put NaNs back so dropna() can skip them
    # Concat source_cols into target_col, skipping NaNs per row
    df[target_col] = df[source_cols].apply(
        lambda x: None if x.isnull().all() else sep.join(x.dropna()), axis=1)
    return df


def concat_df_cols_fast(df, sep=" ", na_rep=""):
    """Concatenate columns of mixed dtypes into one Series, handling NaNs.

    Args:
        df (DataFrame): The dataframe containing only the source columns to concatenate.
        sep (str): The separator with which to concatenate the columns.
        na_rep (str): The replacement value for NaN values. Note: anything other
            than the default empty string will persist in the result after the
            concatenation.

    Returns:
        Series: The concatenated column.
    """
    # Replace NaNs first, then convert to str (converting first would turn
    # NaN into the literal string 'nan')
    df = df.fillna(na_rep).applymap(str)
    # Convert rows to lists, join them with the separator, trim stray
    # separators, and turn empty strings back into NaN
    arr = df.values.tolist()
    s = pd.Series([sep.join(x).strip(sep) for x in arr]).replace('^$', np.nan, regex=True)
    # Replace NaN with None; note the result has a fresh RangeIndex, so it
    # aligns by position only when assigned back to a dataframe
    s = s.where(s.notnull(), None)
    return s

# setup
df = pd.DataFrame(
    [['day', np.nan, np.nan, np.nan],
     ['day', np.nan, 'tablet(s)', np.nan],
     ['day', 2, np.nan, np.nan],
     ['day', 1, 'tablet(s)', np.nan],
     ['day', 2, 'tablet(s)', np.nan]],
    columns=['dosagedurationunit', 'dosagequantityvalue', 'dosagequantityunit', 'quantityvalue'])
# Make the df 50,000 rows
df = pd.concat([df] * 10000).reset_index(drop=True)

##### Approach 1 #####
# This approach took on average 0.27553908449 seconds
df['NORMALIZED_DOSAGE'] = concat_df_cols_fast(df[['dosagequantityvalue', 'dosagequantityunit']])

##### Approach 2 #####
# This approach took on average 5.92792463605 seconds
# replace nans with ''
df = df.replace(np.nan, '')
# concat value + unit
df['NORMALIZED_DOSAGE'] = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)
# put nans back
df = df.replace(r'^\s*$', np.nan, regex=True)

##### Approach 3 #####
# This approach took on average 27.7539046249 seconds
df = concat_df_cols(df, source_cols=['dosagequantityvalue', 'dosagequantityunit'],
                    target_col='NORMALIZED_DOSAGE')

UPDATE: Refactored functions:


def concat_df_cols_new(df, sep=" ", na_rep=""):
    """Concatenate columns of mixed dtypes into one Series, handling NaNs.

    Args:
        df (DataFrame): The dataframe containing only the source columns to concatenate.
        sep (str): The separator with which to concatenate the columns.
        na_rep (str): The replacement value for NaN values. Note: anything other
            than the default empty string will persist in the result after the
            concatenation.

    Returns:
        Series: The concatenated column.
    """
    df = df.replace(np.nan, na_rep)  # Replace NaNs with na_rep ('' by default)
    df = df.applymap(str)  # Convert cols to str to permit concatenation
    # Convert rows to lists, join them with the separator, trim stray
    # separators, and turn empty strings back into NaN
    arr = df.values.tolist()
    s = pd.Series([sep.join(x).strip(sep) for x in arr]).replace('^$', np.nan, regex=True)
    return s

def replace_concat_replace_new(df):
    df = df.replace(np.nan, '')
    s = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)
    s = s.replace(r'^\s*$', np.nan, regex=True)
    s = s.replace(r'\s*$', '', regex=True)  # Trim trailing whitespace
    s = s.replace(r'^\s*', '', regex=True)  # Trim leading whitespace
    return s

df['NORMALIZED_DOSAGE_CONCAT'] = concat_df_cols_new(df[['dosagequantityvalue', 'dosagequantityunit']])
# 131.98 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['NORMALIZED_DOSAGE'] = replace_concat_replace_new(df[['dosagequantityvalue', 'dosagequantityunit']])
# 395.97 ms ± 28.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Ultimately, I'll go with concat_df_cols_new simply because I can use this function on dataframes with different column names and its runtime is currently ~3x better. Unless there's a solution for those too..

  • Hmm interesting, my version is actually faster when I `%timeit`. I updated my answer with the `%timeit` results for my code vs `concat_df_cols_fast()`. – tdy Apr 06 '21 at 22:10
  • That's interesting. I got a similar result to yours when I ran it again just now (different times but replace_concat_replace is ~1.7x as fast). I believe the difference compared to the initial runtime I recorded was due to the code running faster as a function. https://stackoverflow.com/questions/11241523/why-does-python-code-run-faster-in-a-function – bekfen Apr 07 '21 at 03:06
  • Did some more refactoring and had them more or less in lockstep on small dfs. I've added the new function as concat_df_cols_new. I also had to add some logic to replace_concat_replace to strip the whitespace. Regex is very costly and results in the runtime for replace_concat_replace_new increasing dramatically on large dataframes. Ultimately, I'll go with concat_df_cols_new simply because I can use this function on dataframes with different column names and the runtime is currently ~3x better. Unless there's a solution for those too.. Appreciate the help! – bekfen Apr 07 '21 at 04:17
  • Oh nice link about function vs non-function speed. Re:stripping whitespace, I did another refactoring and got `replace_concat_replace()` to be faster again by replacing those 3 regexes with a single: `s = s.str.strip().replace('', np.nan)`. However, `concat_df_cols()` still seems like the best way to handle arbitrary columns. – tdy Apr 07 '21 at 16:29
  • Yeah, didn't know about the speed difference of code in functions! Nice update, I thought there might be a better solution to the whitespace issue but agreed, handling arbitrary columns is a nice benefit of the concat_df_cols approach. Wish there was a way to add two generic cols together using col + ' ' + col. Thanks a lot for this iterative back and forth! – bekfen Apr 07 '21 at 18:34