Concatenating 3+ columns only if their values are not of a certain data type

Question

I have a data frame that I pulled from an SQL Server. The data got parsed incorrectly when converted to .csv and now I have columns that contain the wrong data. I'm trying to move everything back into place using pandas. Specifically, I have a field that should contain a "short description". Some of the descriptions were delimited into separate fields, and I'd like to concatenate them all into the appropriate field. The problem is, some of the fields contain dates that rightly belong there, and I need to skip over them when concatenating.

I've tried to use df.apply() in multiple different ways, but I can't seem to "skip over" the values that contain pd.Timestamp data types.

For example:

df_test.apply(lambda x: ' '.join(x) if type(x) != pd.Timestamp else '')

Example df:

df_so_test = pd.DataFrame([[1, 2, 'some description', pd.to_datetime('2019-01-01'), 'some more text', '']
                          , [2, 3, 'another description', 'some other text', '', pd.to_datetime('2019-01-02')]
                          , [3, 4, 'a third descirption', '', pd.to_datetime('2019-01-03'), pd.to_datetime('2019-01-04')]]
                          , columns=['random_col_1','random_col_2', 'short_desc', 'date_1', 'date_2', 'random_col_3'])

Expected output:

df_expected = pd.DataFrame([[1, 2, 'some description some more text', pd.to_datetime('2019-01-01'), '', '']
                          , [2, 3, 'another description some other text', pd.to_datetime('2019-01-02'), '', '']
                          , [3, 4, 'a third descirption', pd.to_datetime('2019-01-03'), pd.to_datetime('2019-01-04'), '']]
                          , columns=['random_col_1','random_col_2', 'short_desc', 'date_1', 'date_2', 'random_col_3'])

Tried, but couldn't find the best way to post a table. Suggestions? — Brian, Aug 02 '19 at 19:48
here it is : https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — anky, Aug 02 '19 at 19:52
After looking at you test, and expected df, I believe it would be a lot easier just to change the way you import the data into a df. Maybe pull directly from sql_server using pd.read_sql instead of through csv? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html — Ben Pap, Aug 02 '19 at 21:47
@BenPap This is the right answer! Thanks for simplifying things for me, haha. I was thinking too hard about it, I guess. — Brian, Aug 05 '19 at 13:53

score 2 · Accepted Answer · answered Aug 02 '19 at 21:49

Here an example using apply. The assumptions I need to make:

I assume that the only column with strings object is 'short_desc', otherwise is much more difficult to understand which text goes in the 'short_desc' and which not, since I don't see a regular pattern in your misaligned data.
I also assume that you have the two dates only to be shifted in place, when needed, and that your 'random_col_3' is generated by the wrong reading, so I'm going to drop it in the end.

You may need to fix the names of your real columns, if they do not correspond to the posted example.

def fixdb(row):
    found = [x for x in row if isinstance(x, str)]
    if len(found) > 1:
        row['short_desc'] = ' '.join(found)
        dates = [x for x in row if isinstance(x, pd.Timestamp)]

        try:
            row['date_1'] = dates[0]
        except IndexError:
            row['date_1'] = np.nan

        try:
            row['date_2'] = dates[1]
        except IndexError:
            row['date_2'] = np.nan

    return row

df_out = df_so_test.apply(fixdb, axis=1).drop('random_col_3', axis=1)

This is df_out using the provided output:

   random_col_1  random_col_2                            short_desc     date_1     date_2
0             1             2       some description some more text 2019-01-01        NaT
1             2             3  another description some other text  2019-01-02        NaT
2             3             4                  a third descirption  2019-01-03 2019-01-04

Thanks! This is just what I was trying to do. – Brian Aug 05 '19 at 13:54 — Brian, Aug 05 '19 at 13:54

score 2 · Answer 2 · answered Aug 03 '19 at 20:18

Here is one way to do it:

def f(y):
    desc = ['' if pd.isnull(x) else x if type(x)!=pd.Timestamp else '' for x in y]
    return desc

res = df_so_test[df_test.columns[2:]].apply(f)
res["new"] = res["short_desc"]+" "+res["date_1"]+" "+res["date_2"]+" "+res["random_col_3"]
df_so_test["short_desc"]= res["new"].apply(lambda x: re.sub("\s+", " ", x))

def f(y):
    times = ["" if pd.isnull(x) else x if type(x)==pd.Timestamp else '' for x in y]
    return times

res = (df_so_test[df_so_test.columns[3:]].apply(f)).to_numpy()
times = [[x for x in y if not np.isnat(x)] for y in res]
[a.extend([''] * (3 - len(a))) for a in times]

df_expected = df_test.copy()
df_expected[df_expected.columns[-3:]] = times

print(df_expected)

Output:

   random_col_1  random_col_2                            short_desc     date_1     date_2 random_col_3
0             1             2      some description some more text  2019-01-01        NaT
1             2             3  another description some other text  2019-01-02        NaT
2             3             4                  a third descirption  2019-01-03 2019-01-04

Concatenating 3+ columns only if their values are not of a certain data type

2 Answers2