2

I have a data frame that I pulled from an SQL Server. The data got parsed incorrectly when converted to .csv and now I have columns that contain the wrong data. I'm trying to move everything back into place using pandas. Specifically, I have a field that should contain a "short description". Some of the descriptions were delimited into separate fields, and I'd like to concatenate them all into the appropriate field. The problem is, some of the fields contain dates that rightly belong there, and I need to skip over them when concatenating.

I've tried to use df.apply() in multiple different ways, but I can't seem to "skip over" the values that contain pd.Timestamp data types.

For example:

df_test.apply(lambda x: ' '.join(x) if type(x) != pd.Timestamp else '')

Example df:

df_so_test = pd.DataFrame([[1, 2, 'some description', pd.to_datetime('2019-01-01'), 'some more text', '']
                          , [2, 3, 'another description', 'some other text', '', pd.to_datetime('2019-01-02')]
                          , [3, 4, 'a third descirption', '', pd.to_datetime('2019-01-03'), pd.to_datetime('2019-01-04')]]
                          , columns=['random_col_1','random_col_2', 'short_desc', 'date_1', 'date_2', 'random_col_3'])

Expected output:

df_expected = pd.DataFrame([[1, 2, 'some description some more text', pd.to_datetime('2019-01-01'), '', '']
                          , [2, 3, 'another description some other text', pd.to_datetime('2019-01-02'), '', '']
                          , [3, 4, 'a third descirption', pd.to_datetime('2019-01-03'), pd.to_datetime('2019-01-04'), '']]
                          , columns=['random_col_1','random_col_2', 'short_desc', 'date_1', 'date_2', 'random_col_3'])
Akaisteph7
  • 5,034
  • 2
  • 20
  • 43
Brian
  • 35
  • 6

2 Answers2

2

Here an example using apply. The assumptions I need to make:

  1. I assume that the only column with strings object is 'short_desc', otherwise is much more difficult to understand which text goes in the 'short_desc' and which not, since I don't see a regular pattern in your misaligned data.

  2. I also assume that you have the two dates only to be shifted in place, when needed, and that your 'random_col_3' is generated by the wrong reading, so I'm going to drop it in the end.

You may need to fix the names of your real columns, if they do not correspond to the posted example.

def fixdb(row):
    found = [x for x in row if isinstance(x, str)]
    if len(found) > 1:
        row['short_desc'] = ' '.join(found)
        dates = [x for x in row if isinstance(x, pd.Timestamp)]

        try:
            row['date_1'] = dates[0]
        except IndexError:
            row['date_1'] = np.nan

        try:
            row['date_2'] = dates[1]
        except IndexError:
            row['date_2'] = np.nan

    return row

df_out = df_so_test.apply(fixdb, axis=1).drop('random_col_3', axis=1)

This is df_out using the provided output:

   random_col_1  random_col_2                            short_desc     date_1     date_2
0             1             2       some description some more text 2019-01-01        NaT
1             2             3  another description some other text  2019-01-02        NaT
2             3             4                  a third descirption  2019-01-03 2019-01-04
Valentino
  • 7,291
  • 6
  • 18
  • 34
2

Here is one way to do it:

def f(y):
    desc = ['' if pd.isnull(x) else x if type(x)!=pd.Timestamp else '' for x in y]
    return desc

res = df_so_test[df_test.columns[2:]].apply(f)
res["new"] = res["short_desc"]+" "+res["date_1"]+" "+res["date_2"]+" "+res["random_col_3"]
df_so_test["short_desc"]= res["new"].apply(lambda x: re.sub("\s+", " ", x))

def f(y):
    times = ["" if pd.isnull(x) else x if type(x)==pd.Timestamp else '' for x in y]
    return times

res = (df_so_test[df_so_test.columns[3:]].apply(f)).to_numpy()
times = [[x for x in y if not np.isnat(x)] for y in res]
[a.extend([''] * (3 - len(a))) for a in times]

df_expected = df_test.copy()
df_expected[df_expected.columns[-3:]] = times

print(df_expected)

Output:

   random_col_1  random_col_2                            short_desc     date_1     date_2 random_col_3
0             1             2      some description some more text  2019-01-01        NaT
1             2             3  another description some other text  2019-01-02        NaT
2             3             4                  a third descirption  2019-01-03 2019-01-04
Akaisteph7
  • 5,034
  • 2
  • 20
  • 43