1

I am a Python noob. I have an unstructured text file that I'm trying to capture to a dataframe and export to excel. I need to merge 38 to 36, 45 to 43, and 79 to 78 filling in the empty space with the data on the merging column.

Dummy Dataset

0 5 36 38 43 45 78 79
1 A 01JUN2022 1.2 B 1.2
2 C 01JUN2022 1.4 D 1.4
3 E 01JUN2022 1.5 F 1.6
4 G 01JUN2022 1.7 H 1.7
5 I 01JUN2022 1.4 J 1.8
6 K 01JUN2022 1.7 L 1.3
1 A 01JUN2022 1.2 B 1.2
2 C 01JUN2022 1.4 D 1.4
3 E 01JUN2022 1.5 F 1.6
4 G 01JUN2022 1.7 H 1.7
5 I 01JUN2022 1.4 J 1.8
6 K 01JUN2022 1.7 L 1.3

Required output

0 5 36 43 79
1 A 01JUN2022 1.2 B 1.2
2 C 01JUN2022 1.4 D 1.4
3 E 01JUN2022 1.5 F 1.6
4 G 01JUN2022 1.7 H 1.7
5 I 01JUN2022 1.4 J 1.8
6 K 01JUN2022 1.7 L 1.3
1 A 01JUN2022 1.2 B 1.2
2 C 01JUN2022 1.4 D 1.4
3 E 01JUN2022 1.5 F 1.6
4 G 01JUN2022 1.7 H 1.7
5 I 01JUN2022 1.4 J 1.8
6 K 01JUN2022 1.7 L 1.3
Dhakshika
  • 49
  • 7
  • Does this answer your question? [How to remove nan value while combining two column in Panda Data frame?](https://stackoverflow.com/questions/34989341/how-to-remove-nan-value-while-combining-two-column-in-panda-data-frame) – T C Molenaar Oct 05 '22 at 09:19
  • I tried all the answers in that but I did not get the required output – Dhakshika Oct 05 '22 at 09:42
  • Is it guaranteed that only 1 of the 2 columns being merged will have a value present? i.e is it possible for both columns to hold a value for the same row? – Abirbhav G. Oct 05 '22 at 10:45
  • Yes only one column will have the data. – Dhakshika Oct 05 '22 at 11:00

1 Answers1

0

Would start by converting '' to NaN as follows

df = df.replace(r'^\s*$', np.nan, regex=True)

Then one can use pandas.Series.combine_first

df['36'] = df['36'].combine_first(df['38'])
df['43'] = df['43'].combine_first(df['45'])
df['79'] = df['79'].combine_first(df['78'])

[Out]:
    id  0          5   36   38 43   45   78   79
0    1  A  01JUN2022  1.2  1.2  B    B  NaN  1.2
1    2  C  01JUN2022  1.4  1.4  D    D  NaN  1.4
2    3  E  01JUN2022  1.5  NaN  F  NaN  1.6  1.6
3    4  G  01JUN2022  1.7  NaN  H  NaN  1.7  1.7
4    5  I  01JUN2022  1.4  NaN  J  NaN  1.8  1.8
5    6  K  01JUN2022  1.7  NaN  L  NaN  1.3  1.3
6    1  A  01JUN2022  1.2  1.2  B    B  NaN  1.2
7    2  C  01JUN2022  1.4  1.4  D  NaN  1.4  1.4
8    3  E  01JUN2022  1.5  1.5  F  NaN  1.6  1.6
9    4  G  01JUN2022  1.7  NaN  H  NaN  1.7  1.7
10   5  I  01JUN2022  1.4  NaN  J    J  NaN  1.8
11   6  K  01JUN2022  1.7  NaN  L  NaN  NaN  1.3

Finally, one can drop the columns that one doesn't want or select the one's to display as follows

df = df[['0', '5', '36', '43', '79']]

[Out]:

    0          5   36 43   79
0   A  01JUN2022  1.2  B  1.2
1   C  01JUN2022  1.4  D  1.4
2   E  01JUN2022  1.5  F  1.6
3   G  01JUN2022  1.7  H  1.7
4   I  01JUN2022  1.4  J  1.8
5   K  01JUN2022  1.7  L  1.3
6   A  01JUN2022  1.2  B  1.2
7   C  01JUN2022  1.4  D  1.4
8   E  01JUN2022  1.5  F  1.6
9   G  01JUN2022  1.7  H  1.7
10  I  01JUN2022  1.4  J  1.8
11  K  01JUN2022  1.7  L  1.3

and this gives the desired output.


Notes:

Gonçalo Peres
  • 11,752
  • 3
  • 54
  • 83