Background

I have the following sample df:

import pandas as pd

df = pd.DataFrame({
    'Text': ['\n[STUFF]\nBut the here is \n\nBase ID : 00000 Date is Here \nfollow\n',
             '\n[OTHER]\n\n\nFound Tom Dub \nhere\n  BATH # : E12-34567 MR # 000',
             '\n[ANY]\nJane Ja So so \nBase ID : 11111 Date\n\n\n hey the \n\n  \n    \n\n\n'],
    'Alt_Text': ['[STUFF]But the here is Base ID : *A* Date is Here follow',
                 '[OTHER]Found *B* *B* here BATH # : *A* MR # *C*',
                 '[ANY]*B* *B*So so Base ID : *A* Date hey the '],
    'ID': [1, 2, 3],
})

Goal

Create a new column New_Text that keeps the original line breaks (\n) from the Text column but takes its content from the Alt_Text column.

Example

Text Column, Row 0:

\n[STUFF]\nBut the here is \n\nBase ID : 00000 Date is Here \nfollow\n  

Alt_Text Column, Row 0:

[STUFF]But the here is Base ID : *A* Date is Here follow

I would like New_Text, Row 0, to be:

\n[STUFF]\nBut the here is \n\nBase ID : *A*  Date is Here \nfollow\n   

Desired Output

   Text Alt_Text ID New_Text 
0                   \n[STUFF]\nBut the here is \n\nBase ID :  *A*  Date is Here \nfollow\n  
1                   \n[OTHER]\n\n\nFound *B* *B*  \nhere\n BATH # : *A*  MR # *C*   
2                   \n[ANY]\nJ*B* *B* So so \nBase ID : *A*  Date\n\n\n hey the \n\n \n \n\n\n

Tried

I have looked around SO, including "Wrap multiline string (preserving existing linebreaks) in Python?" and "Read Excel data using Pandas and retaining the line break of a cell value", among many others, but none of them does what I am trying to do.

Question

How do I achieve my desired output?

SFC

1 Answer

We split both Text and Alt_Text with re.split, using capturing parentheses in the pattern. As the re.split documentation says:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
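To illustrate the quoted behaviour, here is a small sketch using the same pattern on a fragment of the first Text value. The variable names token and parts are only for illustration. Separators land at the even positions of the result and the captured tokens at the odd positions.

```python
import re

# Same pattern as in the answer: a "token" is a run of characters that are
# not a space, a newline, or a square bracket. The parentheses make
# re.split return the tokens as well as the separators between them.
token = re.compile(r'([^ \n\[\]]+)')

parts = token.split('\n[STUFF]\nBut the')
print(parts)
# ['\n[', 'STUFF', ']\n', 'But', ' ', 'the', '']

print(parts[0::2])  # separators: ['\n[', ']\n', ' ', '']
print(parts[1::2])  # tokens:     ['STUFF', 'But', 'the']
```

Because the line breaks always end up in the separators, they can be swapped independently of the tokens.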

Then we zip both lists, taking the separators that contain line breaks from Text and everything else from Alt_Text, and join the result into New_Text:

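import re

def insert_line_breaks(text, alt_text):
    # Capture every token (a run of characters other than space, newline or brackets);
    # the separators between tokens sit at the even positions of each split list.
    regex = re.compile(r'([^ \n\[\]]+)')
    text = regex.split(text)
    alt_text = regex.split(alt_text)
    # Keep separators containing a line break from Text; take everything else from Alt_Text.
    return ''.join([t if '\n' in t else a for t, a in zip(text, alt_text)])

df['New_Text'] = df.apply(lambda r: insert_line_breaks(r.Text, r.Alt_Text), axis=1)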

I assume there should be a space between the second *B* and So in the last row of Alt_Text, and that the J before the first *B* in the desired output is just a typo. With the space added, both columns split into the same number of tokens and we get:

>>> df.New_Text
0            \n[STUFF]\nBut the here is \n\nBase ID : *A* Date is Here \nfollow\n
1                    \n[OTHER]\n\n\nFound *B* *B* \nhere\n  BATH # : *A* MR # *C*
2    \n[ANY]\n*B* *B* So so \nBase ID : *A* Date\n\n\n hey the \n\n  \n    \n\n\n
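Note that zip stops at the shorter list, so a row whose Text and Alt_Text split into different numbers of tokens (such as the last row with the space missing) would be merged incorrectly without any warning. A minimal sketch of a stricter variant follows. The name insert_line_breaks_checked is hypothetical and not part of the answer above; it only adds a length check before joining.

```python
import re

TOKEN = re.compile(r'([^ \n\[\]]+)')

def insert_line_breaks_checked(text, alt_text):
    """Hypothetical variant: raise instead of silently misaligning tokens."""
    t_parts = TOKEN.split(text)
    a_parts = TOKEN.split(alt_text)
    # A split with n tokens has 2n + 1 parts, so len // 2 is the token count.
    if len(t_parts) != len(a_parts):
        raise ValueError(
            f'token count differs: {len(t_parts) // 2} in Text, '
            f'{len(a_parts) // 2} in Alt_Text'
        )
    return ''.join(t if '\n' in t else a for t, a in zip(t_parts, a_parts))

text = '\n[ANY]\nJane Ja So so \nBase ID : 11111 Date\n\n\n hey the \n\n  \n    \n\n\n'
fixed = '[ANY]*B* *B* So so Base ID : *A* Date hey the '   # space added before "So"
print(repr(insert_line_breaks_checked(text, fixed)))
```

With the original third Alt_Text value (no space before So), this variant raises a ValueError, which makes it easy to find rows that need fixing before applying the function to the whole frame.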
Stef