How to join newlines into a paragraph in Python

Question

I have some text in the following format:

\r\n
1. \r\n
par1 par1 par1 \r\n
\r\n
par1 par1 par1 \r\n
\r\n
2. \r\n
\r\n 
par2 par2 par2

What I want to do is to join them into paragraphs so that the end result would be:

1. par1 par1 par1 par1 par1 par1 \n
2. par2 par2 par2 \n

I have tried multiple string manipulations such as str.split() and str.strip(), and searched the internet for solutions, but nothing seems to work.

Is there an easy way to do this programmatically? The text is very long, so doing it by hand is out of the question.

How do you decide which newlines to keep? Do you only want the newlines that come before a digit? – Håken Lid, Oct 09 '18 at 09:05
I want to separate the text into paragraphs so that each paragraph begins with a number followed by a dot. However, matching a number and a dot is a problem, since there are other matches inside the actual text that are not paragraph markers. – srb, Oct 09 '18 at 09:18
You should find a rule that works everywhere, otherwise it's almost impossible to do that. Maybe matching a number preceded by a `'\n'` character? – toti08, Oct 09 '18 at 09:26

score 2 · Answer 1 · answered Oct 09 '18 at 09:13

Assuming your input text is stored in variable s, you can use the following generator expression with regex:

import re
print('\n'.join(re.sub(r'\s+', ' ', ''.join(t)).strip() for t in re.findall(r'^(\d+\.)(.*?)(?=^\d+\.|\Z)', s, flags=re.MULTILINE | re.DOTALL)))

This outputs:

1. par1 par1 par1 par1 par1 par1
2. par2 par2 par2
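
The one-liner is dense, so here is a step-by-step sketch of the same idea. It assumes the input looks like the sample in the question; the function name `join_paragraphs` is made up for illustration.

```python
import re

def join_paragraphs(s):
    """Return one line per numbered paragraph found in s."""
    # Each match is a (marker, body) tuple: marker is e.g. "1.", body is
    # everything up to the next line that starts with digits and a dot,
    # or up to the end of the string.
    chunks = re.findall(
        r'^(\d+\.)(.*?)(?=^\d+\.|\Z)',
        s,
        flags=re.MULTILINE | re.DOTALL,
    )
    paragraphs = []
    for marker, body in chunks:
        # Collapse every run of whitespace (including \r\n) to one space.
        paragraphs.append(re.sub(r'\s+', ' ', marker + body).strip())
    return '\n'.join(paragraphs)

# Sample text modelled on the question (an assumption about the real data).
s = "\r\n1. \r\npar1 par1 par1 \r\n\r\npar1 par1 par1 \r\n\r\n2. \r\n\r\n \r\npar2 par2 par2"
print(join_paragraphs(s))
```

Note that any line inside a paragraph that happens to begin with digits and a dot would still start a new paragraph with this pattern.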

score 1 · Answer 2 · answered Oct 09 '18 at 09:31

Here is a slightly different approach using replace and re.

import re
# assuming d is the string you want to parse
d = """
\r\n
1. \r\n
par1 par1 par1 \r\n
\r\n
par1 par1 par1 \r\n
\r\n
2. \r\n
\r\n 
par2 par2 par2
"""

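# Collapse all whitespace (including \r and \n) to single spaces, so that
# words on adjacent lines are not glued together.
d = re.sub(r'\s+', ' ', d)
# Put each "<number>. " marker at the start of its own line.
d = re.sub(r'\s*([0-9]+\.\s)\s*', r'\n\1', d).strip()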
print(d)

score 0 · Answer 3 · answered Oct 09 '18 at 09:05

I used a regex to find all the words in the string, then rejoined them depending on whether each element is an integer. Hope this helps.

import re

line1 = '''\r\n
1. \r\n
par1 par1 par1 \r\n
\r\n
par1 par1 par1 \r\n
\r\n
2. \r\n
\r\n 
par2 par2 par2'''

line2 = re.findall(r"[\w']+", line1)

op = ""

def isInt(item):
    try:
        int(item)
        return True
    except ValueError:
        return False

for item in line2:
    if isInt(item):
        op += "\n" + item + ". "

    else:
        op += item + " "

print(op)

Output:

1. par1 par1 par1 par1 par1 par1 
2. par2 par2 par2

Be wary of the extra \n in front of 1.

answered Oct 09 '18 at 09:05

Vineeth Sai

I tried this and it works well for the example I gave. Unfortunately, the actual text has numbers inside it, so it doesn't split correctly. – srb Oct 09 '18 at 09:17
Oh okay. Didn't take that into consideration :) – Vineeth Sai Oct 09 '18 at 09:25
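
One possible way around numbers inside the text is sketched below. It assumes, as in the sample from the question, that each paragraph marker sits alone on its own line, while numbers in the body always share a line with other text. That rule may not hold for the real data, and `join_numbered` and `MARKER` are made-up names.

```python
import re

# A marker is a line that contains nothing but digits, a dot and optional
# surrounding spaces or tabs (an assumption based on the sample text).
MARKER = re.compile(r'[ \t]*(\d+\.)[ \t]*')

def join_numbered(text):
    paragraphs = []
    for line in text.splitlines():  # splitlines() treats \r\n as one break
        m = MARKER.fullmatch(line)
        if m:
            # Start a new paragraph with the marker as its first word.
            paragraphs.append([m.group(1)])
        elif paragraphs and line.strip():
            # Non-empty body line: add its words to the current paragraph.
            paragraphs[-1].extend(line.split())
        # Blank lines, and any text before the first marker, are dropped.
    return '\n'.join(' '.join(words) for words in paragraphs)

# Hypothetical sample with numbers inside the body text.
sample = ("\r\n1. \r\nIn 2018. par1 par1 \r\n\r\n"
          "3. is not alone on this line \r\n\r\n2. \r\n\r\n \r\npar2 par2")
print(join_numbered(sample))
```

Here "2018." and "3." stay inside paragraph 1 because neither occupies a whole line by itself.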
