
I have some text in the following format:

\r\n
1. \r\n
par1 par1 par1 \r\n
\r\n
par1 par1 par1 \r\n
\r\n
2. \r\n
\r\n 
par2 par2 par2

What I want to do is to join them into paragraphs so that the end result would be:

1. par1 par1 par1 par1 par1 par1 \n
2. par2 par2 par2 \n

I have tried multiple string manipulations such as str.split(), str.strip() and others, as well as searching the internet for solutions, but nothing seems to work.

Is there an easy way to do this programmatically? The text is very long, so doing it by hand is out of the question.

srb
  • How do you decide which newlines to keep? Do you only want the newlines that come before a digit? – Håken Lid Oct 09 '18 at 09:05
  • I want to separate the text into paragraphs so that each paragraph begins with a number followed by a dot. However, matching a number and a dot causes problems, since there are other matches for that pattern inside the actual text that are not paragraph markers. – srb Oct 09 '18 at 09:18
  • You should find a rule that works everywhere, otherwise it's almost impossible to do that. Maybe matching a number preceded by a `'\n'` character? – toti08 Oct 09 '18 at 09:26
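
One possible rule along the lines of these comments, sketched under the assumption (true for the sample in the question, not confirmed for the full text) that a paragraph marker always sits alone on its own line. The function name `join_numbered` and the sample string are hypothetical:

```python
import re

def join_numbered(text):
    """Join lines into paragraphs; a paragraph starts at a line holding only '<digits>.'."""
    paragraphs = []
    for line in text.splitlines():      # splitlines() treats \r\n as one line break
        line = line.strip()
        if not line:
            continue                    # skip blank lines
        if re.fullmatch(r'\d+\.', line):
            paragraphs.append([line])   # a marker alone on its line opens a paragraph
        elif paragraphs:
            paragraphs[-1].append(line) # text before the first marker is dropped
    return '\n'.join(' '.join(p) for p in paragraphs)

# Hypothetical sample; "3." inside the running text is not treated as a marker.
sample = "\r\n1. \r\npar1 par1 par1 \r\n\r\npar1 par1 par1 \r\n\r\n2. \r\n\r\n par2 par2, see 3. below"
result = join_numbered(sample)
print(result)
```

Because only whole-line markers count, a number and a dot inside the running text does not start a new paragraph. If the real text sometimes puts the marker and the first words on the same line, this rule would need to be loosened.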

3 Answers


Assuming your input text is stored in variable s, you can use the following generator expression with regex:

import re
print('\n'.join(re.sub(r'\s+', ' ', ''.join(t)).strip() for t in re.findall(r'^(\d+\.)(.*?)(?=^\d+\.|\Z)', s, flags=re.MULTILINE | re.DOTALL)))

This outputs:

1. par1 par1 par1 par1 par1 par1
2. par2 par2 par2
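
For readability, the same idea can be unpacked into named steps. This is only a sketch: the sample string `s`, the name `PARAGRAPH` and the helper `join_paragraphs` are made up for illustration, and real CRLF line endings are assumed.

```python
import re

# Hypothetical sample input with real CRLF line endings.
s = "\r\n1. \r\npar1 par1 par1 \r\n\r\npar1 par1 par1 \r\n\r\n2. \r\n\r\n par2 par2 par2"

# A paragraph starts with digits and a dot at the beginning of a line and
# runs (lazily) up to the next such marker or the end of the string.
PARAGRAPH = re.compile(r'^(\d+\.)(.*?)(?=^\d+\.|\Z)', re.MULTILINE | re.DOTALL)

def join_paragraphs(text):
    paragraphs = []
    for number, body in PARAGRAPH.findall(text):
        # Collapse every whitespace run (including \r\n) into one space.
        paragraphs.append(re.sub(r'\s+', ' ', number + body).strip())
    return '\n'.join(paragraphs)

result = join_paragraphs(s)
print(result)
```

Note that any line inside a paragraph that happens to begin with digits and a dot would also be treated as a new paragraph start.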
blhsing

Here is a slightly different approach using str.replace and re.sub.

import re
# assuming d is the string you want to parse
d = """
\r\n
1. \r\n
par1 par1 par1 \r\n
\r\n
par1 par1 par1 \r\n
\r\n
2. \r\n
\r\n 
par2 par2 par2
"""

d = d.replace("\r", "").replace("\n", "")
d = re.sub(r'([0-9]+\.\s)\s*',r'\n\1', d).strip()
print(d)
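
One caveat: deleting the newlines outright only works because every line in the sample ends with a space. A possible variant, assuming lines may lack that trailing space, collapses each whitespace run into a single space instead. The input below and the names `flat` and `joined` are hypothetical:

```python
import re

# Hypothetical input where lines do NOT end with a trailing space.
d = "\r\n1.\r\npar1 par1\r\npar1 par1\r\n\r\n2.\r\n\r\npar2 par2"

# Collapse each whitespace run to a single space instead of deleting newlines,
# so words on adjacent lines are not glued together.
flat = re.sub(r'\s+', ' ', d).strip()

# Start a new line before each "<digits>. " marker that follows a space.
joined = re.sub(r' (\d+\. )', r'\n\1', flat)
print(joined)
```

Like the original, this still breaks at any number followed by a dot and a space, including ones inside the running text.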
Khanal

I used a regex to find all the words in the string, then rejoined them depending on whether each element of the list is an integer. Hope this helps.

import re

line1 = '''\r\n
1. \r\n
par1 par1 par1 \r\n
\r\n
par1 par1 par1 \r\n
\r\n
2. \r\n
\r\n 
par2 par2 par2'''

line2 = re.findall(r"[\w']+", line1)

op = ""

def isInt(item):
    try:
        int(item)
        return True
    except ValueError:
        return False

for item in line2:
    if isInt(item):
        op += "\n" + item + ". "

    else:
        op += item + " "

print(op)

Output:

1. par1 par1 par1 par1 par1 par1 
2. par2 par2 par2 

Be wary of the extra \n at the start of the output, in front of the 1.
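
If the leading \n and the trailing spaces are a problem, one option is to collect the words per paragraph and join them at the end. This is a sketch rather than a drop-in replacement: `str.isdigit` stands in for the isInt helper (a close but not identical test), and the sample string is assumed.

```python
import re

# Assumed sample input, equivalent to the one above.
line1 = "\r\n1. \r\npar1 par1 par1 \r\n\r\npar1 par1 par1 \r\n\r\n2. \r\n\r\n par2 par2 par2"

paragraphs = []
for token in re.findall(r"[\w']+", line1):
    if token.isdigit():
        paragraphs.append([token + "."])   # a bare number opens a new paragraph
    elif paragraphs:
        paragraphs[-1].append(token)       # words before the first number are dropped

# Joining at the end avoids the leading newline and the trailing spaces.
op = "\n".join(" ".join(words) for words in paragraphs)
print(op)
```

As with the loop above, punctuation other than apostrophes is discarded, and any bare number in the text starts a new paragraph.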

Vineeth Sai