I have the following string:

It reported the proportion of the edits made from America was 51% for the Wikipedia, and 25% for the simple Wikipedia.[142] The Wikimedia Foundation hopes to increase the number in the Global South to 37% by 2015.[143]

I am trying to replace every occurrence of .[xxx] with .[xxx] followed by a newline (\n), where each x is a digit.

I am drawing on several Stack Overflow answers, including:

Python insert a line break in a string after character "X"

Regex: match fullstop and one word in python

import re
str = "It reported the proportion of the edits made from America was 51% 
for the Wikipedia, and 25% for the simple Wikipedia.[142] The Wikimedia 
Foundation hopes to increase the number in the Global South to 37% by 
2015.[143] "
x = re.sub("\.\[[0-9]{2,5}\]\s", "\.\[[0-9]{2,5}\]\s\n",str)
print(x)

I expect the following output:

It reported the proportion of the edits made from America was 51% for the Wikipedia, and 25% for the simple Wikipedia.[142]
The Wikimedia Foundation hopes to increase the number in the Global South to 37% by 2015.[143]

But I am getting:

It reported the proportion of the edits made from America was 51% for the Wikipedia, and 25% for the simple Wikipedia\\.\[[0-9]{2,5}\]\s   The Wikimedia Foundation hopes to increase the number in the Global South to 37% by 2015\\.\[[0-9]{2,5}\]\s
sahasrara62
Noor

3 Answers

1

You probably want to use a capturing group and a backreference in re.sub. The replacement string is not a regular expression, so you don't need to escape it (regex101):

import re
s = '''It reported the proportion of the edits made from America was 51% for the Wikipedia, and 25% for the simple Wikipedia.[142] The Wikimedia Foundation hopes to increase the number in the Global South to 37% by 2015.[143] '''
x = re.sub(r'\.\[([0-9]{2,5})\]\s', r'.[\1] \n', s)
print(x)

Prints:

It reported the proportion of the edits made from America was 51% for the Wikipedia, and 25% for the simple Wikipedia.[142] 
The Wikimedia Foundation hopes to increase the number in the Global South to 37% by 2015.[143] 
Andrej Kesely
  • What exactly is r'.[\1] \n' doing? Please explain. –  Jul 05 '19 at 07:20
  • @AlmightyHeathcliff `\1` is a reference to the first capturing group, in this case `([0-9]{2,5})` – Andrej Kesely Jul 05 '19 at 07:33
  • Thank you. Why does the code add a new line after the word "Wikimedia"? I would like to get this: The Wikimedia Foundation hopes to increase the number in the Global South to 37% by 2015.[143] – Noor Jul 05 '19 at 08:04
  • @Noor Because of the formatting of the input data. I updated my answer. – Andrej Kesely Jul 05 '19 at 08:13
  • @AndrejKesely, I used the regular expression that \1 references as the replacement, and it printed the regular expression itself. Aren't these two supposed to be the same? –  Jul 05 '19 at 08:47
  • @AlmightyHeathcliff No, it's not the same when you use a regular expression in the replacement string. You need to use the reference, or you could pass a function to `re.sub` as well (depending on your case). – Andrej Kesely Jul 05 '19 at 09:11
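To illustrate the last comment: the replacement argument of `re.sub` is a template, not a pattern, so regex syntax placed there is either inserted literally or, on recent Python versions, rejected as a bad escape (for example `\s`). The sketch below shows two ways to reuse the matched text without a numbered group: `\g<0>`, which refers to the whole match, and a replacement function. The shortened `text` and the helper name `add_newline` are illustrative assumptions, not part of the original question.

```python
import re

# Shortened sample input (an assumption for illustration).
text = "Wikipedia.[142] The Foundation aims for 37% by 2015.[143] "

# No capturing group here: the whole match is available as group 0.
pattern = re.compile(r"\.\[[0-9]{2,5}\]\s")

# Option 1: \g<0> in the replacement template stands for the entire match.
with_backref = pattern.sub(r"\g<0>\n", text)

# Option 2: a function receives each match object and returns the new text.
def add_newline(match):
    return match.group(0) + "\n"

with_function = pattern.sub(add_newline, text)

print(with_backref)
```

Both options keep the matched whitespace character and append the newline after it, which mirrors the behaviour of `r'.[\1] \n'` in the answer above.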
1

You may use

(\.\[[^][]*\])\s*

and replace each match with \1\n (see a demo on regex101.com).


The pattern reads:
(
    \.\[   # ".[" literally
    [^][]* # any character except "[" or "]", zero or more times
    \]     # "]" literally
)\s*       # consume any trailing whitespace (possibly none)
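As a minimal sketch, the pattern above could be applied from Python as follows. The shortened `text` and the name `bracket_ref` are assumptions made for illustration. Note that, unlike the question's expected output, this variant drops the whitespace after the bracket instead of keeping it before the newline.

```python
import re

# Shortened sample input (an assumption for illustration).
text = "Wikipedia.[142] The Foundation aims for 37% by 2015.[143] "

# Group 1 captures ".[...]"; the trailing \s* swallows any following whitespace.
bracket_ref = re.compile(r"(\.\[[^][]*\])\s*")

# \1 re-inserts the captured ".[...]" and a newline replaces the whitespace.
result = bracket_ref.sub(r"\1\n", text)

print(result)
```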
Jan
  • The problem is that it also adds a new line after "[142]" even when no whitespace follows it. For example, it adds a new line after [142] if the string is "Wikipedia.[142][152]" – Noor Jul 05 '19 at 11:47
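One possible way to address this comment, offered as a sketch rather than as the answerer's own fix, is to require at least one whitespace character by changing `\s*` to `\s+`. The sample string `chained` and the variable names are assumptions for illustration.

```python
import re

# Sample with two adjacent references and no space between them (assumed input).
chained = "Wikipedia.[142][152] Next sentence.[143] End"

# \s* also matches zero whitespace, so a newline lands between [142] and [152].
greedy = re.sub(r"(\.\[[^][]*\])\s*", r"\1\n", chained)

# \s+ needs at least one whitespace character, so ".[142][152]" is left alone.
strict = re.sub(r"(\.\[[^][]*\])\s+", r"\1\n", chained)

print(repr(greedy))
print(repr(strict))
```

With `\s+`, a reference at the very end of the string (with nothing after it) is no longer matched, which may or may not be what is wanted.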
1

Use re.findall() to collect the list of matching substrings. Then replace each one in the original string with itself followed by '\n'.
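A rough sketch of this idea, assuming the reference pattern from the question and a shortened sample `text` (the variable names are hypothetical):

```python
import re

# Shortened sample input (an assumption for illustration).
text = "Wikipedia.[142] The Foundation aims for 37% by 2015.[143] "

# With no capturing group, findall returns the full matched substrings.
markers = re.findall(r"\.\[[0-9]{2,5}\]", text)

result = text
# set() avoids appending a second newline when the same marker occurs twice.
for marker in set(markers):
    result = result.replace(marker, marker + "\n")

print(result)
```

This needs one pass over the string per distinct marker and leaves the original space at the start of the next line, so a single `re.sub` call is usually simpler.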

Junior_K27