I have this text:

The Daily Eastern News is a student-run newspaper published for the community of Eastern Illinois University in Charleston, Illinois. The newspaper was founded in 1915 http://media. www. dennews. com/media/storage/paper309/news/2005/11/04/News/The-News. Turns.90-1045667. shtml and publishes on weekdays during the school year and twice-weekly in the summer.

The paper has won numerous state and national awards, including several Pacemaker awards. http://search. atomz. com/search/?sp_a=sp01089f00&sp_f=iso-8859-1&sp_q=%22daily+eastern+news%22 The paper's editorial, production, and advertising staff are composed entirely of students from a range of degree programs.

I want to remove the spaces inside the URLs (the bold parts) in the paragraphs above.

Expected Output:

The Daily Eastern News is a student-run newspaper published for the community of Eastern Illinois University in Charleston, Illinois. The newspaper was founded in 1915 http://media.www.dennews.com/media/storage/paper309/news/2005/11/04/News/The-News.Turns.90-1045667.shtml and publishes on weekdays during the school year and twice-weekly in the summer.

The paper has won numerous state and national awards, including several Pacemaker awards. http://search.atomz.com/search/?sp_a=sp01089f00&sp_f=iso-8859-1&sp_q=%22daily+eastern+news%22 The paper's editorial, production, and advertising staff are composed entirely of students from a range of degree programs.

Regex I tried:

([(http://(.)\.)|(www\.)])\s

Replace with:

$1
Wazy
iNikkz

1 Answer

Try the following regex.

Search:

(?=\. [a-zA-Z0-9\. \-]*?com)\. 

Replace:

.

This finds every .[space] that is followed by "com" with only English letters, digits, dots, spaces, and hyphens in between, since domain names generally consist of those characters, and replaces the .[space] with a dot. It fits your case, but if you have more text you may need to add more characters to the class to make sure all domain names are covered.
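As a minimal sketch, the search and replace above could be applied in Python like this. The function name `fix_domain` and the sample string are made up for illustration, and the digit range is written as 0-9 here so that a zero in a host name is also accepted.

```python
import re

# A ". " that is followed by "com", with only domain-like characters
# (letters, digits, dots, spaces, hyphens) in between.
DOT_SPACE_BEFORE_COM = re.compile(r"(?=\. [a-zA-Z0-9. \-]*?com)\. ")


def fix_domain(text: str) -> str:
    """Replace each matched ". " with "." (hypothetical helper)."""
    return DOT_SPACE_BEFORE_COM.sub(".", text)


sample = "see http://media. www. dennews. com/media/storage for details"
print(fix_domain(sample))
# see http://media.www.dennews.com/media/storage for details
```

Note that the lookahead only asks for the letters "com" somewhere ahead, so ordinary prose such as "done. Very common" would also match; it is worth checking the result on real text.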

Update: The above solution only removes spaces that come before "com". If you need to replace every occurrence of .[space] in the full URL, including the trailing path, it is a good idea to anchor on the 'http://' part. That would call for a lookbehind, but since lookbehinds in most regex engines cannot be of variable length, we need to reverse the string first.

Then apply the following regex to the reversed string for the search part:

(?=[a-zA-Z0-9\/ \.\-]+\/\/:ptth) \.

Replace each match with a single dot (.).

Then reverse the string back again. This can easily be done in Python.
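The reverse, replace, reverse-back steps might look like the following in Python. This is only a sketch: `fix_urls` and the sample text are invented for illustration, and the pattern only recognises "http://" (not "https://").

```python
import re

# In the reversed text, ". " shows up as " ." and "http://" as "//:ptth".
# The lookahead requires only URL-like characters between the " ." and
# the reversed scheme.
SPACE_DOT_BEFORE_SCHEME = re.compile(r"(?=[a-zA-Z0-9/ .\-]+//:ptth) \.")


def fix_urls(text: str) -> str:
    """Remove the space after each dot inside an http:// URL (hypothetical helper)."""
    reversed_text = text[::-1]                      # reverse the string
    fixed = SPACE_DOT_BEFORE_SCHEME.sub(".", reversed_text)
    return fixed[::-1]                              # reverse it back


sample = ("founded in 1915 http://media. www. dennews. com/News/"
          "The-News. Turns.90-1045667. shtml and publishes")
print(fix_urls(sample))
# founded in 1915 http://media.www.dennews.com/News/The-News.Turns.90-1045667.shtml and publishes
```

Because the character class includes a space, a sentence that ends shortly after a URL (with only letters, digits, spaces, dots, slashes, and hyphens in between) would also lose the space after its full stop, so the class may need tightening for other inputs.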

Community
arkoak