
I have a bunch of HTML files where the filename in the img tag contains whitespace ( ). I need to replace each whitespace character with an underscore (_) in a text editor. I am using this regular expression:

(?<=\/img\/)(\s)(?=.png")

but it doesn't work! Here is an example with the expected result:

<img src="./img/setup3oval 7  1.png"/>

expected result:

<img src="./img/setup3oval_7__1.png"/>

Any help is very much appreciated

Blue
Meloide
  • Can you provide some sample inputs and their respective expected outputs? – Pedro Corso Jan 31 '18 at 11:04
  • Be careful with regular expressions around HTML, see https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la and https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Micha Wiedenmann Jan 31 '18 at 11:07
  • Your regex only matches something like `/img/ 1png"`, because a) between `/img/` and `.png` it allows a single whitespace character and nothing else, and b) you use `.` instead of `\.`. Useful: https://regex101.com/r/CFEW9j/1 – Micha Wiedenmann Jan 31 '18 at 11:10
  • share your code. – Hemant Parmar Jan 31 '18 at 11:18
  • @Cath yes I have more than one space, and no (\s)+ doesn't work – Meloide Jan 31 '18 at 11:28
  • Welcome to Stack Overflow! [You do not need to mark questions as "SOLVED" via editing the title](//meta.stackexchange.com/a/116105/295637), or [posting updates/thanks in posts](//meta.stackexchange.com/a/109959/295637). See **[What should I do when someone answers my question?](//stackoverflow.com/help/someone-answers)**. Simply marking an answer as accepted will mark this question as solved for future readers. Anything additional can be perceived as noise for future visitors. – Blue Jan 31 '18 at 13:50
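
The diagnosis in the comments can be checked quickly. This is a minimal sketch in Python, assuming Python's `re` module behaves like the editor's regex engine for this pattern; the variable name `question_pattern` is only for illustration.

```python
import re

# The pattern from the question, unchanged.
question_pattern = re.compile(r'(?<=\/img\/)(\s)(?=.png")')

# No match on the real input: the whitespace is not directly after /img/
# and is not directly followed by one character plus png".
print(question_pattern.search('<img src="./img/setup3oval 7  1.png"/>'))

# It does match when a single whitespace sits directly between the two,
# and the unescaped `.` accepts any character, not only a literal dot.
print(question_pattern.search('/img/ 1png"') is not None)
print(question_pattern.search('/img/ .png"') is not None)
```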

1 Answer


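Because of the limitations of regular expressions (the lookbehind used here anchors only one match per src attribute), it isn't practical to do what you're asking with this kind of pattern in a single run. But you can do it partially, one run of whitespace per pass: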

(?<=src=")(.*?)\s+(.*?)(?=\.)

You've mentioned that you're using a text editor to run this regex. If you're using something like Notepad++, you should be able to click the replace button multiple times, replacing the match with $1_$2, until you reach the expected result. It shouldn't be much of a problem as long as your image file paths don't contain too many whitespace runs.
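
The repeated-replace procedure can also be scripted. Below is a minimal sketch in Python, assuming Python's `re` module is an acceptable stand-in for the editor's engine; the function name `replace_until_stable`, the `max_passes` guard, and the `PER_CHAR_PATTERN` variant are illustrative and not part of the answer. Note that with `\s+` a run of several whitespace characters ends up as a single underscore, while the variant with a plain `\s` yields one underscore per character, as in the question's expected result.

```python
import re

# The answer's pattern, unchanged. Python's `re` spells the replacement
# \1_\2 where Notepad++ and Sublime Text use $1_$2.
ANSWER_PATTERN = re.compile(r'(?<=src=")(.*?)\s+(.*?)(?=\.)')

# Hypothetical variant (not from the answer): `\s` instead of `\s+`, so each
# whitespace character gets its own underscore.
PER_CHAR_PATTERN = re.compile(r'(?<=src=")(.*?)\s(.*?)(?=\.)')


def replace_until_stable(text, pattern=ANSWER_PATTERN, max_passes=1000):
    """Re-run the substitution until a pass leaves the text unchanged."""
    for _ in range(max_passes):
        new_text = pattern.sub(r"\1_\2", text)
        if new_text == text:
            break
        text = new_text
    return text


html = '<img src="./img/setup3oval 7  1.png"/>'
print(replace_until_stable(html))                    # runs collapse to one "_"
print(replace_until_stable(html, PER_CHAR_PATTERN))  # one "_" per whitespace
```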

Explanation of the regex:

  • (?<=src=") - This is a positive lookbehind, used to match only strings that are preceded by this pattern. I'm using the src attribute as a reference instead of the <img> tag.
  • (.*?)\s+(.*?) - This matches a run of whitespace between two blocks of text. I've used lazy quantifiers to avoid wrong matches. I've also wrapped these blocks in capturing groups so they can be used in the substitution.
  • (?=\.) - This is a positive lookahead. The text will match until it reaches a literal dot character. That is, assuming that there won't be any other dots in the line. You should change this assertion if that's not the case.
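
To make the three parts above concrete, here is a small sketch that inspects a single pass over the example line from the question. It assumes Python's `re` module; the variable names are illustrative only.

```python
import re

pattern = re.compile(r'(?<=src=")(.*?)\s+(.*?)(?=\.)')
line = '<img src="./img/setup3oval 7  1.png"/>'

match = pattern.search(line)

# The lookbehind and lookahead only assert their context: neither `src="`
# nor the dot before `png` is part of the matched text.
whole = match.group(0)
before = match.group(1)  # lazy group 1: everything up to the first whitespace
after = match.group(2)   # lazy group 2: everything up to the next dot

print(repr(whole))
print(repr(before))
print(repr(after))

# A single substitution pass fixes only the first run of whitespace.
one_pass = pattern.sub(r"\1_\2", line)
print(one_pass)
```

Since the lookbehind pins the match to the start of the attribute value, each pass handles one run of whitespace per `src`, which is why the replacement has to be repeated.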

Demonstration: regex101.com

I've also tested this regex in Notepad++, hitting the replace button multiple times, and got the expected result.

Pedro Corso
  • Big up for Pedro! Thank you very very much! I hope I will have the chance to help you back one day! I am using sublime text and it works great! – Meloide Jan 31 '18 at 13:45