I have a txt file with <i> and </i> tags between words that I would like to remove using EditPad.

For example, I'd like to keep the tags when the line looks like this:

<i>Phrases and words.</i>

And I'd like to remove the </i> and <i> tags inside the phrase when it looks like this:

<i>Phrases</i>and<i> words.</i>
<i>Phrases</i>and <i>words.</i>

I tried to do that with a regex, but I couldn't get it to work.

Since an inner tag has a space or a word character next to it, I can find the lines that have the doubled tags with

/ <i>|<\/i> /

but this way I can't simply replace the matches with nothing; I have to edit each line I find by hand.

Is there any way to accomplish that?

* Edited *

Another example of lines found in the subtitle text:

<i>- find me on the chamber.</i>
- What? <i>Go. Go, go, go!</i>
Commentator

1 Answer

Rule number one: you can't parse HTML with regex.

That being said, if you know each line follows a certain pattern, you can usually hack something together to work. ;)

If I've understood correctly, it looks like you can simply remove all <i> and </i> that aren't either at the beginning or end of the lines. In that case, one method you could try is the following regex:

(?<=.)\<\/?i\>(?=.)

This will match the tags, with a lookahead and a lookbehind to make sure that we aren't at the start or end of a line (by checking that another character exists before and after the tag). Note that characters matched inside a lookahead or lookbehind are not part of the match, so they won't be replaced when you search/replace.
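As a rough sketch of the same idea outside an editor, the lookaround pattern can be tried in Python's `re` module. The function name `strip_inner_italics` and the sample text are made up for illustration, the unneeded backslashes before `<` and `>` are dropped, and the text is assumed to use plain `\n` line endings (with `\r\n`, the `.` would also match the `\r`, so a tag at the end of a line would be removed too).

```python
import re

# A tag with at least one non-newline character on each side.
# "." does not match "\n" by default, so tags at the very start
# or very end of a line fail one of the two lookarounds.
INNER_TAG = re.compile(r"(?<=.)</?i>(?=.)")


def strip_inner_italics(text: str) -> str:
    """Remove <i> and </i> tags that are not at the start or end of a line."""
    return INNER_TAG.sub("", text)


sample = "<i>Phrases</i>and <i>words.</i>\n<i>Phrases and words.</i>"
print(strip_inner_italics(sample))
```

Only the tags themselves are removed, so a missing space in the input (as in `</i>and`) stays missing in the output.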

Disclaimer: this works on regex101, but Notepad++ may differ in some ways from the PCRE regex style.

Update to work with EditPad

EDIT: since the question is actually about how to do this in EditPad, here is a modified alternative:

Try searching for the regex: (.)\<\/?i\>(.). This will match (and capture) exactly one character before and one character after the <i> or </i> tag.

When replacing, use backreferences to replace the entire match with the two captured characters - a replacement string of \1\2 should work.
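For comparison, here is a minimal sketch of the capture-group variant in Python. The name `strip_inner_italics_captures` is hypothetical and the backslashes before `<` and `>` are dropped; the replacement string `\1\2` is the one suggested above.

```python
import re

# One captured character before the tag and one after it.
# The whole match (char + tag + char) is replaced by the two
# captured characters, which effectively deletes only the tag.
CAPTURED_TAG = re.compile(r"(.)</?i>(.)")


def strip_inner_italics_captures(text: str) -> str:
    """Drop mid-line <i> and </i> tags, keeping the neighbouring characters."""
    return CAPTURED_TAG.sub(r"\1\2", text)


line = "<i>Phrases</i>and<i> words.</i>"
print(strip_inner_italics_captures(line))
```

Because the neighbouring characters are consumed by the match rather than merely checked, this version does not need lookbehind support, which seems to be the point of the alternative.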

  • Thank you for your reply. It's a subtitle file. Unfortunately it didn't work. I'm using EditPad, a program similar to Notepad++. I believe the regex in these programs is the JavaScript regex style – Commentator Jun 09 '17 at 01:27
  • @Comentarist why did you tag your question with `notepad++` then? Two alternatives: use Notepad++ or another more powerful editor for this particular operation, or modify this regex to work with JavaScript-style regex (regex101 says lookbehinds aren't in JS regex) –  Jun 09 '17 at 01:31
  • I tagged it because code that works in Notepad++ might work in mine too, which is not the case. Can you modify this regex to work with the JavaScript style? If I could, I would. – Commentator Jun 09 '17 at 01:36
  • It didn't work properly, but thank you. I believe I can do the job this way, almost as if doing it manually. Sometimes it matches two characters before the tag, and the \1\2 backreference replacement eats one letter. That is my fault: I had to change the pattern to `[^- ](.)\<\/?i\>(.)` because the dialogue lines with `- Text...` were matching – Commentator Jun 09 '17 at 03:19
  • @Comentarist If you are adding the `[^- ]` part to the start, you could put that within the first parentheses to make `([^- ].)...`, so it won't eat that first character. –  Jun 09 '17 at 03:37
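The difference between the two patterns in the last comments can be checked with a small Python sketch. The constant names `EATS_CHAR` and `KEEPS_CHAR` are invented for this illustration, and the backslashes before `<` and `>` are dropped.

```python
import re

# [^- ] sits outside the group, so the character it matches is
# consumed by the match but never written back by \1\2.
EATS_CHAR = re.compile(r"[^- ](.)</?i>(.)")

# [^- ] sits inside the first group, so both characters before
# the tag are written back by \1.
KEEPS_CHAR = re.compile(r"([^- ].)</?i>(.)")

line = "<i>Phrases</i>and <i>words.</i>"
eaten = EATS_CHAR.sub(r"\1\2", line)
kept = KEEPS_CHAR.sub(r"\1\2", line)

print(eaten)
print(kept)
```

With the second pattern, a line such as `- <i>Go!</i>` is left alone because the tag is preceded by `- `. A line such as `- What? <i>Go. Go, go, go!</i>` would still lose its opening tag, since the two characters before it are `? `, so lines like that may still need checking by hand.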