
I have long text files (.srt subtitle files, actually) - which unfortunately include a lot of irrelevant/distracting information.

All irrelevant text is enclosed within identical pairs of pilcrow (paragraph) characters: ¶

So for example, some text would look like this:

This is important, and ¶junk trash garbage rubbish¶ I would like to keep it.

Obviously, I want to remove everything between the ¶ characters and keep the rest. It doesn't matter whether the ¶ characters themselves are stripped or retained: if they're retained, it's trivial just to remove them directly with a subsequent search/replace - so I just need whatever pattern match is easiest.

Note that the ¶ symbols come in identical pairs, so it's not as simple as, for example, stripping out everything between [asymmetrical characters].

I'm not working on any particular platform. In fact, I was hoping to use a web-based tool to do it like this one.

I just need the regex - if anyone can assist! Alternatively, if there are better ways than regex, I'd be grateful for suggestions.

Edit: It has been suggested that this question (Remove text in-between delimiters in a string (using a regex?)) answers what I'm looking for. Thanks, but unfortunately it doesn't. That relates to using it in C# (which I don't know), and the answers to that question do not explain exactly how to replicate what I want. I want it to work in the online tool to which I linked.

Update: A good answer works, but only if the unwanted text appears in-line. I also need it to remove text where the entire line is unwanted:

779 00:35:52,216 --> 00:35:54,784

I miss him already.

780 00:36:00,291 --> 00:36:03,727

¶ If you ever need someone ¶

665

00:30:21,821 --> 00:30:25,589

¶ Feels like

sometimes you want to ¶

So I want to remove everything that appears between the ¶ symbols, regardless of where they appear in the line, and regardless of the presence of line breaks.

Second Update: Subsequent to the accepted answer, it seems it's not entirely working. In the example here, the regex provided does not work on the first multi-line instance. I have no clue what's wrong. I just want line breaks (or any other characters) to be irrelevant. The request is simply to delete everything between pairs of ¶ characters, regardless of where they appear, and regardless of what else lies between.

Final (hopefully) update

For reference, and thanks to user MDR, we have the solution: (¶[\S\s]*?¶)

Chris Melville
  • That online tool you (OP) quoted seems to extract text. Maybe instead use a local text editor that has find and replace with a regex option and find `(¶.*?¶ )` and replace with nothing. Demo: https://regex101.com/r/4v9gXj/3 – MDR Apr 01 '20 at 21:16
  • @MDR - Thanks, this works perfectly! Post as an answer and I'll accept :) – Chris Melville Apr 01 '20 at 21:20
  • Np. Hope it helps. Posted an answer. – MDR Apr 01 '20 at 21:24

2 Answers


Updated because of new information in question and comments below this answer.

That online tool you quoted seems to extract text (perhaps not what you want here, since you want to remove the part that matches). Maybe instead use a local text editor (xed, Gedit, Textedit, TextWrangler, Visual Studio Code, Atom, Notepad++ on Windows, etc.) that has find and replace with a regex option, and find...

(¶[\S\s]*?¶)

...and replace with nothing. Demo: https://regex101.com/r/4v9gXj/8
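If a script is easier than an editor, here is a minimal sketch of the same find-and-replace in Python. Python itself, the `strip_junk` name, and the sample text are assumptions for illustration; the pattern is the one above, without the capturing parentheses, which are not needed when replacing with nothing.

```python
import re

# Lazy match from one pilcrow to the next one.
# [\S\s] matches any character, including newlines.
JUNK = re.compile(r"¶[\S\s]*?¶")


def strip_junk(text: str) -> str:
    """Remove every ¶...¶ span, including the pilcrows themselves."""
    return JUNK.sub("", text)


# Hypothetical sample built from the lines in the question.
sample = (
    "This is important, and ¶junk trash garbage rubbish¶ I would like to keep it.\n"
    "¶ Feels like\n"
    "sometimes you want to ¶\n"
    "I miss him already.\n"
)

cleaned = strip_junk(sample)
print(cleaned)
```

Note that this leaves behind whatever surrounded the junk (a doubled space, or an empty line), so a second pass may be wanted to tidy that up.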

MDR
  • Thanks. Actually, I just noticed it only works when the ¶text¶ appears in the middle of a line. But it doesn't strip it out if the unwanted text comprises the entire line. Do you know how to fix this? – Chris Melville Apr 01 '20 at 21:28
  • Update the question with examples – MDR Apr 01 '20 at 21:28
  • @ChrisMelville It also doesn't work if there are two sets of delimited text in the same line. Please see (https://regex101.com/r/4v9gXj/4) – Ryan Wilson Apr 01 '20 at 21:29
  • Damnit! Doesn't yet work if there's a newline character in the middle of the junk text. Question updated again. – Chris Melville Apr 01 '20 at 21:38
  • @RyanWilson - you're right. Grateful if anyone can simply provide the answer to strip out everything between the symbols, regardless of line breaks, number of occurrences per line, and regardless of whether the entire line is garbage. I want line breaks to be an irrelevant consideration. – Chris Melville Apr 01 '20 at 21:42
  • Try: `(¶.*?¶|¶[\S\s]+¶)` – MDR Apr 01 '20 at 21:43
  • @MDR - sorry about this, but it doesn't seem to work in all cases. I have updated the example at https://regex101.com/r/4v9gXj/6. Please take a look? It seems the FIRST multi-line instance breaks it. – Chris Melville Apr 03 '20 at 14:50
  • @ChrisMelville you need someone that is actually good at regex! ;o) tried to shorten and simplify with `(¶[\S\s]*?¶)` https://regex101.com/r/4v9gXj/8 – MDR Apr 03 '20 at 15:29
  • @MDR Hallelujah! :) Can you edit your actual answer to include this? Thanks! – Chris Melville Apr 03 '20 at 22:26
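
A likely reason the intermediate pattern `(¶.*?¶|¶[\S\s]+¶)` broke on the first multi-line instance: its second alternative is greedy, so once the first alternative fails at a line break, the match runs on to the last ¶ it can reach and swallows wanted text along the way. A small sketch in Python (the language and the sample string are assumptions for illustration, not what the thread's tools use):

```python
import re

# Hypothetical sample: multi-line junk, then wanted text, then more junk.
sample = (
    "¶ Feels like\nsometimes you want to ¶\n"
    "I miss him already.\n"
    "¶ If you ever need someone ¶\n"
)

# Intermediate attempt: the greedy [\S\s]+ runs to the LAST pilcrow.
greedy = re.sub(r"(¶.*?¶|¶[\S\s]+¶)", "", sample)

# Final pattern: the lazy [\S\s]*? stops at the NEXT pilcrow.
lazy = re.sub(r"(¶[\S\s]*?¶)", "", sample)

print(repr(greedy))  # the wanted line is gone
print(repr(lazy))    # the wanted line survives
```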

May I suggest regexr.com? Use `¶.*?¶` as the pattern, then switch to the Replace section, as the screenshot shows.

(Screenshot of the regexr.com Replace section.)

Themelis
  • Thanks, but that just seems to add `¶.*?¶` into the text. – Chris Melville Apr 01 '20 at 21:40
  • The regex goes at the top, in the middle is the original text with the matches from the pattern being highlighted and at the bottom is the text which replaced the matches with nothing. This is what you wanted, right? – Themelis Apr 01 '20 at 21:43
  • This works with in-line text, but not when the entire line begins with the ¶ – Chris Melville Apr 01 '20 at 21:47
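
For what it's worth, `¶.*?¶` most likely misses junk that continues onto the next line because `.` does not match a newline by default in most regex flavors. A rough sketch in Python (an assumption; the online tools discussed here may behave differently), where the `re.DOTALL` flag lets `.` cross line breaks and gives the same result as `[\S\s]`:

```python
import re

# Hypothetical sample: one junk span that straddles a line break.
sample = "keep ¶ Feels like\nsometimes you want to ¶ this"

# Default behaviour: '.' stops at the newline, so nothing matches.
default_dot = re.sub(r"¶.*?¶", "", sample)

# With DOTALL, '.' also matches newlines, so the whole span is removed.
dotall = re.sub(r"¶.*?¶", "", sample, flags=re.DOTALL)

print(repr(default_dot))
print(repr(dotall))
```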