1

First, I'm using EditPadPro for my regex cleaning, so any answers given should work within that environment.

I get a large spreadsheet full of data that I have to clean every day. I've managed to get it down to a couple of different regexes that I run, and this works... but I'm curious to see if it's possible to reduce down to a single regex.

Here is some sample data:

3-CPC_114851_70095_70095_CAN-bre
3-CPC_114851_70095_70095_CAN
b11-ao1-113775-bre
b7-ao-114441
b7-ao-114441-bre
b7-ao1-114441
b7-ao1-114441-bre
http://go.nlvid.com/results1/?http://bo
go.nlv/results1/?click
b4-sm-1359
b6-sm-1356-bre
1359_195_1453814569-bre
1356_104_1456856729
b15-rad-8905
b15-rad-8905-bre

Here is how the above data needs to end up:

114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre

So, there are numerous rules, such as:

  • In cases of more than 2 underscores, the result needs to contain only the value immediately after the first underscore, and everything from the dash onwards.
  • In cases where the string contains "-ao-", "-ao1-", everything prior to the final numeric string should be removed.
  • If a question mark is present, everything from the mark onwards should be removed.
  • If the string contains "-sm-" or "-rad-", everything prior to those alpha strings should be removed.
  • If the string contains 2 underscores, averything after the first numeric string up to a dash (if present) should be removed, and the string "sm-" should be prepended.

Additionally there is other data that must be left untouched, including but not limited to:

113535|24905|24905

as well as many variations on this pattern of xxxxxx|yyyyy|zzzzz (and not always those string lengths)

This may be asking way too much of regex, I'm not sure as I'm not great with it. But I've seen some pretty impressive things done with it, so I thought I'd put this out to the community and see what you come back with.

JVC
  • 793
  • 3
  • 8
  • 21
  • Hey Jonathan, for completion I have edited my answer and given you a solution to do what you want in EPP. We save two regexes to favorites, then record a short macro. It's really easy and there is no other way to get the last regex in because of "sm-". Hope this does it for you, let me know if you have questions. – zx81 Apr 26 '14 at 07:06
  • Also, fixed a couple of typos in the regex (there were a couple where I replaced the match but hadn't bothered to match the end of the string to be deleted.) Pasted the output of the macro for you to see. – zx81 Apr 26 '14 at 07:30

2 Answers2

1

Try this:

  1. Toggle the Search Panel : SHIFT+CTRL+F
  2. SEARCH: .*?((?:sm-|rad-)?(?:(?:\d+|[\w\.]+\/.*?))(?:-\w+)?$)
  3. REPLACE: $1
  4. Check REGEX and WORDS
  5. Click Replace All or Hit CTRL+ALT+F3

Check the image below:

EditPad Search and Repalace

Pedro Lobito
  • 94,083
  • 31
  • 258
  • 268
  • Thanks but I'm not sure what you mean about it needing modifiers. How would I incorporate those into the regex? – JVC Apr 24 '14 at 19:23
  • Give a couple of minutes and I'll investigate that for u – Pedro Lobito Apr 24 '14 at 19:33
  • @JonathanvanClute Please try the regex without any modifiers an check if it works. – Pedro Lobito Apr 24 '14 at 19:35
  • I did try it, it doesn't appear to do anything. It highlights the results it's searching for and it appears to be highlighting exactly the opposite of what I need, but then the replacement just appears to replace it with itself, so in effect nothing happens. – JVC Apr 24 '14 at 19:37
  • @JonathanvanClute ok I'll take a look at it. – Pedro Lobito Apr 24 '14 at 19:39
  • @JonathanvanClute I also have EditPad so It was easier to test, check it out. – Pedro Lobito Apr 24 '14 at 19:52
  • Wow! That's amazing to me. :) Unfortunately there is one more data element I left out, and that is a fully formed URL with partial query string, such as `http://go.nlvid.com/results1/?http://bo`. This does not get handled correctly... do you think you can accommodate it? Also it does not handle `b2-ao1-114441` properly. The 'ao1' appears to throw it off. – JVC Apr 24 '14 at 20:07
  • Oh and lastly, there is other data that needs to be left alone, such as `113535|24905|24905` and currently it is altering these strings. And it's not handling `1359_195_1453814569-bre` correctly either. – JVC Apr 24 '14 at 20:13
1

Jonathan, I can wrap all of those into one regex, except the last one (where you prepend sm- to a string that does not contain sm). It is not possible in this context, because we cannot capture "sm" to reuse in the replacement, and because there is no "conditional replacement" syntax in EPP.

That being said, you can achieve what you want in EPP with two regexes and one macro to chain the two.

Here is how.

The solution below is tested in EPP.

Regex 1

  1. Press Ctrl + Sh + F to enter Search / Replace mode
  2. Enter the following Search and Replace in the appropriate boxes
  3. At the top right of the Search bar, click the Favorite Searches pull-down, select "Add", give it a name, e.g. Regex 1

Search:

(?mx)^
(?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
|
[^\r\n]*?-ao1?-\D*([^\r\n]+)
|
([^\r\n?]*)(?=\?)[^\r\n]+
|
[^\r\n]*?-((?:sm|rad)-[^\r\n]+)

Replace:

\1\2\3\4\5

Regex 2

Same 1-2-3 steps as above.

Search

^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)(?:[^-\r\n]+(-[^\r\n]+)?)

Replace

sm-\1\2

Chaining Regex 1 and Regex 2

  1. Top menu: Macros, Record Macro, give it a name.
  2. Click the Favorite searches pulldown, select Regex 1
  3. Hit Replace All.
  4. Click the Favorite searches pulldown, select Regex 2
  5. Hit Replace All.
  6. Macros, Stop recording.
  7. Whenever you want to do your sequence of replacements, pull it by name under the Macros menu.

Testing This

I have tested my "Jonathan macro" on your input. Here is the result:

114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
zx81
  • 41,100
  • 9
  • 89
  • 105
  • Interesting, thanks for your attempt! Even though it doesn't quite do what I wanted, I can hopefully learn a thing or two from your response. =) – JVC Apr 24 '14 at 23:38
  • @JonathanvanClute The author of EPP and regex uberexpert (Jan) confirmed my suspicion that there is no known trick to capture text that is not in the string: `I don't know of any regex engine that allows you to capture text that is not actually matched. You can use capturing groups inside lookaround to capture text that is not included in the overall match, but that text still needs to be matched while the lookaround is attempted.` Let me know if you have any questions Jonathan. Your original question was what was possible, and IMO this answer gets you the closest you can get in EPP. – zx81 Apr 25 '14 at 10:10
  • That's interesting, thanks for the update. Unfortunately both answers given work in some ways, and fail in others. Yours failed to handle the following - `3-CPC_114851_70095_70095_CAN`, `http://go.nlvid.com/results1/?http://bo`, `go.nlv/results1/?click`, and `1356_104_1456856729` (on the last, it not only failed to prepend sm- as you explained cannot be done, but it didn't actually process the string at all). So frankly I have to answers that both partially work. I'm going to wait a bit before closing this to see if anyone comes up with anything closer to what I'm hoping for. – JVC Apr 25 '14 at 16:50
  • Oh wow, I was unfamiliar with using Macros in EPP before this. With that feature I can accomplish this already with the regexes I have. Thanks for all your effort! – JVC Apr 26 '14 at 16:50
  • @JonathanvanClute You're welcome, glad it worked. Yes macros are easy to set up, aren't they? Btw if for any reason there's a long macro you need to edit a lot, instead of re-recording you can export it, open the file, edit inside, import again. – zx81 Apr 26 '14 at 18:51
  • Oh that is an excellent tip! I did indeed already re-record my ultimate macro about 5 times. =) – JVC Apr 26 '14 at 19:30