Regex - Removing parts of URL path

Question

I am useless at Regex and I want to remove parts of a URL that are not always consistent.

The URL might be:

www.test.com**/en/**restOfPath

or

www.test.com**/en/en_gb/**restOfPath

Then, depending on the country, the values might change to:

www.test.com**/es/**restOfPath

or

www.test.com**/es/es_es/**restOfPath

I am therefore looking to always remove the parts in bold so that I can split the remainder of the path and create a logical naming scheme that is language/location agnostic.

I am doing this as a workaround to build out a data layer until the client can implement it properly when they launch their new website. I have managed to build an if/else statement as a workaround, which is a bit clunky, but I would like a cleaner solution.

Generally we want to help people who've been working on a solution to a problem. Have you tried a regex solution for this? If not, maybe you should do some regex tutorials. — Alex Collins, Sep 12 '17 at 14:38
Possible duplicate of [How do I parse a URL into hostname and path in javascript?](https://stackoverflow.com/questions/736513/how-do-i-parse-a-url-into-hostname-and-path-in-javascript) — Tim Biegeleisen, Sep 12 '17 at 14:39
I'm not a JavaScript guru, but if you follow the link above you'll see that there are already some libraries out there which can help you to parse a URL/URI. I'd start by using those as much as possible, and only afterwards resort to using a regex. — Tim Biegeleisen, Sep 12 '17 at 14:40
I used to be useless at Regex as well. What helped me was experimenting with my problems on http://regexr.com/ until I found a solution that fit. Now I am not completely useless anymore. — ivospijker, Sep 12 '17 at 14:40
You have to get and use a list of all those language abbreviations, otherwise the regex doesn't know them: `lan1(?:_X1)?|lan2(?:_X2)?|lan3(?:_X3)?|...`, etc. — , Sep 12 '17 at 16:17
Thank you for the responses. @alex sorry if I was not clear on why I needed help; I was writing a temporary workaround in a tech spec for a client to build a page name for analytics. The answer above does not solve my question, and I have tried the two regexes below but they don't seem to do it either. I built an if/else statement as a workaround, but it is a bit clumsy. — Roman Rock, Sep 12 '17 at 19:15
@RomanRock the regex I provided definitely matches the text in bold in your examples. See the example I provided in the link. What exactly is it you can't get working? — DNKROZ, Sep 13 '17 at 13:51
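
The comment above about keeping an explicit list of language abbreviations could be sketched roughly as follows. This is only an illustration: the `LOCALES` map, the function names, and the `/lang/lang_region/` layout are assumptions taken from the question's examples, not something any commenter posted.

```javascript
// Hypothetical allow-list: language code -> locale variants seen on the site.
const LOCALES = {
  en: ["en_gb"],
  es: ["es_es"],
};

// Build one alternative per language, e.g. "en(?:/(?:en_gb))?".
function buildLocalePrefixRegex(locales) {
  const alternatives = Object.entries(locales).map(([lang, variants]) =>
    variants.length > 0 ? `${lang}(?:/(?:${variants.join("|")}))?` : lang
  );
  // Anchor at the start of the path and require "/" or end of string
  // after the prefix, so "/entertainment" is not mistaken for "/en".
  return new RegExp(`^/(?:${alternatives.join("|")})(?=/|$)`, "i");
}

// Remove a known language/locale prefix from a path; other paths are unchanged.
function stripKnownLocale(path) {
  return path.replace(buildLocalePrefixRegex(LOCALES), "");
}
```

With this approach only prefixes that appear in the list are removed, at the cost of having to maintain the list as new languages are added.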

score 2 · Accepted Answer · answered Sep 12 '17 at 14:49


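This will probably help you:

(?:\/([a-z]{2})(?:\/([a-z]{2}_[A-Z]{2}))?)

This pattern finds the first / followed by two lowercase letters, optionally followed by another / and a locale of the form aa_AA. Because `[A-Z]{2}` only matches uppercase letters, lowercase locales such as en_gb need the case-insensitive flag (`i`).

I put code samples for you on regex101.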
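
As a rough sketch of how this pattern might be applied, the snippet below strips the prefix and splits what is left. It is a variant, not the answer's exact regex: the `^` anchor, the trailing lookahead, and the `i` flag are additions assumed here, and `LOCALE_PREFIX` and `splitLocaleAgnosticPath` are hypothetical names. It also assumes the path has already been separated from the host (for example via `location.pathname`).

```javascript
// Variant of the pattern above. Assumed additions: "^" so only the start of
// the path is considered, "(?=\/|$)" so "/products" is not read as "/pr",
// and the "i" flag so lowercase locales like "en_gb" also match.
const LOCALE_PREFIX = /^\/([a-z]{2})(?:\/([a-z]{2}_[A-Z]{2}))?(?=\/|$)/i;

// Strip the language/locale prefix, then split the remainder into segments.
function splitLocaleAgnosticPath(path) {
  const rest = path.replace(LOCALE_PREFIX, "");
  return rest.split("/").filter(Boolean); // drop empty segments
}

console.log(splitLocaleAgnosticPath("/en/en_gb/products/shoes"));
console.log(splitLocaleAgnosticPath("/es/products/shoes"));
```

One caveat: any two-letter first segment is treated as a language code, so a path such as `/uk/offers` would lose its first segment as well.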

Andrew Rumm


DNKROZ · Answer 2 · 2017-09-12T15:00:01.697

score 1

I believe this is what you're after:

\/.*(?=\/.*?)

https://regex101.com/r/OZIseI/4

It uses a positive lookahead to exclude the last / from the match.
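
To make the behaviour of this pattern concrete, here is a small sketch that applies it with `String.prototype.replace`. The function name `stripUpToLastSlash` is hypothetical, and the inputs are written without a scheme, as in the question.

```javascript
// The pattern from this answer: a greedy ".*" that backtracks until a "/"
// follows, so the match runs from the first "/" up to just before the last one.
const pattern = /\/.*(?=\/.*?)/;

// Remove everything between the first "/" and the last "/" (exclusive).
function stripUpToLastSlash(url) {
  return url.replace(pattern, "");
}

console.log(stripUpToLastSlash("www.test.com/en/restOfPath"));
console.log(stripUpToLastSlash("www.test.com/en/en_gb/restOfPath"));
// If restOfPath itself contains slashes, those segments are removed too:
console.log(stripUpToLastSlash("www.test.com/en/products/shoes"));
```

Because the match always extends to the last `/`, this only gives the intended result when restOfPath is a single segment, and a URL with `https://` would start matching at the scheme's slashes.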

edited Sep 12 '17 at 15:00

answered Sep 12 '17 at 14:54

