
I have a URL string like "home/products/product_name_1/details/some_options", and I want to parse it with a regex into the array ["home", "products", "product", "details", "some"].

So the rule is: "split into words on each slash, but if a word contains underscores, take only the part that comes before the first underscore".

The JavaScript equivalent of this regex is:

str.split("/").map(item => item.indexOf("_") > -1 ? item.split("_")[0] : item)

Please help!

Wiktor Stribiżew
Alexander Guskov
  • What is your regex tool or language? We need to know this in order to be able to help you here. – Tim Biegeleisen Feb 03 '22 at 07:09
  • To rephrase your question in URL-specific terms: you want to split the given URL into its _path-segments_, or more generically, perform _URL normalization_. Did you search here like [this](https://stackoverflow.com/search?q=path%20segments%20from%20URL)? What did you try? Please post any code as an example. – hc_dev Feb 03 '22 at 07:15
  • Currently I'm using JavaScript, but I want to do this with a regex instead of something like str.split("/").map(item => item.indexOf("_") > 0 ? item.split("_")[0] : item) – Alexander Guskov Feb 03 '22 at 07:20
  • Correct. "name", "1", "options" are ignored – Alexander Guskov Feb 03 '22 at 08:14
  • You already have a working solution. Are there any reasons why you would switch to regex? If you can't write such a pattern, you won't be able to maintain it. – Cid Feb 03 '22 at 08:30

7 Answers

1

You can use this pattern:

(?<!\w)[^/_]+

Results:

['home', 'products', 'product', 'details', 'some']

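Python code:

import re

# Raw string so that \w reaches the regex engine unchanged;
# "url" avoids shadowing the built-in str.
url = "home/products/product_name_1/details/some_options"

print(re.findall(r'(?<!\w)[^/_]+', url))
# ['home', 'products', 'product', 'details', 'some']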
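
Since the question is about JavaScript, the same pattern can probably be used there as well. This is only a sketch: it assumes a runtime with lookbehind support (ES2018 or later), and the variable names are made up for illustration.

```javascript
// Assumption: the runtime supports lookbehind assertions (ES2018+).
const pathForLookbehind = "home/products/product_name_1/details/some_options";

// (?<!\w)  : the match may not start right after a word character
// [^\/_]+  : one or more characters that are neither "/" nor "_"
const lookbehindPattern = /(?<!\w)[^\/_]+/g;

const lookbehindWords = pathForLookbehind.match(lookbehindPattern);
console.log(lookbehindWords);
```

Because "_" counts as a word character, the lookbehind is what keeps "name", "1" and "options" from starting a match.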
arutar
1

Given input:

  • string "home/products/product_name_1/details/some_options"

Expected output:

  • array ["home", "products", "product", "details", "some"]
  • Note: ignore/exclude name, 1, options (because word occurs after 1st underscore).

Task:

  • split URI by slash into a set of path-segments (words)
  • (if the path-segment or word contains underscores) remove the part after first underscore

Regex to match

With a regex \/|_\w+ you could match the URL-path separator (slash) and excluded word-part (every word after an underscore).

Then use this regex

  • either as separator to split the string into its parts (excluding the regex matches): e.g. in JS split(/\/|_\w+/)
  • or as search pattern in replace to prepare a string that can be easily split: e.g. in JS replaceAll(/\/|_\w+/g, ',') to obtain a CSV row, which can then be split by comma with split(',')

Beware: The regular-expression itself (flavor) and functions to apply it depend on your environment/regex-engine and script-/programming-language.

Regex applied in JavaScript

split by regex

For example, in JavaScript use url.split(/\/|_\w*/) where:

  • /pattern/: everything inside the slashes is the regex pattern
  • \/: a literal slash, escaped with a backslash (the URL path separator)
  • |: alternation, interpreted as boolean OR
  • _\w*: an underscore followed by zero or more (*) word characters (\w, i.e. a letter, a digit or an underscore)
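
The text above uses both _\w+ and _\w*. The difference only shows when an underscore is not followed by a word character. A small sketch with a made-up input ("a_/b" is not from the question):

```javascript
// Hypothetical input with a segment that ends in a bare underscore.
const sample = "a_/b";

// _\w+ needs at least one word character after the underscore,
// so the lone "_" is not matched and stays attached to "a".
const withPlus = sample.split(/\/|_\w+/);

// _\w* also matches a lone "_", which leaves an empty string behind.
const withStar = sample.split(/\/|_\w*/);

console.log(withPlus); // [ 'a_', 'b' ]
console.log(withStar); // [ 'a', '', 'b' ]
```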

However, this also returns empty strings (the empty pieces left between a removed underscore part and the following slash). We can remove the empty strings with a filter whose predicate s => s is truthy only if the string is non-empty.

Demo to solve your task:

const url = "home/products/product_name_1/details/some_options";

let firstWordsInSegments = url.split(/\/|_\w*/).filter(s => s);

console.log(firstWordsInSegments);


const urlDuplicate = "home/products/product_name_1/details/some_options/_/home";
console.log(urlDuplicate.split(/\/|_\w*/).filter(s => s)); // contains duplicates in output array

replace into CSV, then split and exclude (map,replace,filter)

The CSV containing path-segments can be split by comma and resulting parts (path-segments) can be filtered or replaced to exclude unwanted sub-parts.

using:

  • replaceAll to transform to CSV or remove empty strings. Note: global flag required when calling replaceAll with regex
  • map to remove unwanted parts after underscore
  • filter(s => s) to filter out empty strings

const url = "home/products/product_name_1/details/some_options";

// step by step
let pathSegments = url.split('/');
console.log('pathSegments:', pathSegments);
let firstWordsInSegments = pathSegments.map(s => s.replaceAll(/_\w*/g,''));
console.log(firstWordsInSegments);

// replace to obtain CSV and then split
let csv = "home/products/product_name_1/details/some_options/_/home".replaceAll(/\/|_\w+/g, ',');
console.log('csv:', csv);
let parts = csv.split(',');
console.log('parts:', parts); // contains empty parts
let nonEmptyParts = parts.filter(s => s);
console.log('nonEmptyParts:', nonEmptyParts); // filtered out empty parts

Bonus Tip

Try your regex online (e.g. regex101 or regexplanet). See the demo on regex101.

hc_dev
1

Try this:

const input = ["home/products/product_name_1/details/some_options",
    "company/products/cars_all/details/black_color",
    "public/places/1_cities/disctricts/1234_something"]

let pattern = /([a-zA-Z\d]*)(?:\/|_.*?(?:\/|$))/gmi

input.forEach(el => {
    let matches = el.matchAll(pattern)
    for (const match of matches) {
        console.log(match[1]);
    }
})

Remove \d from the regex pattern if you don't want digits in the URL. I have used matchAll here; matchAll returns an iterator over match objects, in which the first element is the full match and the second element (index 1) is the required group.

/([a-zA-Z\d]*)(?:\/|_.*?(?:\/|$))/gmi

/
([a-zA-Z\d]*)         capture group to match letters and digits
(?:\/|_.*?(?:\/|$))   non capture group to match '/' or '_' and everything till another '/' or end of the line is found 
/gmi

You can test this regex here: https://regex101.com/r/B5Bo74/1

anotherGatsby
  • If you are using the `/i` flag you can shorten the character class to `[a-z\d]+` and repeat it 1 or more times to prevent matching an empty string. You might turn the first alternation in an optional group and the non greedy dot `.*?` into a negated character class `let pattern = /([a-z\d]+)(?:_[^\/\n]*)?(?:\/|$)/gmi` – The fourth bird Feb 03 '22 at 10:25
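
To compare the answer's pattern with the one suggested in the comment, something like the following could be used. It is a sketch only: the third input is a made-up case whose last segment has no underscore. The answer's pattern requires a "/" or "_" after each word, so it should miss that last segment.

```javascript
// Patterns quoted from the answer and from the comment above.
const answerPattern = /([a-zA-Z\d]*)(?:\/|_.*?(?:\/|$))/gmi;
const commentPattern = /([a-z\d]+)(?:_[^\/\n]*)?(?:\/|$)/gmi;

const commentInputs = [
    "home/products/product_name_1/details/some_options",
    "public/places/1_cities/disctricts/1234_something",
    "home/products" // hypothetical: last segment has no underscore
];

// Helper (name is illustrative): collect capture group 1 of every match.
const firstGroups = (text, pattern) =>
    Array.from(text.matchAll(pattern), m => m[1]);

const answerResults = commentInputs.map(el => firstGroups(el, answerPattern));
const commentResults = commentInputs.map(el => firstGroups(el, commentPattern));

console.log(answerResults[2]);  // [ 'home' ]
console.log(commentResults[2]); // [ 'home', 'products' ]
```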
1

You can use:

\b[^\W_]+
  • \b A word boundary to prevent a partial match
  • [^\W_]+ Match 1+ word characters except for _

See a regex demo.

const s = "home/products/product_name_1/details/some_options";
const regex = /\b[^\W_]+/g;
console.log(s.match(regex));

If there has to be a leading / or the start of the string before the match, you can use an alternation (?:^|\/) and use a capture group for the values that you want to keep:

const s = "home/products/product_name_1/details/some_options";
const regex = /(?:^|\/)([^\W_]+)/g;
console.log(Array.from(s.matchAll(regex), m => m[1]));
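
The two patterns give the same result for the string in the question, but they can differ on other input. A small sketch with a made-up string ("foo-bar/baz_qux" is an assumption for illustration) shows where:

```javascript
// Hypothetical input: the first segment contains a hyphen.
const hyphenated = "foo-bar/baz_qux";

// Word-boundary version: a match can also start right after "-".
const byBoundary = hyphenated.match(/\b[^\W_]+/g);

// Anchored version: a match may only start at the string start or after "/".
const bySegment = Array.from(
    hyphenated.matchAll(/(?:^|\/)([^\W_]+)/g),
    m => m[1]
);

console.log(byBoundary); // [ 'foo', 'bar', 'baz' ]
console.log(bySegment);  // [ 'foo', 'baz' ]
```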
The fourth bird
  • Great simple regex to split words. The use of word-boundaries makes it reusable and more versatile - to split more than just URIs. – hc_dev Feb 03 '22 at 09:49
0

You could split the URL with this regex:

(_\w*)+|(\/)

This matches the /, _name_1 and _options.

BUT depending on what you are trying to do, or which language you use, there are much better options for this.

  • `split` will keep the values of the capture groups. You can omit the groups and remove the empty entries `s.split(/_\w*|\//).filter(Boolean)` – The fourth bird Feb 03 '22 at 09:59
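
The effect described in the comment can be seen by running both variants side by side. A minimal sketch (the variable names are illustrative):

```javascript
const urlToSplit = "home/products/product_name_1/details/some_options";

// With capture groups, split() also inserts the captured text
// (or undefined for a group that did not participate) into the result.
const withGroups = urlToSplit.split(/(_\w*)+|(\/)/);

// Without groups, only the pieces between matches are returned;
// filter(Boolean) then drops the empty strings.
const withoutGroups = urlToSplit.split(/_\w*|\//).filter(Boolean);

console.log(withGroups);
console.log(withoutGroups); // [ 'home', 'products', 'product', 'details', 'some' ]
```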
0

You can try a pattern like \/([^\/_]+){1,} (assuming that the path starts with '/' and the components are separated by '/'); depending on language you might get an array or iterator that will give the components.
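
In JavaScript, for instance, this could look roughly like the sketch below. The string in the question has no leading "/", so one is prepended here to satisfy the assumption stated above; the variable names are made up.

```javascript
// Assumption from the answer: the path starts with "/", so one is prepended here.
const rootedPath = "/" + "home/products/product_name_1/details/some_options";

// Pattern as given in the answer; group 1 holds the text after each "/"
// up to the next "/" or "_".
const slashPattern = /\/([^\/_]+){1,}/g;

// matchAll returns an iterator; Array.from collects group 1 of each match.
const slashWords = Array.from(rootedPath.matchAll(slashPattern), m => m[1]);
console.log(slashWords);
```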

Peter Faller
0

Try ^[[:alpha:]]+|(?<=\/)[[:alpha:]]+, or ^[a-zA-Z]+|(?<=\/)[a-zA-Z]+ if [[:alpha:]] is not supported. It matches one or more letters at the beginning of the string or after a slash, up to the first non-letter character.
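
JavaScript's RegExp has no POSIX classes such as [[:alpha:]], so the [a-zA-Z] variant is the one to try there. A sketch, assuming lookbehind support (ES2018+); the second input is a made-up case showing that segments starting with a digit are skipped:

```javascript
// Assumption: the runtime supports lookbehind assertions (ES2018+).
const alphaPattern = /^[a-zA-Z]+|(?<=\/)[a-zA-Z]+/g;

const alphaWords =
    "home/products/product_name_1/details/some_options".match(alphaPattern);

// Hypothetical input: "1_cities" starts with a digit, so nothing is taken from it.
const alphaWordsWithDigit = "public/places/1_cities".match(alphaPattern);

console.log(alphaWords);          // [ 'home', 'products', 'product', 'details', 'some' ]
console.log(alphaWordsWithDigit); // [ 'public', 'places' ]
```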

Tomáš Šturm