Say we have an HTML page containing these links:

<a href="katalog/koshelki-i-klatchi/muzhskaya-sumka-planshet-polo-optom1">

<a href="katalog/koshelki/kozhanaya-sumka-jeep-optom1">

I need to do this with a single regex search (one query), and I want the output to be:

koshelki-i-klatchi/muzhskaya-sumka-planshet-polo-optom1

koshelki/kozhanaya-sumka-jeep-optom1

What would a regular expression for this task look like?

stackFan
Roman

1 Answer

Do you want something like this:

http:\/\/[A-Za-z0-9\.]*(\/[A-Za-z0-9]*)?\/[A-Za-z0-9]+[0-9]{1}

Test it here: https://regex101.com/r/cnxvR0/1

It matches anything starting with http://, followed by any number of letters, digits or dots (the host name), optionally followed by one directory (a forward slash and zero or more letters or digits), then a forward slash, one or more letters or digits, and finally a single digit at the very end.
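To make the description above concrete, here is a minimal Python sketch that runs the same pattern against a few URLs. The URLs are made up for illustration (they are not from the question), and `matches` is just a hypothetical helper name.

```python
import re

# The pattern from the answer, unchanged.
pattern = re.compile(
    r"http:\/\/[A-Za-z0-9\.]*(\/[A-Za-z0-9]*)?\/[A-Za-z0-9]+[0-9]{1}"
)

def matches(url: str) -> bool:
    """Return True if the whole URL fits the pattern."""
    return pattern.fullmatch(url) is not None

# Hypothetical sample URLs, chosen only to exercise each part of the pattern.
samples = [
    "http://example.com/page1",      # no directory, ends with a digit
    "http://example.com/dir/page1",  # one directory, ends with a digit
    "http://example.com/a/b/page1",  # two directories: too deep for this pattern
    "http://example.com/page",       # no trailing digit
]

for url in samples:
    print(url, matches(url))
```

Only the first two samples should fit, which is the "one directory, trailing digit" limitation discussed next.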

I'm sure this will not cover all of your cases, so you need to be more specific. How many digits are there at the end, is it always only one? Does the URL have to end with a digit, or is that optional? How many nested directories can there be (my regex allows only one)?

Let me know if the regex above does what you need, or answer the questions above in the comments and I'll edit my answer accordingly.

OK, so after you edited your original question:

(?<=href=")(?:[\w-]+\/?)*

Try it here: https://regex101.com/r/q0tf5l/2

Let me know if this is what you wanted. You can iterate through all of the matches and print them out, or do whatever else you need with them.
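As a rough sketch of iterating over the matches, here is the lookbehind pattern in Python applied to the two links from the question. Note that it returns the path including the leading `katalog/`. The second pattern is not from the answer: it is a variant that assumes every link starts with the literal prefix `katalog/` and moves that prefix into the lookbehind, so the output is the form asked for in the question.

```python
import re

# The two links from the question.
html = (
    '<a href="katalog/koshelki-i-klatchi/muzhskaya-sumka-planshet-polo-optom1">\n'
    '<a href="katalog/koshelki/kozhanaya-sumka-jeep-optom1">'
)

# Pattern from the answer: everything after href=" made of word characters,
# hyphens and slashes. The match still includes "katalog/".
full_paths = re.findall(r'(?<=href=")(?:[\w-]+\/?)*', html)

# Variant (an assumption, not part of the answer): treat "katalog/" as a
# fixed prefix and keep it inside the lookbehind so it is left out of the match.
trimmed = re.findall(r'(?<=href="katalog/)[\w/-]+', html)

for path in full_paths:
    print(path)
for path in trimmed:
    print(path)
```

If the prefix can vary, the fixed-width lookbehind no longer works in Python, and a capturing group after a generic first path segment would be the usual fallback.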

whatamidoingwithmylife
  • Thank you for your answer. There will not always be digits at the end; there may be no digit at all. It is just a URL, and it could have any number of nested directories – Roman Mar 07 '18 at 15:45
  • Sorry, I messed up the first comment :) – Roman Mar 07 '18 at 15:47
  • And I need to get rid of repeating parts, if there are any, taking only one occurrence – Roman Mar 07 '18 at 15:49
  • Well in that case (valid URL doesn't end with digit) we can't really distinguish between valid URLs and that `randomgarbage` URL, you need to have some kind of distinction between them. – whatamidoingwithmylife Mar 07 '18 at 15:53
  • There is a + sign in regex, so I could use the expression (a+) to get 'a', 'aaaa', 'aa' as a single 'a'. I wonder if I can get any group of letters that way? – Roman Mar 07 '18 at 16:10
  • Or was it exactly what you said? – Roman Mar 07 '18 at 16:11
  • Yeah, you can. `+` is a quantifier: it matches only if there is at least one instance of whatever precedes the `+`, so it matches 1 or more of the specified character. To match a group of letters, for example "abba", you can do it like this: `(abba)+`. It will match 1 or more `abba` words, so it matches `abba`, `abbaabba`, `abbaabbaabba`, etc. – whatamidoingwithmylife Mar 07 '18 at 16:20
  • Yes, but this way I have to know exactly what the word is, like "abba", but I need to detect duplicates of any random word – Roman Mar 07 '18 at 16:22
  • Depending on how complicated what you have in mind is, it may not be suitable for regular expressions. It's fine if it's something simpler, like: https://stackoverflow.com/questions/2823016/regular-expression-for-consecutive-duplicate-words – whatamidoingwithmylife Mar 07 '18 at 16:23
  • I'm very sorry, I didn't get any sleep for the last 36 hours, so I got my task completely wrong. I edited my question, could you please take a look? – Roman Mar 07 '18 at 16:59
  • But that could also for example match `test` in `` – The fourth bird Mar 07 '18 at 17:15
  • @Thefourthbird He needs to provide me some specific distinction, otherwise this is the best I can come up with, if he only wants to match `a` tag href attributes, he can append `\ – whatamidoingwithmylife Mar 07 '18 at 17:19
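
The two points from the comments about `+` and about duplicates of "any random word" can be sketched in Python as follows. The backreference pattern is a common idiom for consecutive duplicate words, not something given in the thread, and the sample sentence is made up.

```python
import re

# `+` applied to a group: one or more repetitions of the literal "abba".
group_repeat = re.compile(r"(abba)+")
print(bool(group_repeat.fullmatch("abbaabba")))  # whole string is "abba" twice
print(bool(group_repeat.fullmatch("abbab")))     # trailing "b" breaks the repetition

# Duplicates of an unknown word: capture the word, then require the same
# text again via the backreference \1. Each run collapses to one occurrence.
duplicate_words = re.compile(r"\b(\w+)(?:\s+\1\b)+")

sentence = "this is is a test test test"  # hypothetical input
collapsed = duplicate_words.sub(r"\1", sentence)
print(collapsed)
```

This only handles whole words repeated back to back and separated by whitespace; repeated fragments inside a URL path would need a different delimiter than `\s+`.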