Split using multiple keywords using regex

Question

I have a string containing the following (in reality it has no line breaks):

<td class="coll-1 name">
  <a href="/sub/48/0/" class="icon"><i class="flaticon-divx"></i></a>
  <a href="/torrent/2349324/some-stuuf-here/">SAME stuff here</a>
  <span class="comments"><i class="flaticon-message"></i>1</span>
</td>

and I want an array that stores the pieces of the string split on href=" and /"> specifically. How can I do that? I have tried this:

new_array=my_string.split(/ href="  ,   \/">/)

Edit:

.split(/href="/)

This works well on its own, but it does not handle the other part.

.split(/\/">/)

Similarly, this works too, but I am unable to combine the two into one line.

[You can't parse (X)HTML with regex](https://stackoverflow.com/a/1732454/477037). Anyway – what is your expected result? — Stefan, Aug 02 '17 at 11:57
@Stefan `/torrent/2349324/some-stuuf-here` is my expected result. — Rishav, Aug 02 '17 at 12:01
Why not `/sub/48/0/`? How do you determine the correct link? — Stefan, Aug 02 '17 at 12:06
@Stefan Thats the whole point. What I am trying is to make a bot that gives me the `/torrent/2349324/some-stuuf-here/` from a webpage. I have the very line from the webpage which is stored in `my_string`. I just want to extract the address from it. I determine the correct link by knowing that the correct link ends right with `/">` every time. — Rishav, Aug 02 '17 at 12:09
Please edit your question rather than elaborating in comments. Not all readers see all comments. — Cary Swoveland, Aug 02 '17 at 15:07

score 2 · Answer 1 · answered Aug 02 '17 at 13:18

Given this string:

string = <<-HTML
  <td class="coll-1 name">
    <a href="/sub/48/0/" class="icon"><i class="flaticon-divx"></i></a>
    <a href="/torrent/2349324/some-stuuf-here/">SAME stuff here</a>
    <span class="comments"><i class="flaticon-message"></i>1</span>
  </td>
HTML

and assuming that the correct link is the one without icon class, you could use the CSS selector a:not(.icon), for example via Nokogiri:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(string)

doc.at_css('a:not(.icon)')[:href]
#=> "/torrent/2349324/some-stuuf-here/"

answered Aug 02 '17 at 13:18

Stefan


Above my head. xD – Rishav Aug 02 '17 at 13:44
@Rishav what exactly? CSS selectors? – Stefan Aug 02 '17 at 13:53
Your solution is giving the required result after installing the gem though. But really I know only basic Ruby and batch commands. Icon class CSS n all r too far. Thanks though. Learned something. – Rishav Aug 02 '17 at 14:00

Gerry · Accepted Answer · 2017-08-02T13:00:42.157

1

You can take advantage of lookahead and lookbehind, like this:

my_string.scan(/(?<=href=").*(?=\/">)/)
#=> ["/torrent/2349324/some-stuuf-here"]

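This will return an array with every occurrence of href=" ... /">, keeping only the ... part. Note that .* is greedy and . does not match a newline by default, so the result above relies on each link sitting on its own line; in a string with no line breaks the match would start at the first href=" and run up to the last /">.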
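If the markup really is on one line, as the question says, one way to keep each match inside a single attribute is to replace .* with [^"]*. The following is a minimal sketch, not a tested solution for the real page: the single-line my_string is an assumed reconstruction of the question's sample, and the names greedy and bounded are only illustrative.

```ruby
# Assumed single-line version of the question's markup (no line breaks).
my_string = '<td class="coll-1 name">' \
            '<a href="/sub/48/0/" class="icon"><i class="flaticon-divx"></i></a>' \
            '<a href="/torrent/2349324/some-stuuf-here/">SAME stuff here</a>' \
            '<span class="comments"><i class="flaticon-message"></i>1</span>' \
            '</td>'

# Greedy `.*`: the single match starts right after the FIRST href=" and
# extends to the last /">, so it swallows the icon link as well.
greedy = my_string.scan(/(?<=href=").*(?=\/">)/)

# `[^"]*` cannot cross a double quote, so a match stays inside one attribute.
bounded = my_string.scan(/(?<=href=")[^"]*(?=\/">)/)
#=> ["/torrent/2349324/some-stuuf-here"]
```

The icon link is skipped by the bounded version because its value is followed by `" class=` rather than `/">`.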

Or you can get everything that matches href=".../"> and then remove href=" and the trailing /">, something like this:

my_string.scan(/(?:href=".*\/">)/).map { |e| e.gsub(/(href="|\/">)/, "") }
#=> ["/torrent/2349324/some-stuuf-here"]

Here scan returns an array of every substring that matches /href=".*\/">/, and gsub then strips href=" and /"> from each element.
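The match-then-strip approach can also be written with a capture group, so that scan returns only the inner part and no gsub is needed. This is a sketch under the assumption that the string looks like the sample in the question; the variable name links is illustrative.

```ruby
# Sample markup from the question, here with line breaks.
my_string = <<~HTML
  <td class="coll-1 name">
    <a href="/sub/48/0/" class="icon"><i class="flaticon-divx"></i></a>
    <a href="/torrent/2349324/some-stuuf-here/">SAME stuff here</a>
    <span class="comments"><i class="flaticon-message"></i>1</span>
  </td>
HTML

# With one capture group, scan returns an array of one-element arrays,
# so flatten turns the result into a plain array of strings.
links = my_string.scan(/href="([^"]*)\/">/).flatten
#=> ["/torrent/2349324/some-stuuf-here"]
```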

> How do I split using 2 keywords using regex

You can use | to denote "or" (alternation) in a regex. The / inside a /.../ literal has to be escaped, like this:

my_string.split(/(?:href="|\/">)/)
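To see what a split on both delimiters actually produces, here is a small sketch. The single-line my_string is an assumed reconstruction of the question's sample, and the index used to pick the link is specific to this markup, so it would need checking against the real page.

```ruby
# Assumed single-line version of the question's markup (no line breaks).
my_string = '<td class="coll-1 name">' \
            '<a href="/sub/48/0/" class="icon"><i class="flaticon-divx"></i></a>' \
            '<a href="/torrent/2349324/some-stuuf-here/">SAME stuff here</a>' \
            '<span class="comments"><i class="flaticon-message"></i>1</span>' \
            '</td>'

# Split wherever either delimiter occurs. The slash must be escaped
# inside a /.../ regex literal.
parts = my_string.split(/href="|\/">/)

# The same pattern as a %r{...} literal, where / needs no escaping.
same_parts = my_string.split(%r{href="|/">})

# In this sample the string is cut at both href=" occurrences and at the
# single /">, giving four pieces; the link is the third one.
link = parts[2]
#=> "/torrent/2349324/some-stuuf-here"
```

Because the position of the wanted piece depends on how many href=" attributes come before it, scan is usually the less fragile choice when only the link itself is needed.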

edited Aug 02 '17 at 13:00

answered Aug 02 '17 at 12:38

Gerry


Thank You very much Sir this works very well and also does what is needed. But with respect it does not answer my question. The question is "How do i split using 2 keywords using regex" in my case. Basically How do i merge those 2 regex splits. The point here is to learn and not just to come to the answer. The required string could be extracted by splitting the string with `"` and then selecting the 9th index but sincerely I want to learn and not just jump to the conclusions. Thanks anyways. Please add the solution to your answer. Thanks again for the answer. – Rishav Aug 02 '17 at 12:53
@Rishav You can use `|` to denote _or_, for example: `my_string.split(/(?:href="|\/">)/)` will split on either `href="` or `/">`. Is that what you are looking for? – Gerry Aug 02 '17 at 12:58
Damn That effort though. Upvoting. :) – Aug 02 '17 at 13:00
