0

I have a string containing the following (in reality it has no line breaks):

<td class="coll-1 name">
  <a href="/sub/48/0/" class="icon"><i class="flaticon-divx"></i></a>
  <a href="/torrent/2349324/some-stuuf-here/">SAME stuff here</a>
  <span class="comments"><i class="flaticon-message"></i>1</span>
</td>

I want to split this string on both href=" and /"> and store the pieces in an array. How can I do that? I have tried this:

new_array=my_string.split(/ href="  ,   \/">/)

Edit:

.split(/href="/)

This works well, but it only handles the first delimiter.

.split(/\/">/)

This works too, but I am unable to combine the two into a single split.

Stefan
Rishav
  • 5
    [You can't parse (X)HTML with regex](https://stackoverflow.com/a/1732454/477037). Anyway – what is your expected result? – Stefan Aug 02 '17 at 11:57
  • @Stefan `/torrent/2349324/some-stuuf-here` is my expected result. – Rishav Aug 02 '17 at 12:01
  • Why not `/sub/48/0/`? How do you determine the correct link? – Stefan Aug 02 '17 at 12:06
  • @Stefan That's the whole point. I am trying to make a bot that gives me `/torrent/2349324/some-stuuf-here/` from a webpage. I have the relevant line from the webpage stored in `my_string`, and I just want to extract the address from it. I determine the correct link by knowing that it always ends with `/">`. – Rishav Aug 02 '17 at 12:09
  • 1
    Please edit your question rather than elaborating in comments. Not all readers see all comments. – Cary Swoveland Aug 02 '17 at 15:07

2 Answers

2

Given this string:

string = <<-HTML
  <td class="coll-1 name">
    <a href="/sub/48/0/" class="icon"><i class="flaticon-divx"></i></a>
    <a href="/torrent/2349324/some-stuuf-here/">SAME stuff here</a>
    <span class="comments"><i class="flaticon-message"></i>1</span>
  </td>
HTML

Assuming that the correct link is the one without the icon class, you could use the CSS selector a:not(.icon), for example via Nokogiri:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(string)

doc.at_css('a:not(.icon)')[:href]
#=> "/torrent/2349324/some-stuuf-here/"
Stefan
1

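You can take advantage of lookahead and lookbehind, like this:

my_string.scan(/(?<=href=")[^"]*(?=\/">)/)
#=> ["/torrent/2349324/some-stuuf-here"]

This returns an array with every occurrence of href=" ... /">, keeping only the ... part. The [^"]* in the middle matches any run of characters other than a double quote, so a match cannot run past the end of one attribute value when the string has no line breaks.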
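
As a rough check of the lookaround approach, the sketch below runs it against a single-line copy of the markup from the question. The `[^"]*` in the middle and the variable names `narrow` and `greedy` are assumptions made for illustration; the greedy `.*` variant is included only to show how the two behave differently when the input has no line breaks.

```ruby
# Single-line copy of the markup from the question (no line breaks).
my_string = '<td class="coll-1 name">' \
            '<a href="/sub/48/0/" class="icon"><i class="flaticon-divx"></i></a>' \
            '<a href="/torrent/2349324/some-stuuf-here/">SAME stuff here</a>' \
            '<span class="comments"><i class="flaticon-message"></i>1</span>' \
            '</td>'

# [^"]* stops at the closing quote of each attribute value,
# so a single match cannot span two links.
narrow = my_string.scan(/(?<=href=")[^"]*(?=\/">)/)

# .* is greedy: it starts at the first href=" and runs on to the
# last position that is followed by /">.
greedy = my_string.scan(/(?<=href=").*(?=\/">)/)

p narrow
p greedy.first.start_with?('/sub/48/0/')
```

On input that does contain line breaks, `.` does not cross a newline by default, so the two patterns can agree there; the difference only shows up on one-line input like this.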

Or you can match everything of the form href=".../"> and then remove the leading href=" and the trailing /">, something like this:

my_string.scan(/href="[^"]*\/">/).map { |e| e.gsub(/href="|\/">/, "") }
#=> ["/torrent/2349324/some-stuuf-here"]

The scan returns an array of all substrings that match /href="[^"]*\/">/, and the gsub strips the two delimiters from each one.

How do i split using 2 keywords using regex

You can use | to denote "or" in a regex. A / inside a /.../ literal has to be escaped as \/:

my_string.split(/href="|\/">/)
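
A small sketch of what a combined split returns on a single-line copy of the question's markup. It assumes the wanted link is the only place where `/">` occurs; the variable name `parts` and the index 2 are illustrative and depend on this exact input.

```ruby
# Single-line copy of the markup from the question (no line breaks).
my_string = '<td class="coll-1 name">' \
            '<a href="/sub/48/0/" class="icon"><i class="flaticon-divx"></i></a>' \
            '<a href="/torrent/2349324/some-stuuf-here/">SAME stuff here</a>' \
            '<span class="comments"><i class="flaticon-message"></i>1</span>' \
            '</td>'

# Split on either delimiter: href="  or  /">
# The slash is escaped because the pattern is written as a /.../ literal.
parts = my_string.split(/href="|\/">/)

# %r{...} is an alternative literal form where the slash needs no escaping.
same_parts = my_string.split(%r{href="|/">})

parts.each_with_index { |part, i| puts "#{i}: #{part}" }
puts parts[2]
```

Picking a piece by position is fragile if the markup changes, which is one reason the scan-based and parser-based approaches may be easier to maintain.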
Gerry
  • Thank you very much, this works very well and does what is needed. But with respect, it does not answer my question, which is "How do i split using 2 keywords using regex" in my case, i.e. how do I merge those 2 regex splits? The point here is to learn and not just to arrive at the answer. The required string could be extracted by splitting the string on `"` and then selecting the 9th index, but I sincerely want to learn and not just jump to conclusions. Please add the solution to your answer. Thanks again for the answer. – Rishav Aug 02 '17 at 12:53
  • @Rishav You can use `|` to denote _or_, for example: `my_string.split(/href="|\/">/)` will split on either `href="` or `/">`. Is that what you are looking for? – Gerry Aug 02 '17 at 12:58
  • 1
    Damn That effort though. Upvoting. :) –  Aug 02 '17 at 13:00