-4

My input is:
<span question_number="18"> blah blah blah 1</span><span question_number="19"> blah blah blah 2</span>

I want my regex to match the pattern <span question_number="somenumber">xxxx</span>
and capture two values: (1) somenumber and (2) xxxx.

I wrote a naive solution that handles this input:
<span question_number="18"> blah blah blah 1</span>
<span question_number="19"> blah blah blah 2</span>
Note that the two spans are on different lines.
The output is: 18, blah blah blah 1 and 19, blah blah blah 2

But when the input is <span question_number="18"> blah blah blah 1</span><span question_number="19"> blah blah blah 2</span>,
with both spans on the same line,

my output is: 18, blah blah blah 1</span><span question_number="19"> blah blah blah 2

How can I work around this problem?

Update: regex: /\<span question_number=(?:\")*(\d*)(?:\")*>(.*)<\/span>/ig

Test input:
Case 1: the spans are on two lines
<span question_number="54">often graces doorways tied into ropes called</span>
<span question_number="54">often graces doorways tied into ropes called <i>ristras</i>.</span>
Case 2: the spans are on one line
<span question_number="54">often graces doorways tied into ropes called</span><span question_number="54">often graces doorways tied into ropes called <i>ristras</i>.</span>

Update2:
This is not a DOM; it is just plain text that I want to process.

Update3: My regex problem is solved. Now I have a question about comparing the processing speed of the regex against a DOM operation. How could I implement such a test?

MohanL
  • Why are you matching HTML with a regular expression? http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – epascarello Sep 07 '16 at 13:05
  • I urge you to read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 before it's too late – Jaromanda X Sep 07 '16 at 13:05
  • Please may someone edit this? – evolutionxbox Sep 07 '16 at 13:06
  • Please, PLEASE: do not use regexps to parse HTML! (see http://stackoverflow.com/a/1732454/709439 :-) – MarcoS Sep 07 '16 at 13:07
  • @epascarello this is not actually HTML; it is not from a webpage, it is just a plain string. – MohanL Sep 07 '16 at 13:16
  • So you tagged it with JavaScript so you can make it into DOM and query it...Regular expression is not ideal here. And you really do not need to repeat the same thing 4 times. – epascarello Sep 07 '16 at 13:17
  • It may be a plain string, but it still contains what very much looks like HTML - use an HTML parser. If you're in a browser environment, you have one readily available. – James Thorpe Sep 07 '16 at 13:18
  • @epascarello, anyway, could I do it in a Ruby environment? – MohanL Sep 07 '16 at 13:22
  • @JamesThorpe actually, this is under a ruby environment – MohanL Sep 07 '16 at 13:23
  • Great. A quick search indicates that there are [dom parsers](http://www.nokogiri.org/) available for Ruby too. – James Thorpe Sep 07 '16 at 13:23
  • The simplest solution is to make the pattern lazy (not greedy) by adding a `?` after the star (e.g. `(.*?)<\/span>`), but that wouldn't be terribly efficient. To do it properly, regex is not a good solution. As others have said, use an HTML parser to load it into a DOM and then read it that way. – Steven Doggart Sep 07 '16 at 13:25
  • @StevenDoggart wow, that is exactly what I am looking for. Even if it is not that efficient, would it still be faster than the DOM operation? If not, is there any way I could test the speed? – MohanL Sep 07 '16 at 13:33
  • It is considered impolite to change your question in such a way as to invalidate other people's hard work. In this particular case, multiple people had already put in significant work to solve your problem in JavaScript when you all of a sudden changed your mind and now want a Ruby solution instead. It would be more polite to ask a separate question about Ruby rather than throwing all the hard work away that people have already put into your JavaScript problem. – Jörg W Mittag Sep 07 '16 at 13:38
  • @JörgWMittag my bad – MohanL Sep 07 '16 at 13:52
  • @MohanL Jörg W Mittag's comment also applies to your follow-up question ("Update3"). Please post a separate question regarding the benchmark. – Stefan Sep 07 '16 at 13:57
  • Please do not use "edit" or "update" tags in your question (or answers) as it results in text that is hard to read. Instead, merge the changes into the text as if they were there originally. We can see what changed if we need to. Also, please read the formatting help which helps us understand what you are asking. The easier it is for us to read, the more quickly and accurately we can help you. – the Tin Man Sep 07 '16 at 18:57

4 Answers

3

Although you are not parsing an entire HTML document, your input obviously contains HTML elements.

In either case, Nokogiri is the library of choice:

require 'nokogiri'

input = '<span question_number="18"> blah blah blah 1</span><span question_number="19"> blah blah blah 2</span>'

doc = Nokogiri::HTML.fragment(input)
doc.css('span').map { |s| [s[:question_number], s.text] }
#=> [["18", " blah blah blah 1"], ["19", " blah blah blah 2"]]
Stefan
1

If it really isn't HTML (hmm?) you could do it with

<span question_number="(\d+)">(.*?)<\/span>

See it here at regex101.

The problem with your original regex is that it's greedy. The part (.*) matches as many characters as it can while still leaving a final <\/span> to match. So it finds the first <span... and matches up to the last </span>. My attempt at a solution is non-greedy (the ? in (.*?)), so it matches only up to the first </span>.
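
To make the difference concrete, here is a minimal Ruby sketch that runs both variants against the single-line input from the question. The method names `scan_greedy` and `scan_lazy` are made up for illustration, and the patterns assume the attribute value is always a double-quoted number.

```ruby
# Greedy: (.*) runs to the LAST </span> on the line.
def scan_greedy(text)
  text.scan(/<span question_number="(\d+)">(.*)<\/span>/)
end

# Lazy: (.*?) stops at the FIRST </span> that lets the match succeed.
def scan_lazy(text)
  text.scan(/<span question_number="(\d+)">(.*?)<\/span>/)
end

# Both spans on one line, as in the question.
input = '<span question_number="18"> blah blah blah 1</span>' \
        '<span question_number="19"> blah blah blah 2</span>'

p scan_greedy(input).length
# 1 (a single match that swallows both spans)
p scan_lazy(input)
# [["18", " blah blah blah 1"], ["19", " blah blah blah 2"]]
```

The two-line input already worked with the greedy pattern because `.` does not match a newline by default, so the match could not run past the end of the first line. The lazy pattern also copes with the nested `<i>ristras</i>` case, but it would still break on a `<span>` nested inside another `<span>`.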

SamWhan
1

Even though you insist that this isn't HTML, it sure looks and smells like it, and it can, in fact, easily be parsed by an HTML parser:

require 'nokogiri'

doc = Nokogiri::HTML.fragment <<~'HTML'
  <span question_number="54">often graces doorways tied into ropes called</span> 
  <span question_number="54">often graces doorways tied into ropes called <i>ristras</i>.</span>
HTML

doc.xpath('span').map {|span| next span[:question_number].to_i, span.text }
#=> [[54, "often graces doorways tied into ropes called"], [54, "often graces doorways tied into ropes called ristras."]]

It is not quite clear to me why you insist on not using an HTML parser for something that is obviously HTML.

Jörg W Mittag
0

I've looked at this problem as if there was a string involved - not a DOM environment. At the end of the day it's < and > that suddenly make it HTML. If you are in control of that string and you understand what it will contain and the boundaries of it then there are many solutions to a problem if it's specific to your needs.

Anyway, if you are looking for an answer and you know all of your questions absolutely live inside a <span> with an attribute of "question_number" then I guess you could do something like this. No Regex.

This is a simple version demonstrating how you could extract the information from an HTML string. For simplicity I've put it inside a textarea so you can see it working. You can copy this code and run it.

However, in reality you will probably want to get the innerHTML value of a container that you know contains all of the <span> tags.

I know there would be a number of different ways of solving this as many suggested but this is an answer to your specific need.

<html><body>
    <textarea id='htmlstring'>
        <div>Random HTML Before</div>
        <span question_number="18">blah blah blah 1</span>
        <span question_number="19">blah blah blah 1</span>
        <span question_number="21">blah blah blah 1</span>
        <span question_number="22">blah blah blah 1</span>
        <div>Random HTML After</div>
    </textarea>
    <script type="text/javascript">
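        var t = document.getElementById('htmlstring');
        // Each piece after the first starts with the quoted question number.
        var q = t.value.split("<span question_number=");
        q.shift(); // drop the text before the first span
        // Use an index loop: for...in on an array also visits inherited keys.
        for (var i = 0; i < q.length; i++) {
            // Keep only the part before this span's closing tag.
            var d = q[i].split("</span>")[0];
            d = d.replace("\">", "|"); // end of the attribute becomes a separator
            d = d.replace("\"", "");   // strip the opening quote
            d = d.split("|");
            alert("num=" + d[0] + " val=" + d[1]);
        }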
    </script>
</body></html>
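
Since the question later moved to Ruby, here is a rough sketch of the same split-based idea without a regex. The method name `extract_questions` is hypothetical, and the sketch assumes every question sits in a `<span>` whose first attribute is a double-quoted `question_number`, with no nested spans.

```ruby
# Hypothetical helper: split-based extraction, no regex involved.
# Returns one [number, text] pair per span found in +text+.
def extract_questions(text)
  marker = '<span question_number="'
  chunks = text.split(marker)
  chunks.shift # drop everything before the first span

  chunks.map do |chunk|
    # Keep only the part before this span's closing tag.
    body = chunk.split('</span>').first.to_s
    # Split once on the end of the attribute: number on the left, text on the right.
    number, value = body.split('">', 2)
    [number, value]
  end
end

html = <<~HTML
  <div>Random HTML Before</div>
  <span question_number="18">blah blah blah 1</span>
  <span question_number="19">blah blah blah 2</span>
  <div>Random HTML After</div>
HTML

p extract_questions(html)
# [["18", "blah blah blah 1"], ["19", "blah blah blah 2"]]
```

Splitting once on `">` avoids the temporary `|` separator, so text that happens to contain a `|` is left intact.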
Mr Lister
Watts Epherson
  • Note: the OP removed the [tag:javascript] tag and added the [tag:ruby] tag about 10 minutes before you posted your answer (so presumably while you were writing it). Unfortunately, this invalidates your answer. – Jörg W Mittag Sep 07 '16 at 13:35
  • Hi, thank you for your work. I understand how to do the DOM operation, but do you know how to test the speed difference between using a regex and a DOM operation? – MohanL Sep 07 '16 at 13:48
  • I do not know the difference in speed between regex and DOM operation on the specific code you are parsing. I also don't know how many times you intend to perform the operation. Sorry I can't be of any further help. @JörgWMittag - Thanks for the heads-up! Yes, that's exactly what has happened! grrr :) – Watts Epherson Sep 15 '16 at 08:28
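
On the speed question raised in the comments, one way to measure it is Ruby's standard `Benchmark` module. The sketch below is only a starting point, not a rigorous benchmark: the method names, the sample size, and the iteration count are all assumptions chosen for illustration, and the Nokogiri candidate is skipped when the gem is not installed.

```ruby
require 'benchmark'

# Regex candidate: the lazy pattern suggested in the answers.
def extract_with_regex(text)
  text.scan(/<span question_number="(\d+)">(.*?)<\/span>/)
end

# DOM candidate: the Nokogiri approach suggested in the answers.
def extract_with_nokogiri(text)
  Nokogiri::HTML.fragment(text).css('span').map { |s| [s[:question_number], s.text] }
end

# Hypothetical helper: true if the nokogiri gem can be loaded.
def nokogiri_available?
  require 'nokogiri'
  true
rescue LoadError
  false
end

# Returns { label => elapsed seconds } for +iterations+ runs of each candidate.
def time_candidates(text, iterations)
  candidates = { 'regex' => method(:extract_with_regex) }
  candidates['nokogiri'] = method(:extract_with_nokogiri) if nokogiri_available?

  candidates.transform_values do |extractor|
    Benchmark.realtime { iterations.times { extractor.call(text) } }
  end
end

# Assumed workload: 100 copies of the two-span line from the question.
sample = ('<span question_number="18"> blah blah blah 1</span>' \
          '<span question_number="19"> blah blah blah 2</span>') * 100

time_candidates(sample, 200).each do |label, seconds|
  puts format('%-10s %.4fs', label, seconds)
end
```

The numbers only mean something for input that resembles the real data, so it is worth substituting a representative sample. The two candidates are also not strictly equivalent: the regex keeps inner markup such as `<i>ristras</i>` in the captured text, while Nokogiri's `text` drops it.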