I've searched half the internet looking for help with my case.

So, here is what I need.

I have an HTML structure to parse that looks like this:

<div class="foo">
  <div class='bar' dir='ltr'>
    <div id='p1' class='par'>
      <p class='sb'>
        <span id='dc_1_1' class='dx'>
          <a href='/bar32560'>1</a>
        </span>
        Neque porro 
        <a href='/xyz' class='mr'>+</a>
        quisquam est 
        <a href='/xyz' class='mr'>+</a>
        qui. 
      </p>
    </div>
    <div id='p2' class='par'>
      <p class='sb'>
        <span id='dc_1_2' class='dx'>
          <a href='/foo12356'>2</a>
        </span>
        dolorem ipsum 
        <a href='/xyz' class='mr'>+</a>
        quia dolor sit amet, 
        <a href='/xyz' class='mr'>+</a>
        consectetur, adipisci velit.
      </p>
    </div>
    <div id='p3' class='par'>
      <p class='sb'>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>3</a>
        </span>
        Neque porro quisquam 
        <a href='/xyz' class='mr'>+</a>
        est qui dolorem ipsum quia dolor sit 
        <a href='/xyz' class='mr'>+</a>
        amet, t.
        <a href='/xyz' class='mr'>+</a>
        <span id='dc_1_4' class='dx'>
          <a href='/barefoot4135'>4</a>
        </span>
        consectetur, 
        <a href='/xyz' class='mr'>+</a>
        adipisci veli.
        <span id='dc_1_5' class='dx'>
          <a href='/barfoo05123'>5</a>
       </span>
       Neque porro 
       <a href='/xyz' class='mr'>+</a>
       quisquam est
       <a href='/xyz' class='mr'>+</a>
       qui.
     </p>
   </div>
 </div>
</div>

What I need: scrape each paragraph, but split the final text into one object per numbered span, in this form:

scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body 4 => 4 consectetur, adipisci veli.
scraped_body 5 => 5 Neque porro quisquam est qui.

The code I'm using for now:

page = Nokogiri::HTML(open(url))
x = page.css('.mr').remove
x.xpath("//div[contains(@class, 'par')]").map do |node|
  body = node.text
end

My result looks like this:

scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t. 4 consectetur, adipisci veli. 5 Neque porro quisquam est qui.

So this scrapes the whole text of each div with class 'par'. I need to scrape the text after each span together with the span's content (the number), or cut those divs before each span.

I need something like:

SPAN.text + P.text - a.mr

I don't know how to do this.

Please help me with this parsing. I guess I need to scrape the text after (or before) each span.

I've tried everything I could find.


EDIT DUCK @Duck1337:

I use the following code:

def verses
  page = Nokogiri::HTML(open(url))
  i = 0
  x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM").map do |node|
    i += 1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end

I need this because I'm parsing a big website full of text, and there are a few more methods. My final output looks like:

Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam est qui.

But if a single verse contains multiple sentences, your code splits it at every sentence, which is too much splitting.

For example:

    <div id='p1' class='par'>
      <p class='sb'>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>1</a>
        </span>
        Neque porro quisquam. Est qui dolorem
        <a href='/xyz' class='mr'>+</a>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>2</a>
        </span>
        est qui dolorem ipsum quia dolor sit. 
        <a href='/xyz' class='mr'>+</a>
        amet, t.

Your code splits it like this:

Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam.
Saved record with: book: 1, chapter: 1, verse: 2, body: Est qui dolorem
Saved record with: book: 1, chapter: 1, verse: 3, body: 2 est qui dolorem ipsum quia dolor sit.

I hope you see what I mean. Really big thanks to you for that. If you can modify this, it would be great!


EDIT: @KARDEIZ

Thanks for the answer! When I use your code inside my method, it parses really random stuff.

def verses
  page = Nokogiri::HTML(open(url))
  i=0
  #page.css(".mr").remove
  page.xpath("//div[contains(@class, 'par')]//span").map do |node|
    node.content.strip.tap do |out|
      while nn = node.next
        break if nn.name == 'span'
        out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
        node = nn
      end
    end
    i+=1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end

The output looks like this:

Saved record with: book: 1, chapter: 1, verse: 1, body:  <the last part of the last sentence in the first paragraph, after a "+" link and before the last "+" link>
Saved record with: book: 1, chapter: 1, verse: 2, body:  <the last part of the last sentence in the second paragraph, after a "+" link and before the last "+" link>
Saved record with: book: 1, chapter: 1, verse: 3, body:
Saved record with: book: 1, chapter: 1, verse: 4, body:
Saved record with: book: 1, chapter: 1, verse: 5, body:  <the last sentence in the third paragraph, which comes after the last "+" link in that paragraph>

As you can see, I don't know how it makes such a mess ;] Can you do something more with that? Thanks a lot!


Regards!

hash4di

3 Answers

0

I saved your input as "temp.html" on my desktop.

require 'open-uri'
require 'nokogiri'

$page_html = Nokogiri::HTML.parse(open("/home/user/Desktop/temp.html"))

output = $page_html.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM")

# I found the pattern ". " at the end of every line, so I replaced ". " with ". HAM"
# using gsub(". ", ". HAM").

# Then I split the string on " HAM", which preserves the "." in each item of the array.


output = ["1 Neque porro quisquam est qui.", "2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.", "3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.", "4 consectetur, adipisci veli.", "5 Neque porro quisquam est qui."]
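
To make the trade-off concrete, here is a small sketch of the same gsub/split chain run on plain strings, without Nokogiri. The method names and sample strings are made up for illustration. It also shows one possible alternative that splits before each verse number instead of after each sentence; that alternative assumes the verse text itself never contains a standalone number.

```ruby
# Same chain as above: drop "+", squeeze whitespace, split after every ". ".
def split_on_sentences(text)
  text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM")
end

# Hypothetical alternative: split at whitespace that is followed by "<digits> ".
# Assumption: a bare number inside a verse would be mistaken for a new verse.
def split_on_verse_numbers(text)
  text.gsub("+", " ").split.join(" ").split(/\s+(?=\d+\s)/)
end

one_sentence_per_verse = "1 Neque porro + quisquam est + qui. 2 dolorem ipsum."
multi_sentence_verse   = "1 Neque porro quisquam. Est qui dolorem + 2 est qui dolorem ipsum."

p split_on_sentences(one_sentence_per_verse)
p split_on_sentences(multi_sentence_verse)     # verse 1 is cut in two, verse 2 is glued on
p split_on_verse_numbers(multi_sentence_verse) # one entry per verse number
```

The sentence-based split only works while every verse is exactly one sentence, so it is fragile for real text.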

EDIT:

%w[nokogiri open-uri].each { |gem| require gem }

$url = "/home/user/Desktop/temp.html"

def verses
  page = Nokogiri::HTML(open($url))
  i = 0
  # Note: String#split only yields to a block on Ruby 2.6+, where it returns the original string.
  x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM") do |node|
    i += 1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end
Duck1337
  • Thanks @Duck1337 for the answer. Really sorry, but I forgot a very important part of the HTML structure. In each paragraph, besides the span elements, there are a href links shown as "+" signs that point to a dictionary entry explaining the preceding part of the text. So the pattern is more complicated, because these links appear in random places. I've edited my question to be more accurate and complete. – hash4di Jul 16 '14 at 07:22
  • I added another .gsub("+", " ") so it removes the "+" text of those a href links. – Duck1337 Jul 16 '14 at 14:28
  • Thanks @Duck1337, but I still have a problem. Please review the EDIT in my question: EDIT DUCK. Thanks a lot! – hash4di Jul 16 '14 at 18:59
  • Try, x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM") do |node| instead of x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM").map do |node| – Duck1337 Jul 16 '14 at 20:11
0

Try something like:

x.xpath("//div[contains(@class, 'par')]//span").map do |node|
  out = node.content.strip
  if following = node.at_xpath('following-sibling::text()')
    out << ' ' << following.content.strip
  end
  out
end

The following-sibling::text() XPath matches the text nodes that follow the span, and at_xpath returns the first of them.

EDIT

I think this does what you want:

html.xpath("//div[contains(@class, 'par')]//span").map do |node|
  node.content.strip.tap do |out|
    while nn = node.next
      break if nn.name == 'span'
      out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
      node = nn
    end
  end  
end

outputs:

[
  "1 Neque porro quisquam est qui.",
  "2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.",
  "3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.",
  "4 consectetur, adipisci veli.",
  "5 Neque porro quisquam est qui."
]

It's also possible to do this with pure XPath (see "XPath axis, get all following nodes until"), but this solution is simpler from a coding perspective.

EDIT 2

Try this:

def verses
  page = Nokogiri::HTML(open(url))
  i=0
  page.xpath("//div[contains(@class, 'par')]//span").map do |node|
    body = node.content.strip.tap do |out|
      while nn = node.next
        break if nn.name == 'span'
        out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
        node = nn
      end
    end
    i+=1
    VerseSource.new(body, book_num, number, i)
  end
end
Jacob Brown
  • Thanks @kardeiz. Sorry, but I forgot a very important thing about the HTML structure. In each paragraph there are a href links with class .mr, shown as "+" signs, that link to a dictionary entry explaining the preceding part of the text. With your solution I get only the first text node after the span (I tried this before too). That's not what I need, because in the first paragraph, for example, it scrapes only: Neque porro – hash4di Jul 16 '14 at 07:13
  • I've edited my question to be more accurate and complete. Please, look again. Thanks again! – hash4di Jul 16 '14 at 07:24
  • Thanks for the answer and the update. I tried your code and had problems. Please review my EDIT: KARDEIZ. I hope it's clear and easy to read. Thanks! – hash4di Jul 16 '14 at 19:13
  • @hash4di, I've updated my answer. Is `body` supposed to be a node or a string? In my updated answer, `body` will be set to the string value mentioned previously, e.g.: "1 Neque porro quisquam est qui." – Jacob Brown Jul 16 '14 at 19:37
  • PERFECT!!! It's 11 PM in my time zone, so I'm not in great shape, but this 'looks legit'. Thanks for now :) I'll check this tomorrow. Cheers! – hash4di Jul 16 '14 at 21:00
0
require 'nokogiri'

your_html =<<END_OF_HTML
<your html here>
END_OF_HTML

doc  = Nokogiri::HTML(your_html)
text_nodes = doc.xpath("//div[contains(@class, 'par')]/p/child::text()")

results = text_nodes.reject do |text_node| 
  text_node.text.match /\A \s+ \z/x  #Eliminate whitespace nodes
end

results.each_with_index do |node, i|
  puts "scraped_body#{i+1} => #{node.text.strip}"
end


--output:--
scraped_body1 => Neque porro quisquam est qui.
scraped_body2 => dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.
scraped_body3 => Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body4 => consectetur, adipisci veli.
scraped_body5 => Neque porro quisquam est qui.

Answer for the new HTML:

require 'nokogiri'

html = <<END_OF_HTML
your new html here
END_OF_HTML

html_doc  = Nokogiri::HTML(html)
current_group_number = nil
non_ws_text = []  #non_whitespace_text for each group

html_doc.css("div.par > p").each do |p|   #p's that are direct children of <div class="par">
  p.xpath("./node()").each do |node|  #All Text and Element nodes that are direct children of p tag.
    case node
    when  Nokogiri::XML::Element
      if node.name == 'span'
        node.xpath(".//a").each do |a|  #Step through all the <a> tags inside the <span>
          md = a.text.match(/\A (\d+) \z/xm)  #Check for numbers

          if md  #Then found a number, so it's the start of the next group
            if current_group_number  #then print the results for the current group
              print "scraped_body #{current_group_number} => "
              puts "#{current_group_number} #{non_ws_text.join(' ')}"
              non_ws_text = []
            end
            current_group_number = md[1] #Record the next group number 
            break  #Only look for the first <a> tag containing a number
          end

        end
      end

    when Nokogiri::XML::Text
      text = node.text
      non_ws_text << text.strip if text !~ /\A \s+ \z/xm 
    end

  end
end

#For the last group: 
print "scraped_body #{current_group_number} => "
puts "#{current_group_number} #{non_ws_text.join(' ')}"

--output:--
scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body 4 => 4 consectetur, adipisci veli.
scraped_body 5 => 5 Neque porro quisquam est qui.
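
The grouping step in the loop above (start a new group at each number, collect text until the next one) can also be sketched on its own with Enumerable#slice_before. The Token struct, the group_verses method, and the sample data are hypothetical and stand in for the flattened children of the p tags; this is an illustration of the grouping idea, not a replacement for the parsing code.

```ruby
# Hypothetical flattened form of a <p>'s children:
#   kind :number -> text of the <a> inside a verse-number <span>
#   kind :text   -> a non-blank, stripped text node
Token = Struct.new(:kind, :value)

def group_verses(tokens)
  tokens
    .slice_before { |t| t.kind == :number }            # new group at every number
    .select { |group| group.first.kind == :number }    # drop text before the first number
    .map { |group| [group.first.value, group.drop(1).map(&:value).join(' ')] }
    .to_h
end

tokens = [
  Token.new(:number, "3"),
  Token.new(:text, "Neque porro quisquam"),
  Token.new(:text, "est qui"),
  Token.new(:number, "4"),
  Token.new(:text, "consectetur,"),
  Token.new(:text, "adipisci veli.")
]

group_verses(tokens).each do |number, body|
  puts "scraped_body #{number} => #{number} #{body}"
end
```

Returning a hash keyed by verse number, rather than printing inside the loop, avoids the separate "last group" step after the loop.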
7stud
  • Thanks @7stud for the answer. Really sorry, but I forgot a very important part of the HTML structure. In each paragraph, besides the span elements, there are a href links shown as "+" signs that point to a dictionary entry explaining the preceding part of the text. So the pattern is more complicated, because these links appear in random places. I've edited my question to be more accurate and complete. But when I use your solution I don't get anything. Is there a typo in your regexp? – hash4di Jul 16 '14 at 07:50
  • @hash4di, I added a revised answer to my post. – 7stud Aug 02 '14 at 19:36