I have searched half the internet for help with my case.
So, here is what I need.
I have an HTML structure to parse, like this:
<div class="foo">
<div class='bar' dir='ltr'>
<div id='p1' class='par'>
<p class='sb'>
<span id='dc_1_1' class='dx'>
<a href='/bar32560'>1</a>
</span>
Neque porro
<a href='/xyz' class='mr'>+</a>
quisquam est
<a href='/xyz' class='mr'>+</a>
qui.
</p>
</div>
<div id='p2' class='par'>
<p class='sb'>
<span id='dc_1_2' class='dx'>
<a href='/foo12356'>2</a>
</span>
dolorem ipsum
<a href='/xyz' class='mr'>+</a>
quia dolor sit amet,
<a href='/xyz' class='mr'>+</a>
consectetur, adipisci velit.
</p>
</div>
<div id='p3' class='par'>
<p class='sb'>
<span id='dc_1_3' class='dx'>
<a href='/foobar4586'>3</a>
</span>
Neque porro quisquam
<a href='/xyz' class='mr'>+</a>
est qui dolorem ipsum quia dolor sit
<a href='/xyz' class='mr'>+</a>
amet, t.
<a href='/xyz' class='mr'>+</a>
<span id='dc_1_4' class='dx'>
<a href='/barefoot4135'>4</a>
</span>
consectetur,
<a href='/xyz' class='mr'>+</a>
adipisci veli.
<span id='dc_1_5' class='dx'>
<a href='/barfoo05123'>5</a>
</span>
Neque porro
<a href='/xyz' class='mr'>+</a>
quisquam est
<a href='/xyz' class='mr'>+</a>
qui.
</p>
</div>
</div>
</div>
What I need, in plain words: scrape each paragraph, but the final scraped text objects should look like this:
scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body 4 => 4 consectetur, adipisci veli.
scraped_body 5 => 5 Neque porro quisquam est qui.
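The code I use for now:
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(open(url))
page.css('.mr').remove   # drop the "+" links from the document
bodies = page.xpath("//div[contains(@class, 'par')]").map do |node|
  node.text              # text of the whole paragraph div
end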
My result looks like this:
scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t. 4 consectetur, adipisci veli. 5 Neque porro quisquam est qui.
So this scrapes the whole text of each div with class 'par'. I need to scrape the text that follows each span together with the span's content (the number), or to cut those divs before each span.
I need something like:
SPAN.text + P.text - a.mr
I don't know how to do this.
Please help me with this parsing. I guess I need to scrape after/before each span.
I've tried everything I could find.
EDIT for @Duck1337:
I use the following code:
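def verses
  page = Nokogiri::HTML(open(url))
  # Collapse the text of all <p> elements into one whitespace-normalized string
  text = page.css("p").text.gsub("+", " ").split.join(" ")
  # Split after every ". ", i.e. at the end of every sentence
  text.gsub(". ", ". HAM").split(" HAM").each_with_index.map do |body, index|
    VerseSource.new(body, book_num, number, index + 1)
  end
end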
I need this because I parse a big website with text, and there are a few more methods. So my final output looks like:
Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam est qui.
But if a single verse contains multiple sentences, your code splits it at every sentence, which is too much splitting.
For example:
<div id='p1' class='par'>
<p class='sb'>
<span id='dc_1_3' class='dx'>
<a href='/foobar4586'>1</a>
</span>
Neque porro quisquam. Est qui dolorem
<a href='/xyz' class='mr'>+</a>
<span id='dc_1_3' class='dx'>
<a href='/foobar4586'>2</a>
</span>
est qui dolorem ipsum quia dolor sit.
<a href='/xyz' class='mr'>+</a>
amet, t.
Your code splits it like this:
Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam.
Saved record with: book: 1, chapter: 1, verse: 2, body: Est qui dolorem
Saved record with: book: 1, chapter: 1, verse: 3, body: 2 est qui dolorem ipsum quia dolor sit.
I hope you see what I mean. Really BIG thanks to you for that. If you can modify this, it will be great!
EDIT for @KARDEIZ:
Thanks for the answer! When I use your code inside my method, it parses really random stuff:
def verses
  page = Nokogiri::HTML(open(url))
  i=0
  #page.css(".mr").remove
  page.xpath("//div[contains(@class, 'par')]//span").map do |node|
    node.content.strip.tap do |out|
      while nn = node.next
        break if nn.name == 'span'
        out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
        node = nn
      end
    end
    i+=1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end
The output looks like this:
Saved record with: book: 1, chapter: 1, verse: 1, body: <here is last part of last sentence in first paragraph after "+" sign(href) and before last "+"(href)>
Saved record with: book: 1, chapter: 1, verse: 2, body: <here is last part of last sentence in second paragraph after "+" sign(href) and before last "+"(href)>
Saved record with: book: 1, chapter: 1, verse: 3, body:
Saved record with: book: 1, chapter: 1, verse: 4, body:
Saved record with: book: 1, chapter: 1, verse: 5, body: <here is the last sentence in the third paragraph. It is after the last "+" in this paragraph and has no more "+" signs(href)>
As you can see, I don't know how it makes such a mess ;] Can you do something more with that? Thanks a lot!
Regards!