
I have a string that mixes plain text, extra whitespace, and carriage returns with XML-like tags:

string = "hi there.

<SET-TOPIC> INITIATE </SET-TOPIC>

<SETPROFILE>
   <KEY>name</KEY>
   <VALUE>Joe</VALUE>
</SETPROFILE>

 <SETPROFILE>
   <KEY>email</KEY>
   <VALUE>Email@hi.com</VALUE>
</SETPROFILE>

<GET-RELATIONS>
  <COLLECTION>goals</COLLECTION>
  <VALUE>walk upstairs</VALUE>
</GET-RELATIONS>
So what do you think?

Is it true?
 "

I want to parse this the way Nori, Nokogiri, or Ox convert XML to a hash.

My goal is to easily pull out the top-level tags as keys and then know all the elements under each one, something like:

Keys = ['SETPROFILE', 'SETPROFILE', 'SET-TOPIC', 'GET-RELATIONS']

Values[0] = [{name => Joe}, {email => Email@hi.com}]
Values[3] = [{collection => goals}, {value => walk upstairs}]

I have seen several functions that do this for real XML, but my input is only partially XML.

I started going down this line of thinking:

parsed = doc.search('*').each_with_object({}) do |n, h| 
  (h[n.name] ||= []) << n.text 
end
Satchel

2 Answers


I'd probably do something along these lines if I wanted the keys and values variables:

require 'nokogiri'

string = "hi there.

<SET-TOPIC> INITIATE </SET-TOPIC>

<SETPROFILE>
   <KEY>name</KEY>
   <VALUE>Joe</VALUE>
</SETPROFILE>

 <SETPROFILE>
   <KEY>email</KEY>
   <VALUE>Email@hi.com</VALUE>
</SETPROFILE>

<GET-RELATIONS>
  <COLLECTION>goals</COLLECTION>
  <VALUE>walk upstairs</VALUE>
</GET-RELATIONS>
So what do you think?

Is it true?
"

doc = Nokogiri::XML('<root>' + string + '</root>', nil, nil, Nokogiri::XML::ParseOptions::NOBLANKS)

nodes = doc.root.children.reject { |n| n.is_a?(Nokogiri::XML::Text) }.map { |node| 
  [
    node.name, node.children.map { |c|
      [c.name, c.content]
    }.to_h
  ]
}
nodes
# => [["SET-TOPIC", {"text"=>" INITIATE "}],
#     ["SETPROFILE", {"KEY"=>"name", "VALUE"=>"Joe"}],
#     ["SETPROFILE", {"KEY"=>"email", "VALUE"=>"Email@hi.com"}],
#     ["GET-RELATIONS", {"COLLECTION"=>"goals", "VALUE"=>"walk upstairs"}]]

From nodes it's possible to grab the rest of the detail:

keys = nodes.map(&:first)
# => ["SET-TOPIC", "SETPROFILE", "SETPROFILE", "GET-RELATIONS"]

values = nodes.map(&:last)
# => [{"text"=>" INITIATE "},
#     {"KEY"=>"name", "VALUE"=>"Joe"},
#     {"KEY"=>"email", "VALUE"=>"Email@hi.com"},
#     {"COLLECTION"=>"goals", "VALUE"=>"walk upstairs"}]

values[0] # => {"text"=>" INITIATE "}
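If the goal is to collect every occurrence of a tag under one key, as the question sketches, the pairs in nodes can be grouped by name. This is only a sketch: it pastes in the nodes result shown above as a literal so it runs without Nokogiri, and the names grouped and profile are made up for illustration.

```ruby
# Literal copy of the `nodes` result shown above (an assumption made so the
# sketch runs on its own; in real code this comes from the Nokogiri pass).
nodes = [
  ["SET-TOPIC", { "text" => " INITIATE " }],
  ["SETPROFILE", { "KEY" => "name", "VALUE" => "Joe" }],
  ["SETPROFILE", { "KEY" => "email", "VALUE" => "Email@hi.com" }],
  ["GET-RELATIONS", { "COLLECTION" => "goals", "VALUE" => "walk upstairs" }]
]

# Group the child hashes by top-level tag name, keeping document order.
grouped = nodes.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(name, fields), h|
  h[name] << fields
end

grouped.keys
# => ["SET-TOPIC", "SETPROFILE", "GET-RELATIONS"]

grouped["SETPROFILE"]
# => [{"KEY"=>"name", "VALUE"=>"Joe"}, {"KEY"=>"email", "VALUE"=>"Email@hi.com"}]

# Optionally fold the KEY/VALUE pairs of every SETPROFILE into a single hash.
profile = grouped["SETPROFILE"].map { |f| [f["KEY"], f["VALUE"]] }.to_h
# => {"name"=>"Joe", "email"=>"Email@hi.com"}
```

Grouping loses the interleaving between different tags, so keep the original nodes array around if the relative order of SET-TOPIC and SETPROFILE matters.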

If you'd rather, it's possible to pre-process the DOM and remove the top-level text:

doc.root.children.select { |n| n.is_a?(Nokogiri::XML::Text) }.each(&:remove)
doc.to_xml
# => "<root><SET-TOPIC> INITIATE </SET-TOPIC><SETPROFILE><KEY>name</KEY><VALUE>Joe</VALUE></SETPROFILE><SETPROFILE><KEY>email</KEY><VALUE>Email@hi.com</VALUE></SETPROFILE><GET-RELATIONS><COLLECTION>goals</COLLECTION><VALUE>walk upstairs</VALUE></GET-RELATIONS></root>\n"

That makes it easier to work with the XML.

the Tin Man
  • The hardest part is standing back and finding the best way to grab the chunks you need. That comes largely from experience. – the Tin Man Apr 27 '15 at 04:52

Wrap the string content in a root node so Nokogiri can parse it. The text outside the XML segments becomes text nodes inside the new node.

str = "hi there. .... Is it true?"
doc = Nokogiri::XML("<wrapper>#{str}</wrapper>")
segments = doc.xpath('/*/SETPROFILE')

Now you can use "Convert a Nokogiri document to a Ruby Hash" to convert the segments into a hash.

However, if the plain text contains characters that must be escaped in XML (such as & or <), you'll need to find and escape them yourself.
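As a rough illustration of that escaping step, the sketch below handles only the most common case: ampersands that are not already part of one of XML's five predefined entities or a numeric character reference. The helper name escape_bare_ampersands and the sample string are hypothetical, and a stray < in the plain text is not handled because it cannot be told apart from a tag without knowing more about the data.

```ruby
# Matches "&" unless it already starts a predefined XML entity
# (&amp; &lt; &gt; &quot; &apos;) or a numeric character reference.
BARE_AMPERSAND = /&(?!(?:amp|lt|gt|quot|apos|#\d+|#x\h+);)/

# Hypothetical helper: escape bare ampersands so the wrapped string stays
# well-formed. It does not try to fix stray "<" characters.
def escape_bare_ampersands(text)
  text.gsub(BARE_AMPERSAND, '&amp;')
end

# Made-up sample input mixing a bare "&" with already-escaped ones.
str = "Tom & Jerry say hi. <SET-TOPIC> Q&amp;A </SET-TOPIC> 1 &#38; 2"
wrapped = "<wrapper>#{escape_bare_ampersands(str)}</wrapper>"

puts wrapped
# <wrapper>Tom &amp; Jerry say hi. <SET-TOPIC> Q&amp;A </SET-TOPIC> 1 &#38; 2</wrapper>
```

The resulting wrapped string can then be handed to Nokogiri::XML as in the snippet above.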

Arie Xiao
  • Do I have to explicitly call out the path, such as SETPROFILE? I was hoping that any top level nodes could be in an array such as parsed.keys = ['SETPROFILE','SETPROFILE','GET-OBJECT','SET-TOPIC'] and then the values in the node would be hash. Although in that case what happens if there are no beneath? – Satchel Apr 25 '15 at 13:08