7

I'm wondering if there's a function in Ruby like is_xml?(string) to identify if a given string is XML formatted.

sawa
mCY

2 Answers

20

Nokogiri's parse uses a simple regex test looking for <html> in an attempt to determine if the data to be parsed is HTML or XML:

string =~ /^\s*<[^Hh>]*html/ # Probably html

Something similar, looking for the XML declaration, would be a starting point:

string = '<?xml version="1.0"?><foo><bar></bar></foo>'
string.strip[/\A<\?xml/]
=> "<?xml"

If that returns anything other than nil, the string starts with the XML declaration. It's important to test for this because an empty string will fool the next steps:

Nokogiri::XML('').errors.empty?
=> true
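
The declaration test above can be wrapped in a small predicate. This is only a sketch: the method name `xml_declaration?` is made up for illustration and is not part of Ruby or Nokogiri. Keep in mind that the declaration is optional in XML 1.0, so a well-formed document such as `<foo/>` will not pass this test; treat it as a quick pre-filter, not as proof either way.

```ruby
# Hypothetical helper (name chosen for this example only).
# Returns true when the string, ignoring surrounding whitespace,
# begins with an XML declaration such as <?xml version="1.0"?>.
def xml_declaration?(string)
  string.to_s.strip.match?(/\A<\?xml/)
end

xml_declaration?('<?xml version="1.0"?><foo><bar></bar></foo>') # declaration present
xml_declaration?('<foo/>')                                      # no declaration
xml_declaration?('')                                            # empty string
```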

Nokogiri also has the errors method, which returns an array of the errors found while parsing a malformed document. Checking whether that array is empty helps:

Nokogiri::XML('<foo>').errors
=> [#<Nokogiri::XML::SyntaxError: Premature end of data in tag foo line 1>]
Nokogiri::XML('<foo>').errors.empty?
=> false

Nokogiri::XML(string).errors.empty?
=> true

would be true if the document is well-formed.


I just tested Nokogiri to see if it could tell the difference between a regular string and true XML:

[2] (pry) main: 0> doc = Nokogiri::XML('foo').errors
[
    [0] #<Nokogiri::XML::SyntaxError: Start tag expected, '<' not found>
]

So, you can loop through your files and sort them into XML and non-XML easily:

require 'nokogiri'

[
  '',
  'foo',
  '<xml></xml>'
].group_by { |s| !s.strip.empty? && Nokogiri::XML(s).errors.empty? }
=> {false=>["", "foo"], true=>["<xml></xml>"]}

Assign the result of group_by to a variable, and you'll have a hash you can check for non-XML (false) or XML (true).

the Tin Man
1

There is no such function in Ruby's String class or Active Support's String extensions, but you can use Nokogiri to detect errors in XML:

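require 'nokogiri'

badly_formed = '<foo><bar></foo>' # sample input with mismatched tags

begin
  # Strict mode raises on the first syntax error instead of recovering
  bad_doc = Nokogiri::XML(badly_formed) { |config| config.strict }
rescue Nokogiri::XML::SyntaxError => e
  puts "caught exception: #{e}"
end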
the Tin Man
nurettin
  • This doesn't tell us much: `Nokogiri::XML('') { |config| config.strict } => #`. An empty string isn't XML, nor is it correctly or incorrectly formatted. `Nokogiri::XML('').errors` will tell you if there are errors, but more clearly. – the Tin Man Dec 27 '12 at 09:32
  • @theTinMan right, the link has an example of .errors usage as well. `puts bad_doc.errors` – nurettin Dec 27 '12 at 12:58
  • Thanks for your answer. Now I know what to do~ – mCY Dec 28 '12 at 02:19