1

I have a very large XML file which I load as a string, so my XML looks like

<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
  <article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>

I want to count the number of occurrences of a string like

article ID="5705641" contentstatus="Changed"

How can I turn the ID part into a regex?

Here is what I have tried:

searchstr = 'article ID=\"/[1-9]{7}/\" contentstatus=\"Changed\"'
count = ((xml.scan(searchstr).length)).to_s
puts count

Please let me know how I can achieve this.

Thanks

Kyle Boon
bkone
  • 1
    Obligatory link for XML and regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Andrew Grimm Apr 28 '11 at 23:29

4 Answers

4

I'm going to go out on a limb and guess that you're new to Ruby. First, it's not necessary to convert count to a string before passing it to puts; puts automatically calls to_s on anything you send to it.

Second, it's rarely a good idea to handle XML with string manipulation. I would strongly advise that you use a full-fledged XML parser such as Nokogiri.

That said, you can't embed a regex in a string like that. The entire query string would need to be a regex.

Something like

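/article ID="\d{7}" contentstatus="Changed"/

Quotation marks aren't special characters in a regex, so you don't need to escape them. The digit class is \d rather than the [1-9] from your attempt, since [1-9] excludes 0 and would miss an ID such as 5705641.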
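A minimal sketch of the difference, assuming a small well-formed sample rather than the real file: String#scan treats a String argument as literal text and a Regexp argument as a pattern. The sample document and the variable names are made up for illustration.

```ruby
# Hypothetical sample; the second article uses the ID from the question, which contains a 0.
xml = <<~XML
  <publication ID="7728" contentstatus="Unchanged">
    <volume contentstatus="Unchanged">
      <article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270"/>
      <article ID="5705641" contentstatus="Changed"/>
      <article ID="5756262" contentstatus="Unchanged"/>
    </volume>
  </publication>
XML

# A String argument is matched literally, so "[1-9]{7}" is searched for as plain text.
literal_count = xml.scan('article ID="[1-9]{7}" contentstatus="Changed"').length

# A Regexp argument is a pattern, but [1-9] excludes 0, so ID 5705641 is missed.
narrow_count = xml.scan(/article ID="[1-9]{7}" contentstatus="Changed"/).length

# \d accepts every decimal digit, so both changed articles are counted.
digit_count = xml.scan(/article ID="\d{7}" contentstatus="Changed"/).length

puts literal_count # 0
puts narrow_count  # 1
puts digit_count   # 2
```

The pattern still depends on the exact attribute order and spacing in the file, which is one reason a parser is the safer choice.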

When in doubt about regex in Ruby, I recommend checking out Rubular.com.

And once again, I can't emphasize enough that I really don't condone trying to manipulate XML via regex. Nokogiri will make dealing with XML a billion times easier and more reliable.

michaeltomer
  • Good answer, and welcome to stack overflow! I was looking for examples of native ruby xml parser but couldn't find one (I don't know anything about ruby). – Kobi Apr 28 '11 at 19:47
  • Yes I'm new to ruby. Thanks much!!! that did the trick. I did think about using nokogiri initially but did not want to install additional gems. If I end up doing a lot of XML stuff I will surely use nokogiri. – bkone Apr 28 '11 at 20:00
  • xboxer21: Gems are a very important part of the language; don't be afraid to install them. Most Rubyists have dozens of them on their development machines. – michaeltomer Apr 28 '11 at 20:15
  • Nokogiri is a standard part of my Ruby install; It is within the first three gems I install always. – the Tin Man Apr 28 '11 at 20:15
2

If XPath is an option, it is a preferred way of selecting XML elements. You can use the selector:

//article[@contentstatus="Changed"]

Or, if possible:

count(//article[@contentstatus="Changed"])
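If installing a gem is not an option, the same selector can be tried with REXML, which is distributed with standard Ruby installs (as a bundled gem since Ruby 3.0). This is only a sketch: the sample document is an assumption, with a closing publication tag added so that it is well-formed, and REXML is likely to be noticeably slower than Nokogiri on a very large file.

```ruby
require 'rexml/document'

# Hypothetical sample based on the XML in the question.
xml = <<~XML
  <publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
    <volume contentstatus="Unchanged" idID="0b0000648151c35d">
      <article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
      <article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
    </volume>
  </publication>
XML

doc = REXML::Document.new(xml)

# XPath.match returns an array of the nodes selected by the expression.
changed = REXML::XPath.match(doc, '//article[@contentstatus="Changed"]')

puts changed.size
puts changed.map { |node| node.attributes['ID'] }
```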
Kobi
2

Nokogiri is my recommended Ruby XML parser. It's very robust, and is probably the standard for the language now.

I added two more "articles" to show how easily you can find and manipulate the contents, without having to rely on a regex.

require 'nokogiri'

xml =<<EOT
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
  <article ID="5756261" contentstatus="Changed"   doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
  <article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
  <article ID="5756263" contentstatus="Changed"   doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
EOT

doc = Nokogiri::XML(xml)
puts doc.search('//article[@contentstatus="Changed"]').size.to_s + ' found'

puts doc.search('//article[@contentstatus="Changed"]').map{ |n| "#{ n['ID'] } #{ n['doi'] } #{ n['idID'] }" }

>> 2 found
>> 5756261 10.1109/TNB.2011.2145270 0b0000648151d8ca
>> 5756263 10.1109/TNB.2011.2145270 0b0000648151d8ca

The problem with using regex on HTML or XML is that the patterns break really easily if the XML changes, comes from different sources, or is malformed. Regex was never designed to handle that sort of problem, but a parser was. You could have XML with line ends after every tag, or none at all, and the parser won't really care as long as the XML is well-formed. A good parser like Nokogiri can even do fixups if the XML is broken, in order to try to make sense of it.

the Tin Man
1

Your current string looks almost perfect to me, just remove the errant / from around the numbers:

searchstr = 'article ID=\"[1-9]{7}\" contentstatus=\"Changed\"'
eykanal
  • @xboxer - Hello. "Didn't work" doesn't help much... It isn't very clear what is expected, and what happens. How doesn't the code you posted in the question work? It seems to work according to http://www.rubular.com/r/OTWemo0A3l – Kobi Apr 28 '11 at 19:36
  • You are right, it works fine in rubular. However I wanted the total count of the occurrences of the string and that count is being returned as a zero. what should I change for it to display the correct count? – bkone Apr 28 '11 at 19:47
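The zero count described in the comments above is consistent with String#scan matching a String argument literally. As a hedged sketch, the same search string can be kept and wrapped in Regexp.new before scanning. The sample XML here is an assumption; both of its IDs avoid the digit 0, which [1-9] would not match.

```ruby
# Hypothetical two-article sample.
xml = <<~XML
  <article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270"/>
  <article ID="5756263" contentstatus="Changed" doi="10.1109/TNB.2011.2145270"/>
XML

# In a single-quoted Ruby string, \" stays as a backslash followed by a quote.
searchstr = 'article ID=\"[1-9]{7}\" contentstatus=\"Changed\"'

# Passed as a String, the text (backslashes and brackets included) is searched for verbatim.
as_string = xml.scan(searchstr).length

# Compiled with Regexp.new, \" matches a plain quote and [1-9]{7} matches seven digits 1-9.
as_regex = xml.scan(Regexp.new(searchstr)).length

puts as_string # 0
puts as_regex  # 2
```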