3

I am trying to access all of the < spectrum > tags in an XML file that have "ms level" value equal to 1. Then, produce a .txt file that contains strings of data within the tag including bits, whether or not the data is compressed, and the raw binary string. It should then go further and do the same for any other tags in the file. This is for a project where I am not allowed to use parsing libraries.

I am unsure how to access tags in an XML file without using external libraries and then pull out data within the tags. I understand the high level plan on how to complete the task but do not know what tools I should be using.

EDIT: It occurred to me to mention that there is more to the file before the first < spectrum > tag occurs. When creating the first mzmlFileBuffer, it is only getting the first line of the entire file "< ?xml version="1.0" encoding="ISO-8859-1"? >" and I'm not sure why. It will not access anything with tags in the entirety of the file which is what I would like to figure out how to do.

This is an example of the file:

<spectrum index="0" id="controllerType=0 controllerNumber=1 scan=1" defaultArrayLength="1625">
  <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
  <cvParam cvRef="MS" accession="MS:1000130" name="positive scan" value=""/>
  <cvParam cvRef="MS" accession="MS:1000580" name="MSn spectrum" value=""/>
  <cvParam cvRef="MS" accession="MS:1000127" name="centroid spectrum" value=""/>
  <cvParam cvRef="MS" accession="MS:1000528" name="lowest observed m/z" value="451.17056274414062"/>
  <cvParam cvRef="MS" accession="MS:1000527" name="highest observed m/z" value="1199.9544677734375"/>
  <cvParam cvRef="MS" accession="MS:1000504" name="base peak m/z" value="786.59503173828125"/>
  <cvParam cvRef="MS" accession="MS:1000505" name="base peak intensity" value="6488257.5"/>
  <cvParam cvRef="MS" accession="MS:1000285" name="total ion current" value="114753896"/>
  <scanList count="1">
    <cvParam cvRef="MS" accession="MS:1000795" name="no combination" value=""/>
    <scan instrumentConfigurationRef="IC1">
      <cvParam cvRef="MS" accession="MS:1000616" name="preset scan configuration" value="1"/>
      <cvParam cvRef="MS" accession="MS:1000498" name="full scan" value=""/>
      <cvParam cvRef="MS" accession="MS:1000016" name="scan start time" value="0.259" unitCvRef="UO" unitAccession="UO:0000010" unitName="second"/>
    </scan>
  </scanList>
  <binaryDataArrayList count="2">
    <binaryDataArray encodedLength="6748">
      <cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" value=""/>
      <cvParam cvRef="MS" accession="MS:1000574" name="zlib compression" value=""/>
      <cvParam cvRef="MS" accession="MS:1000514" name="m/z array" value="" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
       <binary>eJwt13tcj3f/wPFrq3A73UM5RhfCpuR2nCm6qBzLMWyEax1sQxFiI1yIm+SYYrfSpYNjuFNtRulyyLA0Jm4bcW3KmMOM8rMpfnt8Xp+/no/3+fP9Vg8PiqKYx3vGaoqiqOnjhMZL6ZjxQm3dBKF9JJS+umFC8/EnQv0/8+gvQvuvRbh/JfUPVwuVumux9TahNScD47Lom5fH3s359P33DGYWs7/4GvURvzB3/BnvuVSF5Y5LuY/Wp/WEamIP4lE9qXsMEyrx44j7TiD2/pA4IEJodNtGHJMk1Adkss9nD/taVWJoFfV3q4VagcMy5usJtc7OQr27u9D8sYvQinyX+soJ1DdFEBfF0Od7TKhOPSm0vV4IlXCn5aKvRT2hcRCt482EynwX8idaCvUnEULTYQZzIz8R2o/QyooRqrVo713E/JRN+GA7fX3zmfc4hsop8n7F3Dl8nvuFN/DRffzrJfU4R0O8o6oZ1mspVJzbC/XmHYXqedTquQttF7TiPai7edKf0JP5h72oJ3xAPWew0Fjrx3x9f+JjI6g3DaT/zDjiK6jsG08+Jpj+/h9SnzGZfNEU9sWGyHeiEhVK/ikaI8IxGc1hMzFU+gKVUfP4nDPn0x8Wg4sW46SV1F/Es39wAnM1W+XnTiR/fTt7L6fxrgaZ5Jtk8z31QSPkKPve5DLvkIet0OyK1gdoBMu4RwF7x0lHniLf7yx98agmoXXnO+5rJdRTUPe/Qt9a1NuX0Z+OWr608BbvafIbNnBYIfakNRBqJU2FimcztNHo6izUe7YRqhaaSR2IfTrS/5a70BqO9h0PHO1J3/1e2KE387NR+UrGb9C43J++N97ko33I1xvAO28Mpl7Hn3kNtWvSoYHcvYlGUDDvdwyn/x8LqAetYk/n9cTqNvoGJfE53FO4/2M6+WuZeDObPXoe+/ej9Qz1ncfl93qaeNlF+nN/YN/uq9ydWEbc5AZ+US6/74d8jjuo93hOfwVqbaq41+cV/W/Qbl/DHSfHlWLuXTQHyrh/faHq1pA4QNrJWWg8R2uwK31+7kLlEhrXUHPpxFybbuwP8CMeNoJ6h0BiczTx0LFCe3gwe6aHkB8Uxnwg2hGzuPvxfN4RiWYlGoUL6PtkCc5ZxftcVlNvlsBc6w3kd23jTp8k8k472VeEap8U5oaj1SiDejHqDbPI3z7Inu25zFWi9fYJ+t6gMrmAfU5n6ddRdSthrycancqYf+8W8UDpSLRH3WH+xF3mAyrwojTwEfWp0iNPmXOq4s4MNJyreV8fNGNryPsrq8RcYweh4eFIfAP1YCficNR/ri+0ezakf0ojjGpCfYSzUDmA9vetV/G9oTnWlb4bHYTqTHfyC9H6b1fiuZ7sSUPdtzf9f6B+sy/9D73p/xONNzIe5sP9F368Xx/JfINA5lxQbRUs3zWFuF8Id4Mj0G0O9a4x1OejrsRS778KB6wj3z6Bft8t3J2aSH7MTuZ9UzBsF3X/TOIlB3j/+YPYOZu5P9BslEfcFbV3Crjb9Sz17iXsSb7K/IgyPuck1FNu4thbzE1He2KF/DmglVhJPk265yH7ox/Rt+MxefU5+QVV7IuX9n/FnmGoPEV7Uw0OcogT76tC5aCjUL1an3xAA/L/bEi+uTP57q5CrZ87LvUg/3UvzPAmfwT1SD/cj+Z/R7LXbYLQOoLGMRn7hjAfNZX4BCoB4fSVzqDutIS4/ipU4+i/F4/+6+mbs4X3P9oqtEds4z17Mpj7JJP9SQeZm5NNfX4u8cA83r0O1R0nMO0M9WxpWTFO+Y69WdKwEt7x0yXuXywjflDBnn6V9OU9J86pv1rcK31HqL7XXKgcbUd+e0eh9XUvofYdqjfRuOIt1Ev86C9HbZc/c6VD8YeR5K1g+udOEdorULsQQrz9Y/pjwtjXIZy4HI15s5iPX0B+5hLe82UceyIT2DM3kfmjO6h7pTC3aTfx/Qz6nPewt/Vh6nknqH9xhvwXV9lbLe1/k76WNvcH3yU/u4Lvrc1D5nui0vE57xhVRd+mGupXlDXifrKD0EhyFGoXGwiVLxtSH+RMvgh173ZC64+O9NnueONd6nEe7HvgKVTX96Jf6UP+1HD25oeQb/8xsXs4d/pHsK/LbMxBw4pGbQH9m1B/soj8/SXEucuZ+34V+z+NY+8R1MevoV/Zxp46KbxzXxr1HRnk62eS90Ll/Wyh+SlaUXn0ZZ5gziggPnCGuXOoRZ/lXs53zB8poX69lPrQMlyBZi7ql29xp8XP9IdXsOcamvdQiXxEPq6Kvr3VzEe/YG+XGvrXox3zmr2jHf8t+reg7uaESQ3R0Vmo/MtFaA1Au56r0AhoS59Te+qHOhK7uwu1UNTPeND/nqfQ3I2qX2/2bUQrypu+YT7EMX70hfuz5wgay0eyZ2Yg9TpB5BsEE69C41KwnJtEvC6E+DqaaaHc8QrnHdNnkN+K1uVP+Bx3pH1msycbtZQo7vWdQ37UAmwRy50JaEyO43uMQvPEeuK5CcT7US3cJL+3ROqlaPpvI78Mzb3JOCKFfHgm92Zlc+93VL3z2OOPVjiacTK/C7WDqObL+kVpeIHci9Z6GWei8QOqjQvJZ5zFArQuoOlSjDNQ3VaCB1F7eZX31CljPgSNo/j3f6gw5zr1q2jOucn8hVvsc68gH4XqtErqno/Ix6D92WP586xCP7Q2o7msmrx3Df0jUf93LXm3t9aKfLGD0ApzFGqH0U5zEpoFdajPbChUgxtTr+csNKKlP7YRKo1diePRbN8Wf+5I/T13nIxqJloe7+KvHryjuyd7QlDdg2YO2t26sccP7VG9iU9Lr3kzH+bD/I++5D/zQ1d/7g1CIxLVm0PIDw/k3jzUD4xGJZh+DdWF0g2otw5h/wg0k1BPCeOd/cLpD5UmoZKOdkoE/T/N4h0tZ/N9fIbKr7Pl9xWNt+aTD12Aq9BMlXERGudQK0F75efcH/AF9z6LJY5CPV7G0+OYX4FKBqp3ZfwE7cQE+g6gVoz6I1S9NtBXk0i9xzbe6Y1GkIw/QWU56utk/ktU02V8WvbdlvFD1BomMaehOVbG6TuxXgp+gMZYVBfLeI2sX0SlUvoctVq0G6TS74xqICpH0+XvSwb33TLJj0LtYxnvk+aiZct4chb7Vx1kf9ds9heh8QDt8EPs35RLPCiPu+ul19D+A7XYfPoHH8dhBeybjdp21KvRWl3I3cFnMQytg6ikFpMPOC//rkqIj0rfoJl1Cd8u4z3eqK5EowTNO6gXXiO/7haWodqpHE+j3rmC+kTUtqCdgdbFSurlaL5EK+Oe/Pt7QPzsIfu8HrFnPxqPHpMf9gd7W1cxNxTNOaitQOWszAdUszewhrnp0pG18u8UzdaO6/j+0d6B2i9oznBax94G1DMaUv9NOqcRfb7O7MlHM9oFA1yZ/xPNhm3p093ZF+EptLr3Zt901FegkSrjUjQL+wjV4f3Y29+H+bVo1/Onrxmqsfj3P/zUH6MxOoC8jdrqQOp7pRYqHkHsqUTz7hj6X40n7xFMPBPVU2jeRL12AnHUFN5ZN4S9PVCdK+MLMg6aSr8ZyjuUcObaof056ndQ6xFBfygquTPo7zCbz3kflX6RzE1CM1Lqs4D6CjSUGIyW/rCQO6GL2Ns4ljl/tBeg5riU/ZvQcomjPh3N9ahkotVxNfVUNMMSuJeM6nW00jcQ19tG3QctA81sVCplviCJvropvMstlfgM6h+kMeeeyf101MpQL89iz/x98vcyG9NQ9z3EfMUh+XuYR/4cGjX57BtTQDytEJ1OMtfuLHOfotX2nPx7KeE9yy6Rdy8l9iljfzwa/reo/5+0QTmfPwP1X++yb2AF8TBUp6NdgUaDSuY+v0+9/iPmWqEeIW3+lDmjis/1+0v6b7zivk8N9bXSOw7xYu8SR6FpolEove0kVFPrEd9qILT3NqT/NGoVjYT6E2kzZ+ptpFtQT0f1tPSVC/t+aMGeC67k/doy91Y7odXSHX9GrU1n3u3qybuGSx+g3t6LuG5v9mmobETjDxkbfbBvX/Y38qH+AjXnAcRrpd/4kl/nz/t2BhAHBhEvRjsLjTPSd0bzjsIJvK/1ZPqLp3K3ezh7GkfQH4vqebTKUesWSXwXzeXziYcsoN8jhvuBaNSiPn8R9dClxI7L+Nyxy4l1g/62a9i3YgN7Om6kz3sb8wlJ3B2QTL39TvJ6Kvl0NPx3ke+cQRySSf+eLD7HyD30h2fT98Uh7hah+fFh+m/nkg/NY88yNNOk3+bz/lkF7FmD5h7p3ULqq87yOVKlj4vZf+88cVYJdzpdYk8tap3KmC9EpfYae/1uMV+M1vAK/Bb1u2i73WPP0Ee8vwLV24/Z5/6EuE4VRqBZhtZb1ew5XC1/j19gkxruzETlH7X030TV9zV7VjquF/mDqE6rg6ebkR/iLNS+dBEqJ6WBrkK9Z1uhMQuts2hGdhLau9AsR31jN/bloX0XzSG9yUeiUtuHfte+7N3kgxEDcC8qjgN5705/nBHAXCnarkPon4dmdCBzK4Por0azVsbuo5hfO4F5l4nkfdD6SNo4hD27p9LnN42+cDSbh+N4VP0i2LsftTNo9p3BvhDUyj/j+zwVST4sivmgBdxZiub5GOozFxInxLL/56X0LV9GPnM17/xVGr8GCxPw7Y3sz03ibr1k5t+kss9tF/UWe9g3IZu5nw6Rb38Ye6Fi5vGuZ/nkfb9i35+FfL6Ak/QtO4tNz+FU1C58i91KeM86tEddIl6CuiXjlqXcm4J2ehn7xl8jLkAt9jruvUX/Z+V8nh3SM6ifr2B+fiX9t9By/1V+z4/ZW4zGXVRPPaeeXEX+KJpbq9lXg5brC/buQVOvZd9+NMtQcXhN/KlTgug/Kn2Keus6QsUb7ZEynoSG0UhoXUDjLxl3aMx8mTOuc2FvKVq3UalG4x/NhWYTtD+QTkZtJurbZP6GK/ndbbFVO6E6GM1EN+7ccOfO3k7o15m+j1BfiOa4LryjXlfmirrhx17UHXpT/7YP84v7sq+TD/dj0LgzgPe9ka4YyLyJluFP/hBqiwPYswbNezKOH4I7pCsDqeeg+iaIePco6vnB3M+ewP46EzETtSJUVk8lfwxVt2nUT6GRGMHey6h2mUH+/dnEY9E+jdr+SOLr0m5R5D9cgEdi+J6fSYct5HsIiaXeZSlxEGoX0KiW+bVxvKfHavJN1pCfjPbeBOJhG9g/H43mG8lnbOPdo5PYvxyVQjS7J9P/uXRmKn2JqJ5Gu/8u+pegMTeL/TtR+5903B7esROtdYcwB5Vf0Jh8GH/PYW+HPOqb87l3HJUa1OZ8heelgwuoHyjEH1B7+yT3Y6TDzrIv/6z8/Smm/yfU2pyTfweX5LtRuYjWMxl3KCWei7bL9+yrKpO/d9fIX0bzpYy9r+OzW9zbXI45qP+G1qDbfN8rKpnPQPsCGk9lPOUe/bseYwHqZai9lIY+4d1WNd6vln9vL3BMLVbXyr+v1+zJdtwg7jjVEVr/Qd2/kVCNlr5Cs0lj+tNRK3WmP8YF32ouVLqg/Q0aXq7U17bFVu3Y64X2GRn3ducdGzrR9z9UhnbGIE/uHuhG3/9QH+dF/h7aOX147y8+3P9uAHn/gcxt9ad+OYD6pCHsnxjEnkfSiFHkD6K9fQJzH03kvZFoG6hkTeU90dO4cyyCvqAZ3IlFMxXV4CjitWjViWG+HP/+fwR7oxdiEdoNl1IfhPYvqC1aRqytxj/RarmGvVvQ+AbV6Rvoe3sj+WRUwpJ4V6tkYm9pTbL8/lPoX55K3GEX+kmPZXJ3ShZ7dqM+eQ93l2UTLzzE/V1oTD/M/L585pZ8Rf+DAvlzKyS+jrpxEpNOM1dUzJ7zqO08x/4WF8i/voRmKXveuk79JOrWbWLPJ9x78JS+0mqsfcH3MPg1/Y+dNgqH1hHa1Y2E2uTGOBvtay5CdaGrtC31xe5CJbAzcWk39vX1It/0X8RJPakX9sOlA4VWDurf+xKvCODejlHky1DLmUj9JholM4hTZwnNH6N414H5zC1ayR7HZPJjsvBcNu/65wXiPZfYE1TKHZ9r7J93m3hYJfvyUb1yn73XH7Ln5GPmdzyhv+h36kXPqJ+vZn7qC+aT0LpTQz7nNXMNHDeJ+m/Sb+oIzSuN8V/OmN1cqPyE5iFX+k+0I34mHeOO6Z2E2vedhfozVD70xEnd0KU7/b/3Rdf3hfZfA5nTA7jTZKjQ8pQeCka/CPoGGdzLiSP+aTV7JqzB/mlCw9pP3ese8QqHzeIdDZsJ9W9aC81pHYVGaBeZf1eohXkI7R1exP/uRf+m99nzUT/mdvrSFzqKvii0oqcQW1PpXzCd+c2LuZO8jL7P1+KreFyayNyq7cRxO4VqWCp7muaSd/gac46xL/E0fWUXyW/6jv7E73mfw1XuJ6NuXif/TDrwJvXBd/hczX4nf/85tvg/9j18zZ00hy3inRl1hcq+BkL7SGts6io0u7oR10pzumKKh1BN6kt8vj978ryZy/UltiZSD4kWWkXzqZ9axPy3S7bwc1spNI7HET9ZS9/prbzTfQfxeyns8zDJT91H/+WD5K8eYc+u0/TvLaV+9TL9f9m8y7xP39M2W0V8tZ1Qq3ITmi/7YeYQfDOC+qFAoeEwWqh2Hiu03xkv1F+j1XQSe9/+kLr3ZKwbQt59OpbPY2/jBOInG9nbfDP7/pnIvvg06l5ZvKfmAHPbDtLvcIR8o1zutEGj6dfMqyfIuxex98rf/j/gwvqS</binary>
    </binaryDataArray>
    <binaryDataArray encodedLength="7872">
      <cvParam cvRef="MS" accession="MS:1000521" name="32-bit float" value=""/>
      <cvParam cvRef="MS" accession="MS:1000574" name="zlib compression" value=""/>
      <cvParam cvRef="MS" accession="MS:1000515" name="intensity array" value="" unitCvRef="MS" unitAccession="MS:1000131" unitName="number of counts"/>
       <binary>eJwNlndAD/obhUsahJJRlMpIRrIJP9s5n++3a2VmRRlZuUgidEVXWbeiXZLSHpSSpvYgKtIwCpWKrIQI/fr7/fO85zmP/rN2OCRoSbdtVxDV6s9g0+0azi2SkWxp7cYWGy+hWvGaP/5W5sfsZsqEdSJyvidMwnLwP7Wn0Pr7DK9+mCyOrHKDxyYF3qt1x/Gb8fjeLxvKmQo8avkTbmnzJYrybvR3cIVxeyAs5D+g4oQdKgZqi/bYH1AM+Emb/is5LfcWEoxlGT3yX/io3MaW+S7oNtKPrz5uZcgjGUYtdcUWuevQ3PCJq1Ujcbo8FBu+ClFOM+FgMYP9vvghssIFq1/+i8aNNzAipb+YquCApy6TxR9LJ947MpFVH825MfKOkLt2EpLCeugV6LNi2Tloz9hJ3ahg7HxxDFV7XiDQeKMkY6SE+cGeyNHvJdJ+vqVKRQOMVl4QZu2/aPoyEKU7/eE85jkM6o9D5Z9cjOgThkUmmVgaIie2TP+X9rrqbD3YibJtnvyjHo51XtXsqD6LJXoFtM2oxr6aJdTRqUM/UzN6DL6AfR1JXDLSBfua33GBnjJzYz/BrFcB+v5zG5LroWJowU9oTAyk7ZMkyma24MtydwZtHs/hj4MgcTwHb6tVXH/7FmYWq4rULZ441T6a82YuIZ8FwTv8NGx7xgoLmzqWdJARDecxUN8Ja/QvwNz6N5JHX5FGGqtxflAO/th/leQayLCv1V4+aLEUf+VcwGbTevFQJw5xW0rx6Ps7luy5BqsdzjDNDUbp3Ep0+ETxw2t7akzbzqdfovHP62KII5aMslWQ9B+WCYtlo7jqcnexlxrcYBkCp8OR+J0kw/nfL3DbASdOk7yFRmcqCjNLYLRtMGcZeTAZVXR+/lW8+ieFDhO687RDI+bI1TJ+1mBOnjJYqjw2QnzXmivRzornHZlgNtemQCcmCSss3uLd1wrIOm3h5lHXMHacL/QDzmJb5x8+23KGuYdykecbg27bx/CZdzNDFoXi/BEJ91ZHQFE9CBvzndF9kwtq9WLRIzKLxyLSuHWODK+q+OFXVaTk+Lo49Hl/XTSb1ojFTfkMr7WjzuUsbBkaDFOvPaIiLQSBHrcZ9LMv/wSl4vT4K1g53B1N8zyheV9NjL+8lQmxzjjxpRN5+zwRbXcWW+97wjRck7a1x7H4fBR6nXHFvmmZkPZrF01DJrPSM4+bV17Ee895orrlGdoKVcTXMWEQWofoG1eJw8Vx7N2vSth9Defz3ta4PMgNMg6J1HacwFDZ59jY6cVgoyA0rT7JZ4E38ML+pWgJ8eCWp0vEyG+tsLENFZNG3EFgo6L4WVaLT07pome7Op+X/wu5uefR/cQTTurXj7Ney9Jp4BLO1/LHkKVpWOY0gTalrojPXEizIfE4dsMXU0IV2BofzEuzE7B773UR7pUP3xU+MFXW5N5eLcwzV+XmzmnC5b90fP2hIrwKE7C9fyCHv7yFqLr5optDJn7vDKPqOme2LvZC631HRm2IRUDEBZxYL7hQMxXBPWPhOXc8/9xWZvLJnQx/s5FmZiEslRfcYe8JrbJmjNVSpa/LQ4Q2GXLli+l880aRTQrxmPBChreClZgdcwHtuudQ9T4SqjWRTLP0Ruz3Z7jbCvosCcLa6DZe3+oBs/DefBDXk8HFL+ClWEmTZVPY+HiaZOmlvZLK1BEMmKQtVu9yFr4LO6D0owI707LFxSHyYsoNPyhfToVnsqo4rJiFSyfd0LYyH9aH4kVkmaxk+Ioc2AcPE/nFtTQOHCuaQjbS+cdUmisHQcNIgY7PUzj8dTtyL3Twwl1tKm5zh19tOCbZWLGmmzyPB5UjoX0brU+W4fy6u5hufwcl7gF4esOXvjN8obkrFpOWF2Bl7nn4K+wVX81cuWiFMg9dnMGFjV6IMc9Ea3odvg8Yw1luScjzjIHnu150WlMOq1uB8OpP1i7IYdtgVYmSxzxxXHIXN6MVxfG6SEQ/lqfM8hsouDyDx9JfwGJlC6qPO0DeIhMahhrsaKlB29dwbHgQAOXQQgw2PI3nKywlbSs/446aROz/OkRctJvCh/fSUP/UD17r3NmyugELDO6LEv18LGjpw6D/xaDHvhqGfr6Ne+flqJlygrF+hQh+l4Yj4zxwKrmLMQ+zcD4nACFzB1Blsw/6XNVj7ahzdEyIwmij6zijrs4XU89CJakRFSWa4k7fcOTdT+FauUSsW1GB3aqJmLY6DSXew+h3YJQkfsFDnHL1QbLvQWERnwikr6JRUymc52vz1tkf8M8P6GLUWLbv88K/vzSo3J6C15uiMdVehnIyIVBQ94LN8SjkqhkypcAHSb0eI6X6K3bmxqJ2wjMsHR+PgRvKEfcuGg22XlBbFwBX54dYmZaEw/Ut6KlnJXHwSBUGW/1QqDCNDvpDOabbEwbMTECR3hb+isvEatVzYsplLwytv8t7juHwtTVgunkScrvufwf5Y83659hZ2IcNqY0oq60U+9o70aFIsbWvpphrrUTfpABe8omXXNbOAP7058lfLmhTKBR5c8IRtk6boX262BVpRiOfFixUCOdBXxdkjcrF+OhCKKZ359HYa9A51J97133hCTe/rp0JQ11PL1qNvoAW7Tli+QRPDPh2Vtxvu87B+0u4JWCMpN91Jf70K8bp0/tFtx25+BO9nNfmBXCFQk+xVrmLdwZJCFrfhx+zAhC7LUlkJ19D0bjZ1NlWjJDWNsQ7zhLfV2ZALXQEc3SuIsEkUdzIzETfs72FzLV8rK3vKRQ+56FJdTHj2zcxcYw2Lxe6o3aTN8zlrZiaIs/pf0VB5tEjXBn0WHJGVlMSOTUR6laK3Ka5lqttndist9k45kcLK2+f4tDr1fD8bUi3iMEcoalLvehi6T/zXiDe4SdMcVYs3FwlPV0eSMcOI842K+Xvj9ex29iTPb4bSlU2nkU3yVO4vjuHMvl6MSppEfuvy4XzNE8YPo7C4+GVVDU6Lu75KvHL14tMtYnlhzVhUHWJhYPdO/GisFQM9oqjyq0qXvuhKQYEThRKBjcx+5EGv85VY+ak05Lpmslw/qTA7uoKNJoXLzLPhiM+I47uX1bx2iAPMb3fUKbeMmJNQwmctRREyZY0KgwrwOS1M8Srid6YPlSPT3IcWI7nmJrUm15OxTCbk4fwF8O5w3acWN6sQYMxDUiO8xATbyVCXs4LRa/jEJ8wRWh0lovIcUHwt6nk6e9tvOacCLNv1fBZGiX2VCTAc+dIcW5rJDrfdRfpMzYx/kUenPuGwskyRzqkuRsXD87HSP9srH4TiO3jY6AyR1VqvleL67Xc8NHmAfrJ+jG8fKixzMAIKnvfxCztKMTsyEbBnWLoD7qCwxM70aPHJ4Tu9JS+qv6NN34LaeA8xPjRk5tYbf4EVX3tOKvTUzrdphXe/vfxuE2HOe/38tDjnhxqOZGWY/vz9cKRLDGN4MtR2hx9zZuiOIKnZfONTXMpglSbabYllnsOatA+bCFTM/uIY3n1KPoyn0rGJxE/MgnJbxVZfcnIOPrDMxa7Taac3BckDv+Etykt6HX7nOihHQx98RjyplrijOE47oh0R8JLFwwJeoFpBZuk93695wunJpQW2dHokztM/tyUPO0dA+2pKdg8PV6c/j5EDPhRA//yqSK+oobFvImCkxVUWxEOteJD/HmqGxtH/YbVw2Rw9XOJXeMvNr42FP92ddXkyzhJydMktv73BLoj0llufhMjptzAoBW3hf2rMH6c3OXWI9TExh3ZXLFhIzcYkNZ05cyWMgbWxyO6+gdmJR+g+q5wJNy8g2jjcKwqCEX2W1eMuZ9N9Tfr6Hy+BnduF2OYqgxTj5Thhk8l3x1L7Nr4APj08YL01G20mcTwVHA8JGerIacXA8fRRVCft5AK83S5KTKfsy+EYu2GLLB7BzLmLuBitYO8/uGLUF9Vi4BKCx44ulnUqMbih1UstDeMkboZT5ckWGbD7r2Eo5YVS1u3juTu4b1oUraI3UY64m6dI1pOzJKOP+KGUO0EHFOxo8sbTeN1XoXoPTMD7obbeXaxDA1DLkn7lxXhg29/5r2vxsbeSWgrEmwwiMHIqPvC0j5VahE9l1m6pVgq14Arz3Jx7Ekb7Gb1F8tD9aW9woZwtvdVWPyqwT5Ve/6pG86kmu6SiF3KIi5cnhZaYZh14xjde3sIr9etbLGuxZAT3uJbZL64v9AXWjFDxdwnkVw0KBezxxfAb97f4q8hmmKxehvOXtnF5MtbxcV/bvKZMKdy8SzuffcFm3q/glEnxLnkANx2Gc9uCmdofsaFu/KSML48AG/HKzHkoRFTBylQZYefGDhWnXObWqFnqShqj1XhROZ2Jg0K5MvnxxCadFOsNIlHdsod1B1qx/iAOczfNUhAeSstNqpxcv+LovhTJhQ7Z1Ht1FMeGPgc8xcNFk8UkvnSYxM1NXzY8ngxX7VF0Ff/Dl5le7Jw2RWoF3lC450zFYeFwUruMiwCXHHHaY9EdoYbpJf/Q3TLa+y0KcK5HFfsD48Rq0b4YL35OOmUVbe6vOkFZ2VfR8QGR4nKnFBazlfkLx0ZqUNHFZQeL+D+xXkY+7AGGrJTJOmGo5jX7I45eUeR6BML6+mxGGT3FUrOY5l2MFDqXfofzU84Y4bncOnOu00o/FGHpLZ4rls6VhofH0n9/WOZrtnldw88JUWnSnC/vjd/DQ6H97zT2NXTVly0aUDToApY53flP1OTpn59qWvnB78HZ2Hg3AITvGRrxjlcOPMQR0cEcXbuFSgfaMTjc+uEU9MNpPvn8aHhPdSuD0Hy72y4tqfiiMxoSf5+R8w/FCkW55cgbWIvbio1IEaFwlYzTHJF9RkUfwj6m2hL6n0u4coEb8SJbInKx0lCS2e6JHekAt8132fj36skS91MiGpvHF1uJpQe14sXbUN5SmOHcJjXSouyAM5q6Clifpaj/O5f4tceVf72DceOyGA0rmzl1OM3xPsFH/k63RPDOu9yQ4GeqDkTys5lyly+a5yk7cIK1v2VCScze3Gx2AdpTtVovF2PJUHuIrTyHANSR4gj+am847qXV4elQN6kBbdf2vOb0nVcntWHNzNewb9yHe3PzmCPT058diMFM/Vd6JZ7BtX2SymjW4IUA31uqvbBqDlv8NmxDdd79aXSmFgYBQVgY49ATP5gR8uLIbAqMxGHdshxWvdUHJxtwR7yg8SbA+nY0f4F57dDmnxoLyNe/sE2t/uS/xQ+wq3sNiqfnJKOdpjPK41ZyO+9mSkmppy2XE1q8TQedgtO0mrzUBr8ljE26PLgkaU5MLn4S0zKzUJAvzL0zPaWdu7xEtst23E67Zp0kNCgtcUTQn+i5MOLGdIfy4NQ+Eadc8LDUKl4Bycy7gid/+KgNron7+b/K82Tj4GubLpkbIuBdImKGo/170d9A33J/35oSBTvxsB1+1hJ9DQ1ETftMwpdvolFJnrcuDAUawbPloz/2QCdNTHCoX44v444SYW9c1ltq8uBf+rx3luNfzIS4Dg5AgeM9JleXMC1hv0ZkXqVdkfn8uJRDzqvsuX65kRU+erxZNA9bH+aw6yAUtjOC8eP1Eo0WKtxWno724tXU2ZRgcS+YyWnLVzLHVIl8eGBOzu7LZH8L6oKy/ffY/smPSl3X8GAwb4i+8RorpPESvLn5OO/2AZ0pmZisZOmOGw3wNh8Sgatn2fAI+OSWKmrIlZK19D/+y3pbP0potwmnsEzJ1HrkYax01h/xkkk7DvkBZR/vRV15z5yyYeXKBkSKy1NMxFOB1yFd40Dx1w4zTW/KqAyfizVxyez2Gq3cfvJd+z3S5MHDmmweftDLguvk0ys0qT3797GTeNy8VjxM8z/mybZmJMq5Jpapd/6XODM5/dhHVQA9eZSybEFF6V2S4PxsshWQp1MaX53dwRYObHuwGxp/yu5uGJjLh2UmQ2Tvv68d7SfuDg9EFMvnUNHcJak8Ngqyc/mXpKDxQ/wwiqAStJgydDgcpESdQeJ5tkw918smVA0n8WHu7OH7ltEnU/BEo2dksQ9m1lbbsg9w0uoWCYn6TsiG/MHGwpT0zA4bdzIYwXaNHv4i7XDRnNO01doKVUwcVk/hkW+R+dQbx6K7kv3HQrcftGXdjk1OGPizXHZi8TemZPZLTEajav0RMPJszwXlsiGjJ7MulZFt8ej6JOiwM97jMSwm8p0H/RQdE85zan1Skw5dZpaazogVXYUXocXUsdNwgVy4K3Gh10/pyxZEh1KETyDUo/tPJkZhy2fo/Eo8qS4127JhaqmwsZwMvsaFuDJ1otsk5kvWV2US5XqPmLI1/60kajyUUEJJl67LZZOMmDdfgcxfUE5wgcN4Je7riLHfhInWO8QGrLX4Vh1FTED6mFRsYVR7cvE9cBbcFY8Lzp6ROLQ6npEXdVgW9xEseHQNvEjuBImlwvxYNwmNoz14Kj8JcLFtGuz3LV5JGQupwojBj5bJbKNPsPC8j23zdouzgTFQD5FwnilJeJQlB4nKI4R3582U02+HbL32hA3XZN3Cnbwk2krVY2fwjZfl2uslRk+6jtnaDTySo+Z9G5Mo+luWR43N2WVjRM3GD5Bqok1ywZcx8KCxQytaoW02oMz717liXotyg99gJiPi6gwPByeYzOoqzGCgyzL0NdgASNHe2PMaC+8utuCu2s+8p+fsVi3cyBrJmVBBL7Cv026PGqrQMWFMfytbcxP8f359FI2ZA8n4bx+N2aV+WJ6hor4OMuDDUc3cry0GHMUE1B3eiktnsdg47OHyJ8wlZZ9ojAt6wR9Hc8xyMUT6mqyXKc8gI/2T+e+xkcI0noHnfq2Lq/UFAdmRmC3JIsTzqdx4XkDar8/w6tTv8Pg/FAaj9XmwWOplL2aCL6rRHFzAM0WybPfX3/z570hYvPIEAwK02U3r3hoVYdRKvyxV6sUVoYPsEQ3mnrrszHcuwMdJx05QiET2aqPEfKoBbbF51H9LR51hqP501WNk8yKsPVKIv6nUY60+344Z52DjMXJGOXViw+2WfIi+lBH5yuswz1QMykRmuOqoX7lDnbuyca9r6P4bpwf7rmvpu4qTzzpWYFXmx7AcXwWFk3X4KcgHb47IEPfU0d4vGcdduyuRP3i+5jf3YVTL2pzjrIMfy6Lxk+zRCw1CoO7jhkTFv3N+B43YfWxj8RdbwZP74mFpo1rVyWaMML5BvaNTMChDGcmueXglZUOQz+0o+xQHOVv3EfzDEUWBcxl1MFsYmA+nq8NxbtjPvy9+zomNhbBadM8DrOUp8ycGMgaj+GW1nDol9ShY99H/Mqax/NF5DNdRW61+4nDRbF4eyITsv1DsNUjEDXOVThV0Yhxy/LQ4hCBWNcbmL9Hhu82xWNyqAEHt6Z0bd8tDA9bxcMXtokRm5VE6X/vIROvxbirPpj9ZRPnbv2OcZZKfNici5ZvEaibcAoTunfimnIJVhu+xdyGDuycnIFwX3km/QpH8Kc17G25gheOL2ROaSEG30qA9UsZTtl6H6dW+KP3rDn8JSsrauob2D07AqVT5XmoyzWU0gKwObQr60INfkt+hC8Dm/Dd1w/p5jFIHbNLOMf2FtlF5vz8JQp5nbE4fFCXlkxAT+cM7HJ0Q47tHNY3ZKCvXwoytNYzOOg+rnbl5DpjiJj4r4wIDdtNtwkZCNlTgd3KcnRp6/I1k3TunV3JN8PGMqXrv1PugiahHxn4tz96hbui9loM7k5W4Ibc8/B0Osref6Vhj5w1JbPvIp592aOkHt3WJuLzzUSsbgtAg7GCuJZTxQTlgeL48+u0VwvEjDF9GVI8gdvqn6Djr1E8GPEHrR7x0P3kB23lS/jH3g3fpbUsGjKaO7NGUWXOD/hGvcTtvAjmVXpwFeZz/IB0/B/ml2Pk</binary>
    </binaryDataArray>
  </binaryDataArrayList>
</spectrum>

This is as far as I've gotten with my code, where my next step would be to use strstr to look through the spectrumString for the appropriate data I want to pull out:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void) {
    char mzmlFileBuffer[23000];
    char *spectrumPtr;
    char *spectrumEndPtr;
    int indexStart;
    int indexEnd;
    char spectrumString[18000];

    FILE *fp;
    fp = fopen("32_64_compressed.mzML", "r");

    if (fp == NULL) {
        printf("File failed to open...");
        exit(EXIT_FAILURE);
    }

    while (fgets(mzmlFileBuffer, 23000, fp) != NULL) {

        //FIND FIRST OCCURRENCE OF <SPECTRUM
        spectrumPtr = strstr(mzmlFileBuffer, "<spectrum ");
        indexStart = spectrumPtr - mzmlFileBuffer;

        //FIND END OF SPRECTRUM
        spectrumEndPtr = strstr(mzmlFileBuffer, "</spectrum>");
        indexEnd = spectrumEndPtr - mzmlFileBuffer;

        //CREATE NEW STRING BETWEEN INDICES
        strncpy(spectrumString, spectrumPtr, indexEnd - indexStart);

        //IF SPECTRUM IS MS LEVEL 1
        //if (strstr(spectrumString, "name=\"ms level\" value=\"1\"") != NULL) {
        // 
        //}
    }
}
chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • 4
    This is going to become a mess. Why not use an existing XML parser? – 5gon12eder Mar 06 '16 at 21:59
  • 2
    If you can't use external parsing libraries, write an internal one. Note that `` can occur in a comment on in a CDATA section, where you need to ignore it. – choroba Mar 06 '16 at 22:07
  • 2
    It is possible to write a quick-and-dirty XML 'parser' this way, but you need to take better care of the basics. `CREATE NEW STRING BETWEEN INDICES`, then `strncpy` – `strncpy` (quite infamously) does **not** automatically add a terminating `0` to the copied string. Hence your commented-out line fails. – Jongware Mar 06 '16 at 22:32
  • @RadLexus The first answer suggests memcpy instead of strnspy, so I will try that instead? – Katelyn Salem Mar 06 '16 at 22:49
  • @5gon12eder As per the instructions of the project for a basic C course, I'm not allowed to use parsing libraries. I guess so that we learn the basics of C first. – Katelyn Salem Mar 06 '16 at 22:50
  • `strncpy` is deemed unsafe because "The strcpy(), strncpy(), stpcpy(), and stpncpy() functions are easily **misused** in a manner which enables malicious users to arbitrarily change a running program's functionality through a buffer overflow attack." (`man` page, my emphasis). Personally I don't have an issue with using it correctly. Incidentally, `memcpy` does not handle the zero-terminating either. – Jongware Mar 06 '16 at 22:55
  • 2
    Then this is an extremely poorly chosen exercise and I feel sorry for you. There are many interesting problems that can be solved with reasonable effort and a basic understanding of C but parsing XML properly is not among them. – 5gon12eder Mar 06 '16 at 22:57
  • @5gon12eder I appreciate the sympathy. It is not how I would choose to learn C but alas, I do not make the projects. – Katelyn Salem Mar 06 '16 at 23:05
  • @KatelynSalem: `memcpy` can be used to copy a string fragment to a char array, it will not null terminate this destination array, so it does not make it a C string, it is your responsibility to set the `'\0'` at the proper offset when you are done copying fragments. It is also your responsibility to keep track of available space in this destination array for the fragments you copy. The C language does not automatically prevent buffer overruns when you try and copy bytes beyond the end of an array. Undefined behavior happens then, which means anything can happen. – chqrlie Mar 06 '16 at 23:28
  • Tell your prof to eat horse shit. This is a terrible assignment. I've been coding in C for years and I want no part of writing an xml parser, certainly not in an entry level C class, my gosh. I can certainly understand the merits of rewriting code to see how it works, but this is not one of those times. A much more "real world" scenario is interfacing your stuff with existing code/libraries. .. that's what this assignment should be. – yano Mar 07 '16 at 04:02
  • You can interpret this problem as "write a full XML parser" and then the problem is crazy. Or you can can interpret it as, "read XML files that look like *this*". That means you can ignore most of XML's dark corners. That's a LOT easier. See my SO answer on how to write a recursive descent parser by hand: http://stackoverflow.com/questions/2245962/is-there-an-alternative-for-flex-bison-that-is-usable-on-8-bit-embedded-systems/2336769#2336769 This kind of answer will work just fine on an XML file with a known tag vocabulary and known structure, and it would make a fine mini-project. – Ira Baxter Mar 09 '16 at 15:34

1 Answers1

3

Writing a fully conformant xml parser is a vast and complex project. Your approach will work if the actual format used for the file is simple and regular, but any change in the presentation will break your simplistic parser.

You should add consistency checks everywhere you can to detect such departures from the expected format.

Note also that strncpy does not do what you think it does. You should NEVER use this function. It is not the right tool for the job. To extract a fragment from a string, use memcpy and add a final '\0' by hand. Incidentally, strncpy will not null terminate the destination if the size argument is less than the length of the source string.

Depending on how large your xml file may be, it may be simpler to read the whole file into a buffer and parse it from this buffer. Reading one line at a time requires you to implement some form of state machine to keep track of the pseudo-parsing stage.

Can you share with us what reasons prevent you from using an xml parsing library? There might be ways to include the parsing code in your project and stay within your constraints and those of the package licence.

EDIT: since you are in a basic C course and not supposed to use external libraries, this assignment is quite surprising. They are probably expecting you to write a quick and dirty solution that searches for tag strings and " string quotes, extracts fragments, possibly converting binary encodings such as base64. It is difficult to make this work in the general case, but for a specific given file, you should be able to produce a working solution.

chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • The whole file length is 4300000. I'd like to just parse from the whole file, so I may change that as you suggest. As for using parsing libraries, it is simply something that the project specifically says I cannot use. I'm in a basic C program course, so I assume the reason is just so that I learn the basics of C first. – Katelyn Salem Mar 06 '16 at 22:46
  • 1
    I wouldn't say you should “never” use `strncpy`. The use-cases might not be frequent but when you need its functionality, it is a perfectly fine tool. – 5gon12eder Mar 06 '16 at 23:00
  • 1
    @5gon12eder: since the OP is *in a basic C program course*, I insist she should **never** use `strncpy`. She can look at the man page to understand the shortcomings of this function. As for genuine use-cases, they are so vanishingly infrequent that one does not lose anything by never using this function. The added benefit is for the casual reader, who, from my experience, cannot resist the urge to see this misconceived thing as a *cool, safer replacement for strcpy* which it is absolutely not! The less they see it used, the more they see it debased, the better. – chqrlie Mar 06 '16 at 23:22
  • 1
    strncpy() is completely useless (exept for people who know what they do, and they might find other ways) (similar for strtok(), BTW) – wildplasser Mar 06 '16 at 23:22
  • ` it is simply something that the project specifically says I cannot use`. Find another project. With requirements like that, this project is doomed to failure. – Michael Kay Mar 07 '16 at 08:27
  • @MichaelKay: the OP is not working on a *project*, she is working on a assignment for a basic C course. Not really *basic* in my opinion, and not really useful either, but she does not have a choice. – chqrlie Mar 07 '16 at 08:29
  • OK, then it definitely calls for feedback. "do you expect me to write a completely conformant XML parser, or is it OK to handle some subset of XML. If the latter, can you define the subset of XML that we have to handle". – Michael Kay Mar 07 '16 at 08:31
  • @MichaelKay: yes, if she can interact with the teacher, she can do that, otherwise she may have to produce some code that handles a potentially trivial subset of xml along with the adequate explanation: *I have discovered a truly remarkable way to parse xml files which this paper is too small to contain* – chqrlie Mar 07 '16 at 08:36