
I'm trying to figure out the best way to accomplish the following:

  1. Download a large XML file (1 GB) from a third-party website on a daily basis
  2. Convert that XML file to a relational database on my server
  3. Add functionality to search the database

For the first part, is this something that would need to be done manually, or could it be accomplished with a cron job?

Most of the questions and answers related to XML and relational databases refer to Python or PHP. Could this be done with JavaScript/Node.js as well?

If this question is better suited for a different StackExchange forum, please let me know and I will move it there instead.

Below is a sample of the XML:

<case-file>
  <serial-number>123456789</serial-number>
  <transaction-date>20150101</transaction-date>
  <case-file-header>
    <filing-date>20140101</filing-date>
  </case-file-header>
  <case-file-statements>
    <case-file-statement>
      <code>AQ123</code>
      <text>Case file statement text</text>
    </case-file-statement>
    <case-file-statement>
      <code>BC345</code>
      <text>Case file statement text</text>
    </case-file-statement>
  </case-file-statements>
  <classifications>
    <classification>
      <international-code-total-no>1</international-code-total-no>
      <primary-code>025</primary-code>
    </classification>
  </classifications>
</case-file>

Here's some more information about how these files will be used:

All XML files will be in the same format. There are probably a few dozen elements within each record. The files are updated by a third party on a daily basis (and are available as zipped files on the third-party website). Each day's file represents new case files as well as updated case files.

The goal is to allow a user to search for information and organize those search results on the page (or in a generated PDF/Excel file). For example, a user might want to see all case files that include a particular word within the <text> element. Or a user might want to see all case files that include primary code 025 (the <primary-code> element) and that were filed after a particular date (the <filing-date> element).

The only data entered into the database will be from the XML files--users won't be adding any of their own information to the database.

Ken
  • Depending on the file, I don't think it could be (easily) done in node or many PHP setups; that's a ton of RAM if you really need to parse the XML... if you can "scrape" the XML chunk by chunk (if it's flat+linear, like an RSS file), then about anything would work. You break off a chunk, parse the little piece, then update the db with a newly inserted row. – dandavis Nov 13 '15 at 23:15
  • @dandavis Thank you. By "flat+linear", do you mean no more than one deep? – Ken Nov 13 '15 at 23:24
  • Essentially. It's where the discrete parts start and end that matters. If there is whitespace around the tags and predictability, then you can reliably parse pieces of it. If you need attribs from "up top" that happen to be 120 MB "to the left", it's a PITA and you need to work to connect the pieces. But consider RSS: I can chop a piece from the middle, look for the first "" and the next "" and have what I need, provided I cached the channel info up front... – dandavis Nov 14 '15 at 00:59
  • @dandavis I added a sample of what the XML code looks like. There are thousands of chunks that start and end with ``. Is this something that can be parsed chunk by chunk? – Ken Nov 20 '15 at 04:06
  • Why does it have to be php? Is it because it'll be part of a larger system/platform that is in php, or something else? – Rcynic Dec 21 '15 at 23:14
  • @Rcynic It definitely does not need to be PHP. I would prefer to use node.js. – Ken Dec 21 '15 at 23:29

3 Answers

All steps could certainly be accomplished using node.js. There are modules available that will help you with each of these tasks:

  • node-cron: lets you easily set up cron tasks in your node program. Another option would be to set up a cron job on your operating system (lots of resources available for your favourite OS).
  • download: a module to easily download files from a URL.
  • xml-stream: allows you to stream a file and register events that fire when the parser encounters certain XML elements. I have successfully used this module to parse KML files (granted, they were significantly smaller than your files).
  • node-postgres: a node client for PostgreSQL (I am sure there are clients for many other common RDBMSs; PG is the only one I have used so far).

Most of these modules have pretty great examples that will get you started. Here's how you would probably set up the XML streaming part:

var fs = require('fs');
var XmlStream = require('xml-stream');

var xml = fs.createReadStream('path/to/file/on/disk'); // or stream directly from your online source
var xmlStream = new XmlStream(xml);
xmlStream.on('endElement case-file', function(element) {
    // create and execute SQL query/queries here for this element
});
xmlStream.on('end', function() {
    // done reading elements
    // do further processing / query database, etc.
});
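To fill in the query part, one approach is to turn each parsed element into a parameterized statement and hand it to node-postgres. This is only a sketch -- the `case_files` table and its columns are hypothetical, and the element layout assumes the sample XML from the question:

```javascript
// Turn a parsed <case-file> element (as delivered by xml-stream) into a
// parameterized INSERT. Table and column names are hypothetical.
function buildCaseFileInsert(element) {
  var row = {
    serial_number: element['serial-number'],
    transaction_date: element['transaction-date'],
    filing_date: element['case-file-header']['filing-date']
  };
  var names = Object.keys(row);
  var placeholders = names.map(function (name, i) { return '$' + (i + 1); });
  return {
    text: 'INSERT INTO case_files (' + names.join(', ') +
          ') VALUES (' + placeholders.join(', ') + ')',
    values: names.map(function (name) { return row[name]; })
  };
}
```

Inside the `'endElement case-file'` handler you would then pass `q.text` and `q.values` to a node-postgres client's `query` method; parameterized values keep you safe from malformed input in the XML.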
forrert
  • Since it looks like your data does not have any relations (flat records), you could save each record in a NoSQL database (e.g. mongo db). All the mentioned queries should be easy to write/generate from user input. Data IS stored and accessed in JSON format, making accessing it in node.js super easy. – forrert Dec 22 '15 at 17:45
Are you sure you need to put the data in a relational database, or do you just want to search it in general?

There don't seem to be any actual relations in the data, so it might be simpler to put it in a document search index such as ElasticSearch.

Any automatic XML to JSON converter would probably produce suitable output. The large file size is an issue. This library, despite its summary saying "not streaming", is actually streaming if you inspect the source code, so it would work for you.
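For illustration only, here is roughly what such a conversion does on the flat, attribute-free XML in the question. A real converter handles attributes, CDATA, and streaming; this naive sketch does not, so treat it as a demonstration of the output shape rather than something to run on 1 GB files:

```javascript
// Naive XML-to-JSON sketch for flat, attribute-free XML like the sample
// in the question. Repeated sibling tags become arrays.
function xmlToJson(xml) {
  var result = {};
  var re = /<([\w-]+)>([\s\S]*?)<\/\1>/g;
  var match;
  while ((match = re.exec(xml)) !== null) {
    var tag = match[1];
    var inner = match[2];
    // Recurse if the element contains child elements, else keep the text.
    var value = /<[\w-]+>/.test(inner) ? xmlToJson(inner) : inner.trim();
    if (result.hasOwnProperty(tag)) {
      if (!Array.isArray(result[tag])) result[tag] = [result[tag]];
      result[tag].push(value);
    } else {
      result[tag] = value;
    }
  }
  return result;
}
```

The resulting objects (one per case file) are exactly the kind of flat documents a search index ingests directly.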

mwhite
  • No, I'm not sure. That is one of the things I'm trying to figure out. I'd like to allow the user to search the data by code, serial number, name, etc. The xml above is a very simplified version of all of the various fields that are in the actual XML files. The goal is to allow for quick searches, and to be flexible with how that data is eventually displayed. – Ken Dec 19 '15 at 01:01
  • What type of data do they want to be able to retrieve from the database? Are all of the XML files in the same format? Are there other relationships or fields that aren't apparent on the current example XML file? Maybe some more data examples and a description of your planned usage of the data would be helpful. – Taraz Dec 21 '15 at 22:06
  • If there isn't a field in one entity that contains a reference to another entity (e.g. by ID), then you probably don't need a relational database. – mwhite Dec 21 '15 at 22:07
  • @Taraz I've added more information to the question. Hopefully, that helps. Let me know if you need more clarification. – Ken Dec 21 '15 at 23:30
I had a task with XML files like the one you describe. These are the principles I used:

  1. I stored all incoming files as-is in the DB (XMLTYPE), because I need the source file info;
  2. I parsed all incoming files with an XSL transformation. For example, I see three entities here: fileInfo, fileCases, fileClassification. You can write an XSL transformation to split the source file into these three entity types (in tags FileInfo, FileCases, FileClassification);
  3. Once you have the transformed output XML, you can write three procedures that insert the data into the DB (each entity into its own DB area).
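As a sketch of step 2, a transformation along these lines would split each case file into the three entities. The output element and field names here are illustrative, following the answer's FileInfo/FileCases/FileClassification naming, and are not a complete stylesheet:

```xml
<!-- Sketch only: splits a case-file into three entities. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/case-file">
    <FileInfo>
      <SerialNumber><xsl:value-of select="serial-number"/></SerialNumber>
      <FilingDate><xsl:value-of select="case-file-header/filing-date"/></FilingDate>
    </FileInfo>
    <FileCases>
      <xsl:copy-of select="case-file-statements/case-file-statement"/>
    </FileCases>
    <FileClassification>
      <xsl:copy-of select="classifications/classification"/>
    </FileClassification>
  </xsl:template>
</xsl:stylesheet>
```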
bummi