2

I am trying to help out a client who was unhappy with an EMR (Electronic Medical Records) system and wanted to switch but the company said they couldn't extract patient demographic data from the database (we asked if they can get us name, address, dob in a csv file of some sort, very basic stuff) - yet they claim they couldn't do that. (crazy considering they are using a sql database). Anyway - the way they handed over the patients were in xml files and there are about 40'000+ of them. But they contain a lot more than the demographics. After doing some research and having done extensive Perl programming 15 years ago (I admit it got rusty over the years) - I thought this should be a good task to get done in Perl - and I came across the XML::Twig module which seems to be able to do the trick. Unfortunately the xml code that is of interest looks like this:

<==snip==>

<patient extension="Patient ID Number">  // <--Patient ID is 5 digit number)
  <name>
    <family>Patient Family name</family>
     <given>Patient First/Given name</given>
     <given>Patient Middle Initial</given>
  </name>
  <birthTime value=YEARMMDD"/>

more fields for address etc.are following in the xml file.

<==snip==>

Here is what I coded:

my $twig=XML::Twig->new( twig_handlers => {
  'patient/name/family'      => \&get_family_name,
  'patient/name/given'       => \&get_given_name
});
$twig->parsefile('test.xml');

my @fields;

sub get_family_name {my($twig,$data)=@_;$fields[0]=$data->text;$twig->purge;}
sub get_given_name {my($twig,$data)=@_;$fields[1]=$data->text;$twig->purge;}

I have no problems reading out all the information that have unique tags (family, city, zip code, etc.) but XML:Twig only returns the middle initial for the tag. How can I address the first occurrence of "given" and assign it to $fields[1] and the second occurrence of "given" to $fields[2] for instance - or chuck the middle initial.

Also how do I extract the "Patient ID" or the "birthTime" value with XML::Twig - I couldn't find a reference to that. I tried using $data->findvalue('birthTime') but that came back empty.

I looked at: Perl, XML::Twig, how to reading field with the same tag which was very helpful but since the duplicate tags are in the same path it is different and I can't seem to find an answer. Does XML::Twig only return the last value found when finding a match while parsing a file? Is there a way to extract all occurrences of a value?

Thank you for your help in advance!

Community
  • 1
  • 1
yoeddy
  • 23
  • 2

1 Answers1

2

It is very easy to assume from the documentation that you're supposed to use callbacks for everything. But it's just as valid to parse the whole document and interrogate it in its entirety, especially if the data size is small

It's unclear from your question whether each patient has a separate XML file to themselves, and you don't show what encloses the patient elements, but I suggest that you use a compromise approach and write a handler for just the patient elements which extracts all of the information required

I've chosen to build a hash of information %patient out of each patient element and push it onto an array @patients that contains all the data in the file. If you have only one patient per file then this will need to be changed

I've resolved the problem with the name/given elements by fetching all of them and joining them into a single string with intervening spaces. I hope that's suitable

This is completely untested as I have only a tablet to hand at present, so beware. It does stand a chance of compiling, but I would be surprised if it has no bugs

use strict;
use warnings 'all';

use XML::Twig;

my @patients;

my $twig = XML::Twig->new(
    twig_handlers => { patient => \&get_patient }
);
$twig->parsefile('test.xml');

sub get_patient {
    my ($twig, $pat) = @_;

    my %patient;

    $patient{id} = $pat>att('extension');

    my $name         = $pat->first_child('name');yy
    $patient{family} = $name->first_child_trimmed_text('family');
    $patient{given}  = join ' ', $name->children_trimmed_text('given');

    $patient{dob}    = $pat->first_child('birthTime')->att('value');

    push @patients, \%patient;
}
Borodin
  • 126,100
  • 9
  • 70
  • 144
  • Hi, thank you for the quick response, I will try that tonight. But to your question - each patient has their own xml file - and we have about 40'000+ of them that I need to parse. And yes, I forgot the closing of patient in the last line - had a bit of trouble figuring out the formatting in the question box - and then forgot to add that at the end (since copy&paste doesn't work in the text box for code it seems). – yoeddy Aug 22 '16 at 13:14
  • Hi Borodin, thank you so much for your help - what you posted helped me understand the Twig module much better and I am able to parse the data that I need. No I can work on cleaning up the data and run it through these files. Thank you! – yoeddy Aug 25 '16 at 04:15
  • Lastly one more question - I love the pushing of the hash into the global array - very neat and clean. My question is - if I have to parse 40'000 files, is this going to be an issue? Having 40'000+ hashes in that array? Can perl handle such large arrays - or would it be better to write the results of each file parse straight into a csv file? I guess I am asking are small file operations more resource intensive than filling up the array with 40'000 hashes and writing them at the end into a csv file? thank you! :) – yoeddy Aug 25 '16 at 05:20
  • @yoeddy: I'm pleased to help, but I was unclear about the nature of your data and would have written something rather different if I knew that each file contained a single person's data. If all you need to do with this is to create a CSV (you should use [Text::CSV](https://metacpan.org/pod/Text::CSV) for that) then I would write a line of output for each input XML file and forget about the array of hashes altogether. Is that the entire content of an XML file, or is there more data? Is `patient` the root element, or is there more markup that encloses it? – Borodin Aug 25 '16 at 15:14
  • no worries, I should have been more clear. But I am actually happy because I liked that solution, even if I will look into the Text::CSV module. There is a bunch of other markup around the patient element - but I am able to isolate just the patient data I need and disregard the rest. But it was your examples code that helped me figure out how TWIG works, the help and examples were a bit generic and I couldn't figure out how to apply it to the weird xml files that were handed to us. So thank you, that was a big help. I will let you know if I need a little nudge with the text::CSV module. ;-) – yoeddy Aug 26 '16 at 03:12
  • @yoeddy: You should seriously consider storing *everything* in a proper database. (An SQLite database would need no persistent system process and would be a single data file.) You can be sure that you will be asked for "just one other" field from the original data tomorrow! – Borodin Aug 26 '16 at 03:18