
I am trying to parse multiple XML files into a table with named columns. The files do not all contain the same elements, and only some of the data in each file matters to me.

For example (XML data):

<setId root="ABD6ECF0-DC8E"/>
<component>
            <section>
                <id root="F08C6A14-8165-458A-BDC8-0B5878EB814D"/>
                <code code="34069-5" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="HOW SUPPLIED SECTION"/>
                <title mediaType="text/x-hl7-title+xml">HOW SUPPLIED</title>
                <text>
                    <paragraph>RENESE (polythiazide) Tablets are available as:</paragraph>
                    <paragraph>1 mg white, scored tablets in bottles of 100 (NDC 0069-3750-66).</paragraph>
                    <paragraph>2 mg yellow, scored tablets in bottles of 100 (NDC 0069-3760-66).</paragraph>
                    <paragraph>4 mg white, scored tablets in bottles of 100 (NDC 0069-3770-66).</paragraph>
                </text>
                <effectiveTime value="20051214"/>
            </section>
        </component>

<component>
            <section>
                <id root="42CB916B-BB58-44A0-B8D2-89B4B27F04DF"/>
                <code code="34089-3" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="DESCRIPTION SECTION"/>
                <title mediaType="text/x-hl7-title+xml">DESCRIPTION</title>
                <text>
                    <paragraph>Renese<sup>&#174;</sup> is designated generically as polythiazide, and chemically as 2<content styleCode="italics">H</content>-1,2,4-Benzothiadiazine-7-sulfonamide, 6-chloro-3,4-dihydro-2-methyl-3-[[(2,2,2-trifluoroethyl)thio]methyl]-, 1,1-dioxide. It is a white crystalline substance, insoluble in water but readily soluble in alkaline solution.</paragraph>
                    <paragraph>Inert Ingredients: dibasic calcium phosphate; lactose; magnesium stearate; polyethylene glycol; sodium lauryl sulfate; starch; vanillin. The 2 mg tablets also contain: Yellow 6; Yellow 10.</paragraph>
                </text>
                <effectiveTime value="20051214"/>
            </section>
        </component>
        <component>
            <section>

                         <manufacturedProduct>
                         <manufacturedMedicine>
                            <code code="0069-3750" codeSystem="2.16.840.1.113883.6.69" codeSystemName="FDA" displayName="NDC"/>
                            <name>Renese</name>
                            <formCode code="C42998" codeSystem="2.16.840.1.113883.3.26.1.1" displayName="TABLET"/>
                         </manufacturedMedicine>
                         </manufacturedProduct>

I want the end result to look like this, with setID, description, and name as the column names:

setID

ABD6ECF0-DC8E

description

Renese is designated generically as polythiazide, and chemically as 2H-1,2,4-Benzothiadiazine-7-sulfonamide, 6-chloro-3,4-dihydro-2-methyl-3-[[(2,2,2-trifluoroethyl)thio]methyl]-, 1,1-dioxide. It is a white crystalline substance, insoluble in water but readily soluble in alkaline solution.

name

Renese

1 Answer

If I understand correctly, you are trying to parse the XML downloaded from this site: https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=abd6ecf0-dc8e-41de-89f2-1e36ed9d6535

Actually, it is not generic XML: it is an HL7 document, specifically HL7 Version 3 (namespace urn:hl7-org:v3).

To deal with this quickly, there is an open source tool, Mirth (https://www.mirth.com/), which does the job pretty well, and a commercial one, Iguana (http://www.interfaceware.com/iguana.html). By the way, thank you for your post; it gave me the occasion to test Mirth.

In practice, you need to convert your XML to HL7 V3 format to get the information you need. Here is an example of a channel that I used for your XML, together with its output: https://www.dropbox.com/sh/ibosv56m0monmcj/AACL7t6ZKOi4P-Bwpi75KhUXa?dl=0

For more information, I suggest you look at: Convert XML to HL7 messages using Mirth Connect

If you need to use Python after all, you can look at the HL7 (http://hl7apy.org/) and FHIR (https://pypi.python.org/pypi/fhir/0.0.4) packages.

For parsing regular XML with Python, several methods are described here: How do I parse XML in Python? (Personally, I'm a fan of BeautifulSoup and lxml.)
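As a rough illustration of the plain-Python route, here is a minimal sketch using only the standard library's xml.etree.ElementTree. It is not the Mirth approach described above. It assumes the file declares urn:hl7-org:v3 as its default namespace, that the description section can be recognised by the LOINC code 34089-3 shown in the question, and that all paragraphs of that section should be joined. The `<document>` wrapper in the sample and the `extract_row` helper are hypothetical names made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace mentioned above; assumed to be the document's default namespace.
NS = {"hl7": "urn:hl7-org:v3"}

# Shortened stand-in for one downloaded file (the wrapper element is an assumption).
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <setId root="ABD6ECF0-DC8E"/>
  <component>
    <section>
      <code code="34089-3" displayName="DESCRIPTION SECTION"/>
      <title>DESCRIPTION</title>
      <text>
        <paragraph>Renese<sup>&#174;</sup> is designated generically as polythiazide.</paragraph>
        <paragraph>Inert Ingredients: lactose; starch.</paragraph>
      </text>
    </section>
  </component>
  <component>
    <section>
      <manufacturedProduct>
        <manufacturedMedicine>
          <name>Renese</name>
        </manufacturedMedicine>
      </manufacturedProduct>
    </section>
  </component>
</document>"""


def extract_row(xml_text):
    """Return one table row (a dict) for one document; missing fields become None."""
    root = ET.fromstring(xml_text)
    set_id = root.find(".//hl7:setId", NS)
    name = root.find(".//hl7:manufacturedMedicine/hl7:name", NS)

    description = None
    for section in root.iter("{urn:hl7-org:v3}section"):
        code = section.find("hl7:code", NS)
        if code is not None and code.get("code") == "34089-3":
            paragraphs = section.findall("hl7:text/hl7:paragraph", NS)
            # itertext() flattens inline markup such as <sup> and <content>;
            # split()/join() collapses runs of whitespace.
            description = " ".join(
                " ".join("".join(p.itertext()).split()) for p in paragraphs
            )
            break

    return {
        "setID": set_id.get("root") if set_id is not None else None,
        "description": description,
        "name": name.text if name is not None else None,
    }


row = extract_row(SAMPLE)
print(row)
```

For several files, the same function could be called once per file and the resulting dicts collected in a list, which csv.DictWriter or pandas.DataFrame can then turn into a table with setID, description, and name as columns.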

Hope that helps. Good luck.

NajlaBioinfo