
I'm working with a massive XML file and don't have access to the code that was used to pull the information out of SQL, so I can't fix it from that end. I need to get rid of the repeated elements while somehow still capturing all the info inside them, keeping it attached to the original element that has the same uci.

xml is formatted like this:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <clients>
        <client>
            <uci>5581</uci>
            <contact>
                <firstName>FRANK</firstName>
                <middleName>N.</middleName>
                <lastName>FUR</lastName>
                <address>
                    <address1>9 LOA</address1>
                    <address2></address2>
                    <city>TENO</city>
                    <province>MAULE</province>
                    <postalCode>1000000</postalCode>
                    <country>CHILE</country>
                </address>
            </contact>
        </client>
        <client>
            <uci>5585</uci>
            <contact>
                <firstName>ANN</firstName>
                <middleName>P.</middleName>
                <lastName>TERR</lastName>
                <address>
                    <address1>5 KING ST</address1>
                    <address2></address2>
                    <city>BARRIE</city>
                    <province>ON</province>
                    <postalCode>L4M 2N9</postalCode>
                    <country>CANADA</country>
                </address>
            </contact>
            <contact>
                <firstName>ANN</firstName>
                <middleName>P.</middleName>
                <lastName>TERR</lastName>
                <address>
                    <address1>5 KING ST</address1>
                    <address2></address2>
                    <city>BARRIE</city>
                    <province>ON</province>
                    <postalCode>L4M 2N9</postalCode>
                    <country>CANADA</country>
                </address>
            </contact>
        </client>
    </clients>
</root>
----------------------------------------------------------------------------------

Notice that under <uci>5585</uci>, a second contact is added before it moves on to the next client.

In the XML doc I was given, every second/third/etc. contact was put in its own client element with the same uci repeated. So instead of something like this:

  <clients>
        <client>
            <uci>5581</uci>
            <contact>
                <firstName>FRANK</firstName>
                <middleName>N.</middleName>
                <lastName>FUR</lastName>
                <address>
                    <address1>9 LOA</address1>
                    <address2></address2>
                    <city>TENO</city>
                    <province>MAULE</province>
                    <postalCode>1000000</postalCode>
                    <country>CHILE</country>
                </address>
            </contact>
        </client>
        <client>
            <uci>5581</uci>
            <contact>
                <firstName>Justin</firstName>
                <middleName>T</middleName>
                <lastName>Thomas</lastName>
                <address>
                    <address1>9 LOA</address1>
                    <address2></address2>
                    <city>TENO</city>
                    <province>MAULE</province>
                    <postalCode>1000000</postalCode>
                    <country>CHILE</country>
                </address>
            </contact>
        </client>
------------------------------------------------------------------------
it should look like this for all duplicated ucis:
 <clients>
        <client>
            <uci>5581</uci>
            <contact>
                <firstName>FRANK</firstName>
                <middleName>N.</middleName>
                <lastName>FUR</lastName>
                <address>
                    <address1>9 LOA</address1>
                    <address2></address2>
                    <city>TENO</city>
                    <province>MAULE</province>
                    <postalCode>1000000</postalCode>
                    <country>CHILE</country>
                </address>
            </contact>
            <contact>
                <firstName>Justin</firstName>
                <middleName>T</middleName>
                <lastName>Thomas</lastName>
                <address>
                    <address1>9 LOA</address1>
                    <address2></address2>
                    <city>TENO</city>
                    <province>MAULE</province>
                    <postalCode>1000000</postalCode>
                    <country>CHILE</country>
                </address>
            </contact>
        </client>
---------------------------------------------------------------------------

I have no way of accessing the FileWriter that was used, but I still need to make these changes.
I'm guessing Python is the way to go? Originally I was just going to edit this in Notepad++ by hand, but once I looked at the file, it's 50,000+ lines.
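
A minimal sketch of that merge in Python, assuming the standard library's xml.etree.ElementTree and that the whole file fits in memory (50,000 lines should be fine). The function name merge_clients_by_uci and the file names in the comments are made up for illustration, and it assumes every client has exactly the uci/contact layout shown above.

```python
import xml.etree.ElementTree as ET


def merge_clients_by_uci(root):
    """Move the contacts of later <client> elements into the first
    <client> that has the same <uci>, then drop the emptied duplicates."""
    for clients in list(root.iter("clients")):
        first_seen = {}  # uci text -> first <client> element with that uci
        for client in list(clients.findall("client")):
            uci = (client.findtext("uci") or "").strip()
            if not uci:
                continue  # leave clients without a uci untouched
            if uci not in first_seen:
                first_seen[uci] = client
                continue
            keeper = first_seen[uci]
            # Only <contact> children are carried over; the repeated <uci>
            # (and anything else in the duplicate) goes away with it.
            for contact in client.findall("contact"):
                keeper.append(contact)
            clients.remove(client)
    return root


# Hypothetical usage on a file (ET.indent needs Python 3.9+):
# tree = ET.parse("clients.xml")
# merge_clients_by_uci(tree.getroot())
# ET.indent(tree, space="    ")
# tree.write("clients_merged.xml", encoding="UTF-8", xml_declaration=True)
```

Contacts are appended in document order, so a uci with 3+ contacts ends up with all of them under the first client. The duplicates do not need to be adjacent in the file for this to work.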

You'll notice that I can't just search for duplicate lines, as that would also match legitimately repeated lines like
<country>ABC</country>
<country>ABC</country>
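
One way around the line-matching problem is to compare whole elements instead of single lines. The sketch below is only an illustration of that idea, assuming Python's xml.etree.ElementTree; element_key and drop_repeated_contacts are made-up names, and whether exact-duplicate contacts should be dropped at all depends on the data.

```python
import xml.etree.ElementTree as ET


def element_key(elem):
    """Build a hashable fingerprint of an element from its tag, attributes,
    trimmed text and (recursively) its children."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(element_key(child) for child in elem),
    )


def drop_repeated_contacts(client):
    """Remove <contact> children that are exact repeats of an earlier
    <contact> under the same <client>; everything else is left alone."""
    seen = set()
    for contact in list(client.findall("contact")):
        key = element_key(contact)
        if key in seen:
            client.remove(contact)
        else:
            seen.add(key)
```

Because the comparison covers the entire contact, two contacts that merely share a country or an address line are not treated as duplicates.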

Would appreciate any and all help



I tried Find and Replace in Notepad++ and even tried deleting every so many lines, but that doesn't leave the XML properly formed, and apparently some of these accounts have 3+ contacts.

I also thought about finding the duplicates in C#, but I'm not sure how to replace them so that the XML stays correctly formed. I even tried grabbing the base data and generating an XML from Excel after deleting every other line in the uci column, but that deletes all of the contact info, as name/address/etc. are tied to the uci.

I thought this whole format was going to be needlessly complicated, and now I'm the one who has to fix it.
Baron89