I have huge XML datasets (2-40GB). Some of the data is confidential, so I am trying to edit the dataset to mask all of the confidential information. I have a long list of each value that needs to be masked, so for example if I have ID 'GYT-1064' I need to find and replace every instance of it. These values can be in different fields/levels/subclasses, so in one object it might have 'Order-ID = GYT-1064' whereas another might say 'PO-Name = GYT-1064'. I have looked into iterparse but cannot figure out how to in-place edit the xml file instead of building the entire new tree in memory, because I have to loop through it multiple times to find each instance of each ID.

Ideal functionality:

For each element, if a given string is in the element's text, replace it and write the change back to the XML file.

I have a solution that works if the dataset is small enough to load into memory, but I can't figure out how to correctly leverage iterparse. I've also looked at every answer that talks about lxml iterparse, but since I need to iterate through the entire file multiple times, I need to be able to edit it in place.

Simple version that works, but has to load the whole xml into memory (and isn't in-place)

import xml.etree.ElementTree as ET

values_to_mask = ['val1', 'GMX-103', 'etc-555']  # imported list of values to mask

# ET.parse builds the whole tree in memory, so this only works for small files
with open(dataset_name, encoding='utf8') as f:
    tree = ET.parse(f)
root = tree.getroot()

for old in values_to_mask:
    new = mu.generateNew(old, randomnumber)  # utility to generate the replacement value
    for elem in root.iter():
        if elem.text:  # elem.text is None for elements with no text
            elem.text = elem.text.replace(old, new)

tree.write(output_name, encoding='utf8')

What I attempted with iterparse:

with open(output_name, mode='rb+') as f:
    context = etree.iterparse( f )
    for old in values_to_mask:
        new = mu.generateNew(old, randomnumber)
        mu.fast_iter(context, mu.replace_if_exists, old, new, f)

def replace_if_exists(elem, old, new, xf):
    try:
        if old in elem.text:
            elem.text = elem.text.replace(old, new)
            xf.write(elem)
    except AttributeError:
        pass

It runs but doesn't replace any text, and I get print(context.root) = 'Null'. Additionally, it doesn't seem like it would correctly write back to the file in place.

The XML data generally looks like this (hierarchical objects with subclasses):

<Master_Data_Object>
  <Package>
    <PackageNr>1000</PackageNr>
    <Quantity>900</Quantity>
    <ID>FAKE_CONFIDENTIALGYO421</ID>
    <Item_subclass>
      <ItemType>C</ItemType>
      <MasterPackageID>FAKE_CONFIDENTIALGYO421</MasterPackageID>
    </Item_subclass>
  </Package>
  <Other_Types>

dzifzar
  • Updated the post to show a bit of the XML data – dzifzar Jul 26 '19 at 18:00
  • Here is a solid walkthrough of converting XML to a `pandas` dataframe, at which point filtering your dataset and applying masks is highly efficient: https://medium.com/@robertopreste/from-xml-to-pandas-dataframes-9292980b1c1c – rahlf23 Jul 26 '19 at 18:05

2 Answers

Since no sample dataset was provided, I would suggest the following:

1) Read the file in a loop, a manageable chunk at a time (for example with readlines() and a size hint), instead of loading it all at once.

2) Use a regular expression to identify the confidential values (if possible), then replace them.
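
A rough sketch of these two steps, treating the XML as plain text and streaming it line by line into a new file. The names `build_masker`, `mask_file`, and `replacements` are made up for illustration; `replacements` is assumed to be a non-empty dict mapping each confidential value to its masked value. It also assumes no value spans a line break, that no single line is too large for memory, and that the values contain no characters that XML would escape (such as `&` or `<`). Because it ignores XML structure, it will also replace matches inside tag names and attributes.

```python
import re


def build_masker(replacements):
    """Return a function that applies every old -> new substitution in one pass."""
    if not replacements:
        return lambda text: text
    # Longest values first, so an ID that is a prefix of another cannot shadow it.
    ordered = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(old) for old in ordered))
    return lambda text: pattern.sub(lambda m: replacements[m.group(0)], text)


def mask_file(src_path, dst_path, replacements, encoding="utf8"):
    """Stream src_path line by line and write the masked copy to dst_path."""
    mask = build_masker(replacements)
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:  # iterating a file object reads one line at a time
            dst.write(mask(line))
```

Since all values are compiled into one pattern, the file is read only once regardless of how many IDs are in the list. With the names from the question, the mapping could be built as `{old: mu.generateNew(old, randomnumber) for old in values_to_mask}`. The result goes to a second file rather than being edited in place, which is safer when the replacement has a different length than the original.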

Let me know if it works

Anshul

You can use a SAX parser for big XML files, since it processes the document as a stream instead of loading it into memory. See this answer: Editing big xml files using sax parser
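
A minimal sketch of what that could look like with the standard library's `xml.sax`, not taken from the linked answer. `MaskingFilter`, `mask_xml`, and `replacements` are hypothetical names; `replacements` is assumed to be a non-empty dict of confidential value to masked value. The filter passes every SAX event through to an `XMLGenerator` that writes a new file, changing only element text. Attribute values are left alone, and comments and the DOCTYPE are not carried over, because the content handler never sees them.

```python
import re
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class MaskingFilter(XMLFilterBase):
    """Forwards SAX events unchanged, except that text content is masked."""

    def __init__(self, parent, replacements):
        super().__init__(parent)
        self._replacements = replacements
        # Longest values first, so a value that prefixes another cannot shadow it.
        ordered = sorted(replacements, key=len, reverse=True)
        self._pattern = re.compile("|".join(re.escape(v) for v in ordered))
        self._buffer = []

    def _flush(self):
        # The parser may split one text node across several characters() calls,
        # so the pieces are buffered and masked once the text node is complete.
        if self._buffer:
            text = "".join(self._buffer)
            self._buffer = []
            masked = self._pattern.sub(
                lambda m: self._replacements[m.group(0)], text)
            super().characters(masked)

    def characters(self, content):
        self._buffer.append(content)

    def startElement(self, name, attrs):
        self._flush()
        super().startElement(name, attrs)

    def endElement(self, name):
        self._flush()
        super().endElement(name)

    def endDocument(self):
        self._flush()
        super().endDocument()


def mask_xml(src_path, dst_path, replacements):
    """Parse src_path as a stream and write the masked document to dst_path."""
    with open(dst_path, "w", encoding="utf-8") as out:
        masker = MaskingFilter(xml.sax.make_parser(), replacements)
        masker.setContentHandler(XMLGenerator(out, encoding="utf-8"))
        masker.parse(src_path)
```

Memory use stays roughly constant because only the current text node is held at any time, and all values are handled in a single pass over the file. The output is a separate file; once it has been checked, it can replace the original.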

codinnvrends