Very simple XML parsing in Python

Question

I am trying to take a very long xml version of a spreadsheet (.xlsx) and remove one "line" (not sure what the right terminology is). I am not planning to do this again, so I don't really need to understand whats going on, just looking to get it done.

Anyway, as a smaller example, say I have this xml file:

   <?xml version="1.0" encoding="UTF-8" standalone="yes"?><worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac"><dimension ref="A1:C15"/><sheetViews><sheetView topLeftCell="B1" workbookViewId="0"><selection activeCell="B2" sqref="B2:C2"/></sheetView></sheetViews><sheetFormatPr defaultRowHeight="14.25" x14ac:dyDescent="0.45"/><cols><col min="1" max="1" width="0" hidden="1" customWidth="1"/><col min="2" max="2" width="61" customWidth="1"/><col min="3" max="3" width="92" customWidth="1"/></cols><sheetData><row r="1" spans="1:3" s="4" customFormat="1" ht="115.5" customHeight="1" thickTop="1" thickBot="1" x14ac:dyDescent="0.5"><c r="A1" s="5"/><c r="B1" s="9"/><c r="C1" s="10"/></row><row r="2" spans="1:3" s="6" customFormat="1" ht="19.899999999999999" thickTop="1" thickBot="1" x14ac:dyDescent="0.6"><c r="B2" s="107" t="s"><v>307</v></c><c r="C2" s="108"/></row><row r="3" spans="1:3" ht="14.65" x14ac:dyDescent="0.45"><c r="B3" s="73" t="s"><v>308</v></c><c r="C3" s="73" t="s"><v>309</v></c></row><row r="4" spans="1:3" ht="43.9" x14ac:dyDescent="0.45"><c r="B4" s="74" t="s"><v>311</v></c><c r="C4" s="75" t="s"><v>339</v></c></row><row r="5" spans="1:3" ht="58.5" x14ac:dyDescent="0.45"><c r="B5" s="76" t="s"><v>333</v></c><c r="C5" s="75" t="s"><v>340</v></c></row><row r="6" spans="1:3" ht="29.25" x14ac:dyDescent="0.45"><c r="B6" s="74" t="s"><v>324</v></c><c r="C6" s="75" t="s"><v>341</v></c></row><row r="7" spans="1:3" ht="29.25" x14ac:dyDescent="0.45"><c r="B7" s="76" t="s"><v>267</v></c><c r="C7" s="75" t="s"><v>342</v></c></row><row r="8" spans="1:3" ht="14.65" x14ac:dyDescent="0.45"><c r="B8" s="76" t="s"><v>266</v></c><c r="C8" s="77" t="s"><v>343</v></c></row><row r="9" spans="1:3" ht="14.65" x14ac:dyDescent="0.45"><c r="B9" s="76" t="s"><v>332</v></c><c r="C9" s="75" t="s"><v>344</v></c></row><row r="10" spans="1:3" ht="117" x14ac:dyDescent="0.45"><c r="B10" s="76" t="s"><v>334</v></c><c r="C10" s="75" t="s"><v>345</v></c></row><row r="11" spans="1:3" ht="29.25" x14ac:dyDescent="0.45"><c r="B11" s="76" t="s"><v>335</v></c><c r="C11" s="75" t="s"><v>346</v></c></row><row r="12" spans="1:3" ht="29.25" x14ac:dyDescent="0.45"><c r="B12" s="76" t="s"><v>310</v></c><c r="C12" s="75" t="s"><v>347</v></c></row><row r="13" spans="1:3" ht="43.9" x14ac:dyDescent="0.45"><c r="B13" s="74" t="s"><v>312</v></c><c r="C13" s="75" t="s"><v>348</v></c></row><row r="14" spans="1:3" ht="15" customHeight="1" x14ac:dyDescent="0.45"><c r="B14" s="76" t="s"><v>336</v></c><c r="C14" s="75" t="s"><v>349</v></c></row><row r="15" spans="1:3" ht="58.5" x14ac:dyDescent="0.45"><c r="B15" s="76" t="s"><v>331</v></c><c r="C15" s="75" t="s"><v>350</v></c></row></sheetData><sheetProtection algorithmName="SHA-256" hashValue="MeLuWwuvSadTE2AfpsCLlDGDNMvG0JwGYZE8v0YTGJU=" saltValue="yrhU49K7PCR2uiiV5Olfqg==" spinCount="100000" sheet="1" objects="1" scenarios="1"/><sortState ref="B4:C15"><sortCondition ref="B4:B15"/></sortState><mergeCells count="1"><mergeCell ref="B2:C2"/></mergeCells><pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/><pageSetup paperSize="9" orientation="portrait" r:id="rId1"/><drawing r:id="rId2"/></worksheet>

All I want to do is remove the following part of code:

<sheetProtection algorithmName="SHA-256" hashValue="MeLuWwuvSadTE2AfpsCLlDGDNMvG0JwGYZE8v0YTGJU=" saltValue="yrhU49K7PCR2uiiV5Olfqg==" spinCount="100000" sheet="1" objects="1" scenarios="1"/>

And then print the rest of the code as it was.

When I try do something like:

file = ET.parse('a.xml')
root = file.getroot()
for elem in root.findall("."):
    x = elem.find('sheetProtection')
    if x is not None:
            print("Found one!")
            elem.remove(check_elem)
file.write('b.xml')

First issue: it doesn't do what I want -it still prints the .

Second issue: The output b.xml includes all sorts of ns0, ns1 etc that weren't there before. I need the code to be as it was so it can be read back in excel.

Any help would be great!

If you don't care and it's a one time thing, `file_data.replace("""""", '')` would do the trick? — Torxed, May 25 '20 at 13:46
Wait sorry, I do care a little. I don't know that the line I want to delete in the real xml file has exactly that value. I just know it starts with sheetProtection. So something slightly more general will be needed. — user434180, May 25 '20 at 13:51
Sure, this is why it's important to not be **to** nonchalant when asking for help as we need specifics hehe. `your_data = re.sub('', '', your_data)` — Torxed, May 25 '20 at 13:55
Altho I share your opinion @kjhughes, I'm pretty sure there's a better way to phrase that :) — Torxed, May 25 '20 at 13:56
No I am with @kjhughes, I don't love this. I don't see the value in paying the fixed cost to learn XML to never use it again. — user434180, May 25 '20 at 13:56
@Torxed: It's phrased exactly as intended. As to your regex suggestion, it's misguided with even normal XML; for OOXML, which is very complex XML, it's worse than not responding. — kjhughes, May 25 '20 at 13:57
First simulate your plan via a text editor. There are many dependencies between elements and files in an OOXML document; you cannot in general just weed out an arbitrary element without attending to the dependencies. If your simulation works, then you can move over to doing it programmatically; if not, you'll have to read the OOXML spec and/or spend some time Eric White's tutorials. — kjhughes, May 25 '20 at 14:07
Programatically, it's simple to remove a single element via XSLT, and XSLT will scale to more complex transformations if your research indicates they're required. See this [getting started with XSLT/DOCX](https://stackoverflow.com/q/38298174/290085) q/a to get going. Then lookup any of the many examples about how to remove elements via the identity tranformation. — kjhughes, May 25 '20 at 14:10
If you're not up for XSLT, at least learn XPath with python (as you've tagged) per [this q/a](https://stackoverflow.com/q/61987189/290085) but with the XPath changed to match your targeted element. You'll need to [account for XML namespaces in your XPath?](https://stackoverflow.com/q/40796231/290085). — kjhughes, May 25 '20 at 14:13
If you really are only doing it once, why not just make the edit in Excel? — Karl Knechtel, May 25 '20 at 14:16

Very simple XML parsing in Python

0 Answers0