1

I have to parse a large amount of XML files and write it to a text file. However, some of the XML files contain special/illegal characters. I have tried using different methods such as Escaping strings for use in XML but I could not get it to work. I know 1 way to solve this is by editing the XML file itself, but there are thousands of files.

import os
from xml.etree import ElementTree
from xml.sax.saxutils import escape

fileNum = 0;
saveFile = open('NewYork_1.txt','w')

for path, dirs, files in os.walk("NewYork_1"):
   for f in files:
      fileName = os.path.join(path, f)
      with open(fileName,'r', encoding='utf-8') as myFile:
        # print(myFile.read())
        if "&" in myFile:
            myFile = myFile.replace("&", "&") #This does not work.

Also, some of the XML files also have unicode in them. There are emojis.

<Thread>
   <ThreadID></ThreadID>
   <Title></Title>
   <InitPost>
      <UserID></UserID>
      <Date></Date>
      <icontent></icontent>
   </InitPost>
   <Post>
      <UserID></UserID>
      <Date></Date>
      <rcontent></rcontent>
  </Post>
</Thread>
Kamarul Adha
  • 113
  • 1
  • 2
  • 10
  • did you check `print(myFile)` ? If you want this in file then you have to `write()` it. – furas Feb 16 '20 at 00:44
  • Do you have a short example of the badly behaving xml? How are you parsing the xml? Unicode is legal in xml, the emojis shouldn't cause a problem. – tdelaney Feb 16 '20 at 00:50
  • My mistake. The above code is just to show how I parse the files. The full program is able to write to an output file. When there is a special character in the XML file, it will give an error saying xml.etree.ElementTree.ParseError: undefined entity. Hence it will not write the full result. – Kamarul Adha Feb 16 '20 at 00:50
  • Is it in fact an entity (`&something;`) and does the xml have a declaration like `` at the top? The reason for an example is so that we can see what alternatives are available. – tdelaney Feb 16 '20 at 00:53
  • 1
    If you replace all `"&"` with `"&"`, you run the risk of corrupting a legitimate xml entity. – tdelaney Feb 16 '20 at 01:03
  • At the start of the xml file, it only has – Kamarul Adha Feb 16 '20 at 01:09
  • You may want to rewrite the first line with a full xml declaration. When you say unicode emojis, do you know if the file is utf-8 or utf-16 encoded? Your example doesn't show any stray ampersands. I'm just trying to figure out if they are valid entity references. – tdelaney Feb 16 '20 at 01:18
  • I know all of the files are utf-8 encoded.. – Kamarul Adha Feb 16 '20 at 01:20
  • I'm not sure I understand your question. We have very little information to work with here. – AMC Feb 16 '20 at 01:20
  • I will try to explain this as simple as possible. Basically I have to convert xml files that was given from my client. She asked me to convert those xml files into text files. That's pretty much it really. – Kamarul Adha Feb 16 '20 at 01:32
  • @KamarulAdha I was hoping for more detail. – AMC Feb 16 '20 at 02:30
  • There are no errors in the XML snippet in the question, so that does not provide any clues. And "some of the XML files also have unicode in them" is not helping either. ALL characters are Unicode characters. – mzjn Feb 16 '20 at 06:49

3 Answers3

3

Try this:

import os
from xml.etree import ElementTree
from xml.sax.saxutils import escape

fileNum = 0;
saveFile = open('NewYork_1.txt','w')
saveFile.close()
for path, dirs, files in os.walk("NewYork_1"):
   for f in files:
       fileName = os.path.join(path, f)
       with open(fileName,'a', encoding='utf-8') as myFile:
           myFile=myFile.read()
           if "&" in myFile:
                myFile = myFile.replace("&", "&#38;")

I personally would generate a list of files to read and iterate through that list rather than use os.walk (if you're getting the list from a previous function or a separate script, you could always create a text file with each txt file on a separate line and iterate through lines rather than getting from a variable to save RAM), but to each his own.

As I said, though, I'd discard the whole idea of replacing special characters and use bs4 to open the files, search for which elements you're looking for, and grab from there.

import bs4
list_of_USER_IDs=[]
with open(fileName,'r', encoding='utf-8') as myFile:
    a=myFile.read()
    b=bs4.beautifulSoup(a)
    for elem in b.findAll('USERID'):
        list_of_USER_IDs.append(elem)

That returns the data between the USERID tags, but it'll work for whatever tag you want data from. No need to directly parse xml. It's really not much different than HTML and beautifulSoup is made for that, so why reinvent the wheel?

mzjn
  • 48,958
  • 13
  • 128
  • 248
  • saveFile = open('NewYork_1.txt','w') is not within a with clause... and str(string) may be redundant, but some machines will parse a string as an object. I've seen it happen even with list elements on some versions of python. "elem[i]" in a for loop, for example, could *possibly* behave differently without str() than with it, even though by default a list element taken that way is a string object. – Pete Marise Feb 16 '20 at 01:05
  • `MyFile` D: ... `my_file` :D – AMC Feb 16 '20 at 01:20
  • I assume those 2 lines are inserted after `with open(fileName,'r', encoding='utf-8') as myFile:` If that's the case, it gave me an error message. For some reason it gives me FileNotFoundError: [Errno 2] No such file or directory. Strange how it does not give that error when I delete your code. – Kamarul Adha Feb 16 '20 at 01:40
  • No. You close the file on the line following the command to open as a "w" type. saveFile = open('NewYork_1.txt','w') saveFile.close() – Pete Marise Feb 16 '20 at 01:44
  • Sorry, I don't quite follow. I am still trying to figure out where to properly insert your line of code. As for the file, at the end of the program, I already have `saveFile.close()` – Kamarul Adha Feb 16 '20 at 03:54
  • Yes, your code did work for me. I managed to parse and convert all of the XML files. Is there any way to check the next character after the '&'? Your code, `myFile = myFile.replace("&", "&")` works wonders. However I just realised, some of the XML have other escape sequence such as '. Hence, is there a way for me to convert the ampersands that does not have '#' after it? – Kamarul Adha Feb 16 '20 at 22:13
  • 1
    Not quite sure what you mean. ' is the unicode for apostrophe, so just use replace the ampersand with a single quote in the replace code, and the unicode as what it's replacing. If you don't know what a particular escape sequence refers to, google the escape sequence. For example $#40; will get translated to ( in google if you search $#40; . Make a replace for each character you need the unicode for. – Pete Marise Feb 16 '20 at 23:06
  • 1
    For example: myFile=myFile.replace("'", "'").replace("\)","("). I think you use that backslash in this instance for the parentheses. Kinda burnt out today. Try it with and without to see which works. Or you could download a table of characters and their unicode equivalents – Pete Marise Feb 16 '20 at 23:10
  • I managed to solve this problem using if else statements.. Basically I converted all XML escape sequence to a string, such as ' into "SingleQuote". After this, only then I change every other ampersands into the correct escape sequence which is &. Then I changed every string that has "SingleQuote" back into the original code, '. By doing this I did not disturb any other escape sequence. Not the prettiest way of doing it, but still managed to solve it. – Kamarul Adha Feb 17 '20 at 21:36
  • if it works, it doesn't have to be pretty ;) Good work! – Pete Marise Feb 17 '20 at 22:23
1

One more thing...

You can use BeautifulSoup to just find the parts of the xml file that you're extracting data from. Would stagger processing as-needed rather than loading the entire file into a variable and then iterating element by element.

For example:

If your xml file has tags such as:

<pos reviews>
<review_id>001<review_id>
<reviewer_name>John Stamos (From full house)</reviewer_name>
<review_text>This is a great item</review_text>

You could just beautifulsoup.findAll(whatever tag) and just grab that data. I don't know what you're doing specifically, though. Figured it was worth suggesting.

  • This should probably be an edit to your other answer, no? _Would stagger processing as-needed rather than loading the entire file into a variable and then iterating element by element._ How does using BeautifulSoup lead to that? – AMC Feb 16 '20 at 01:22
  • BeautifulSoup opens and reads the document, but ONLY returns the tag requested. A variable takes up RAM, even if it is a string or an integer. Imagine loading xml files that contain gigabytes worth of data into a variable. Takes up way more ram than say a few kb of data out of that file. And iterating over that variable takes up processing. Bigger variable->bigger processing task. I recently built an NLP program that severely overtaxed my processor. Had to rewrite it three times to get it to run properly on my hardware. – Pete Marise Feb 16 '20 at 01:30
  • _BeautifulSoup opens and reads the document, but ONLY returns the tag requested._ I thought all the data was read into memory? – AMC Feb 16 '20 at 01:36
  • Perhaps for as long as it takes to extract the information, sure. But say you load the entire document as a string. Not sure exactly if it even loads the whole thing into memory. I think beautifulSoup findALL just opens the document and looks for instances. In which case, it's not putting that file in memory - it's just reading from ROM and storing the elements that were searched for into memory. If it's a gigabyte-sized file and you load the whole thing into a variable, it's holding a gigabyte of data in RAM throughout the rest of the program unnecessarily... – Pete Marise Feb 16 '20 at 01:50
  • _But say you load the entire document as a string. Not sure exactly if it even loads the whole thing into memory._ If reading the entire file into a string keeps the contents in memory? _I think beautifulSoup findALL just opens the document and looks for instances._ I don't think I've heard of BeautifulSoup working like this, do you have any resources on the subject? – AMC Feb 16 '20 at 01:55
  • _I recently built an NLP program that severely overtaxed my processor. Had to rewrite it three times to get it to run properly on my hardware._ I like optimized code, is it available anywhere? GitHub? – AMC Feb 16 '20 at 01:56
  • It's for a client :/ Can't share. But shoot me an email if you want some code snippets that'll give you some tricks for getting more AI out of less GPU and RAM. Petemarise32@gmail.com – Pete Marise Feb 16 '20 at 02:21
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/207913/discussion-between-amc-and-pete-marise). – AMC Feb 16 '20 at 02:29
0

What you can do is read the file in as a string.

file_str = open("xml_file.xml", "r+")

Then use the fromString function of xml.etree.ElementTree

tree = xml.etree.ElementTree.fromString(file_str)

Then do stuff with it.

root = tree.getRoot()

This way you don't have to worry about writing code for every emoji, and invalid sign. If you want to write emojis and special utf-8 characters to xml, you have to add this line to the top.

 <?xml version="1.0" encoding="utf-8" ?>
DontBe3Greedy
  • 564
  • 5
  • 12
  • Correct me if I am wrong. Are you suggesting to parse the xml file as a string?So that way it can read ampersand and other special characters? This way it does not have to interpret individual character in the file? – Kamarul Adha Feb 16 '20 at 01:11
  • I am saying read the xml file in as a string. Then put that string into the fromStringFunction. So basically read it like a normal text file ``` file1 = open("myfile.txt","r+") file_str = file1.read() tree = xml.etree.ElementTree.fromString(file_str) root = tree.getRoot() ``` – DontBe3Greedy Feb 16 '20 at 01:24
  • my previous comment was formatted very poorly sorry. This is what I meant to say if you didn't understand it. Read the XML file in as a string file_str = open("myfile.txt", "r+").read() Then pass it into the fromString function tree = xml.etree.ElementTree.fromString(file_str) then do stuff with it root = tree.getRoot() I will aslo edit my answer for you – DontBe3Greedy Feb 16 '20 at 01:36
  • Thank you for formatting your answer. I assume the code to read file can be modified to read the folder? – Kamarul Adha Feb 16 '20 at 01:46
  • What do you mean by reading a folder? Do you mean to read the files in the folder individually? Don't try to parse the at the same time. Because that would lead to some problems. – DontBe3Greedy Feb 16 '20 at 01:59
  • Yes, I do mean reading the folder that has all of the xml files. In my program, I used `for path, dirs, files in os.walk("NewYork_1"):` to read all of the files in that directory. – Kamarul Adha Feb 16 '20 at 02:01
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/207912/discussion-between-kamarul-adha-and-dontbe3greedy). – Kamarul Adha Feb 16 '20 at 02:05
  • 1
    Yeah, that is fine, as long as you don't try to read all the files into one string and pass that string because then, there would be no root, which could cause some problems later on. But that could be solved by appending a root tag to the beginning and end of the string. – DontBe3Greedy Feb 16 '20 at 02:06
  • 2
    The name of the ElementTree function is `fromstring` (all lower-case). And if the XML content is ill-formed, parsing it as a string does not solve the problem. – mzjn Feb 16 '20 at 07:40