
I'm trying to convert:

<doc id="123" url="http://url.org/thing?curid=123" title="title"> 
Title

text text text more text

</doc>

into a CSV file (the file has a ton of similarly formatted "documents"). If it were a regular XML file I think I would be able to figure it out with a solution like this, but since the above is not well-formed XML I'm stuck.

What I'm trying to do is import the data into PostgreSQL, and from what I gather it would be easier to do so if the information is in CSV format (if there's another way, please let me know). What I need is to separate out "id", "url", "title" and "text/body".

Bonus question: the first line of the text/body is the title of the document; would it be possible to remove/manipulate this first line during the conversion?

Thanks!

    This part is not clear: "*the file has a ton of similarly formatted "documents"*". If it has more than one `doc` element, show us an example that has at least two, including the wrapper. And show us the expected output too, please. – michael.hor257k Jul 11 '15 at 20:24
  • A_A's answer pretty much covered everything. The file has a bunch of these doc elements, isolated with no wrapping. I added the necessary code with `echo '<docCollection>' | cat - app > temp && mv temp app && echo '</docCollection>' >> app` (where app is the file name). – govn Jul 12 '15 at 09:13
  • If there's no wrapper (i.e. a single root element), then it's not XML. – michael.hor257k Jul 12 '15 at 10:25

1 Answer


As far as Python is concerned:

Given an XML file (thedoc.xml) like:

<?xml version="1.0" encoding="UTF-8"?>
<docCollection>
    <doc id="123" url="http://url.org/thing?curid=123" title="Farenheit451"> 
    Farenheit451

    It was a pleasure to burn...
    </doc>

    <doc id="456" url="http://url.org/thing?curid=456" title="Sense and sensitivity"> 
    Sense and sensitivity

    It was sensibile to be sensitive &amp; nice...
    </doc>        
</docCollection>

And a script (thecode.py) using lxml, as:

from lxml import etree
import pandas
import HTMLParser  # Python 2; in Python 3 use html.parser / html.unescape instead

inFile = "./thedoc.xml"
outFile = "./theprocdoc.csv"

# It is likely that your XML is too big to be parsed into memory in one go;
# for this reason it is better to use the incremental parser from lxml.
# It is initialised here to trigger an "end" event every time a "doc" tag
# has been fully parsed.
ctx = etree.iterparse(inFile, events=("end",), tag=("doc",))

hp = HTMLParser.HTMLParser()
csvData = []
# For every parsed element in the "context"...
for event, elem in ctx:
    # ...isolate the tag's attributes and apply some formatting to its text.
    # hp.unescape turns entities such as &amp; back into plain characters;
    # remove it if you are not interested in HTML unescaping. Note that the
    # body is simply split at the newline character and rejoined from the
    # third line onwards, to omit the title line.
    csvData.append({"id": elem.get("id"),
                    "url": elem.get("url"),
                    "title": elem.get("title"),
                    "body": hp.unescape("".join(elem.text.split("\n")[2:]))})
    # It is important to call clear here, to release the memory occupied
    # by the element's parsed data.
    elem.clear()

# Finally, simply turn the list of dictionaries into a DataFrame and write
# out the CSV. pandas' to_csv is used here for convenience.
pandas.DataFrame(csvData).to_csv(outFile, index=False)

It will generate a CSV (theprocdoc.csv) that looks like:

body,id,title,url
        It was a pleasure to burn...    ,123,Farenheit451,http://url.org/thing?curid=123
        It was sensibile to be sensitive & nice...    ,456,Sense and sensitivity,http://url.org/thing?curid=456
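
Since the end goal is PostgreSQL, the resulting CSV can then be loaded with `COPY` (or psql's client-side `\copy`). A minimal sketch; the table name `docs` and the column types are assumptions, not part of the original post:

```sql
-- Hypothetical target table; column names and order mirror the CSV header.
CREATE TABLE docs (
    body  text,
    id    integer,
    title text,
    url   text
);

-- From within psql: \copy reads the file on the client side,
-- and HEADER skips the first row of the CSV.
\copy docs FROM 'theprocdoc.csv' WITH (FORMAT csv, HEADER true);
```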

For more information (and since I cannot format links within inline comments) please see lxml.etree.iterparse, HTMLParser.HTMLParser.unescape, pandas.DataFrame.to_csv.

Hope this helps.

A_A
  • This solves the problem to the last detail, and it even has the bonus question answered perfectly. Thanks a bunch for the quick reply! – govn Jul 12 '15 at 09:07
  • Thank you, glad you found the response helpful. – A_A Jul 12 '15 at 15:48
  • It turns out there is actually one thing that I can't get to work here, the cgi.escape (I think). When it hits a `&` or a `<` the program stops. For the `&`s I could easily manually replace them with `&amp;` but for the `<`s I can't, since they are used in the `<doc>` tags and so on. So in order to get this to work in all cases I think something has to be done with the `"body":cgi.escape(""` part. Thanks again for your help! – govn Jul 12 '15 at 19:07
  • Please see amended response, have added an `&` in the content which is unescaped to the corresponding character in the CSV. – A_A Jul 12 '15 at 21:13
  • Again, thanks for the help here, unfortunately I can't get this to work. Python will still throw an error when trying to convert `<`, `>` and `&`. For example if I try to run the program through this: `in most cases 5 is < than 6 & 7` it throws an error. The `&`s I can search and replace with `&amp;`, but this won't work for the `<`s, since they are used in the XML formatting and so on. I may be missing something very basic here, but I just can't figure it out! If the program would just ignore `<>&` in the "body", or (even better) convert them to `&lt;`, `&gt;` and `&amp;` it would work. – govn Jul 13 '15 at 15:09
  • No worries. As I mention in my earlier message, these symbols should have been escaped properly during the "making-of" the `doc` elements. In essence, the python script "breaks" BEFORE it starts the main processing and specifically when it is parsing the XML in which case random `<` or `>` confuse the tag parsing. Can you re-export the `doc`s with proper escaping? – A_A Jul 13 '15 at 15:14
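
To illustrate that last point: if the `doc` elements are generated from raw text, the XML-significant characters need to be escaped at export time, before the parser ever sees them. A minimal sketch using the standard library's `xml.sax.saxutils.escape` (the `id`/`url`/`title` values here are made up for the example):

```python
from xml.sax.saxutils import escape

raw = "in most cases 5 is < than 6 & 7"

# escape() replaces the XML-significant characters &, < and > with their
# entities (&amp;, &lt;, &gt;), so the text can be embedded safely inside
# a <doc> element and round-tripped back by HTMLParser.unescape.
safe = escape(raw)

doc = ('<doc id="789" url="http://url.org/thing?curid=789" title="example">\n'
       'example\n\n' + safe + '\n</doc>')
```

Escaping the bodies this way at export time would let the iterparse-based script above run without modification.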