
I want to reformat a string into an XML structure, but my string is not in XML format (I am using Python 2.7). I believe the correct way is to first create a one-line XML representation of the input and then use XML pretty printing to turn it into an XML file with multiple rows and indentation (Pretty printing XML in Python).

Below is an example of the input returned by a History Server REST API call to Hadoop server 1. Input:

'{"jobAttempts":{"jobAttempt":[{"nodeHttpAddress":"slave2:8042","nodeId":"slave2:39637","id":1,"startTime":1544691730439,"containerId":"container_1544631848492_0013_01_000001","logsLink":"http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2"}]}}' 

Output:

'<jobAttempts><jobAttempt><nodeHttpAddress>slave2:8042</nodeHttpAddress><nodeId>slave2:39637</nodeId><id>1</id><startTime>1544691730439</startTime><containerId>container_1544631848492_0013_01_000001</containerId><logsLink>http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2</logsLink></jobAttempt></jobAttempts>' 

Final Output:

<jobAttempts>
    <jobAttempt>
        <nodeHttpAddress>slave2:8042</nodeHttpAddress>
        <nodeId>slave2:39637</nodeId>
        <id>1</id>
        <startTime>1544691730439</startTime>
        <containerId>container_1544631848492_0013_01_000001</containerId>
        <logsLink>http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2</logsLink>
    </jobAttempt>
</jobAttempts>

*This string is actually the content of an XML file that does not appear to have any style information associated with it.
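For the conversion step itself, a minimal sketch of the two-step approach described above (parse the JSON string, emit one-line XML, then pretty-print it with xml.dom.minidom) could look like the following. The helper name dict_to_xml is hypothetical and only handles the shapes seen in the example (nested dicts, lists, and scalar leaves):

# A minimal sketch, not a full solution: convert the JSON-like input string
# to one-line XML, then pretty-print it. dict_to_xml is a hypothetical helper.
import json
import collections
import xml.dom.minidom

def dict_to_xml(obj):
    """Recursively turn a parsed JSON object into a one-line XML string."""
    if isinstance(obj, dict):
        parts = []
        for key, value in obj.items():
            if isinstance(value, list):
                # Repeat the tag for every list item, e.g. several <jobAttempt> elements.
                parts.extend("<{0}>{1}</{0}>".format(key, dict_to_xml(item)) for item in value)
            else:
                parts.append("<{0}>{1}</{0}>".format(key, dict_to_xml(value)))
        return "".join(parts)
    return str(obj)

raw = '{"jobAttempts":{"jobAttempt":[{"nodeHttpAddress":"slave2:8042","nodeId":"slave2:39637","id":1,"startTime":1544691730439,"containerId":"container_1544631848492_0013_01_000001","logsLink":"http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2"}]}}'

# object_pairs_hook=OrderedDict preserves the element order of the original string.
data = json.loads(raw, object_pairs_hook=collections.OrderedDict)
one_line_xml = dict_to_xml(data)
pretty_xml = xml.dom.minidom.parseString(one_line_xml).toprettyxml(indent="    ")
print(pretty_xml)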


1 Answer


I have found out that the source view of the History Server REST API's response is indeed a one-line XML file. Thus, I had to read the source view with Python, not the old, problematic rendered view. Before, I used:

import urllib2

# This fetched the rendered (problematic) view instead of the raw XML source.
contents = urllib2.urlopen("http://23.22.43.90:19888/ws/v1/history/mapreduce/jobs/job_1544631848492_0013/jobattempts").read()

Now, I download the source view of the HTML page with Selenium and BeautifulSoup and save it locally:

from bs4 import BeautifulSoup
from selenium import webdriver
import xml.dom.minidom

# Load the page in a real browser so we get the source view of the response.
driver = webdriver.Firefox()
driver.get("http://23.22.43.90:19888/ws/v1/history/mapreduce/jobs/job_1544631848492_0013/jobattempts")
page_source = driver.page_source
driver.close()

# Parse the downloaded source with BeautifulSoup and print it for inspection.
soup = BeautifulSoup(page_source, "html.parser")
print(soup)

# Pretty-print the one-line XML with indentation and write it to a local file.
dom = xml.dom.minidom.parseString(str(soup))
pretty_xml_as_string = dom.toprettyxml()
with open("./content_new_2.xml", "w") as output_file:
    output_file.write(pretty_xml_as_string)
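As a quick sanity check (just a sketch, not part of the original workflow), the file written above can be re-parsed with minidom to confirm it is well-formed XML:

import xml.dom.minidom

# Re-parse the saved file; this raises an exception if the XML is malformed.
reparsed = xml.dom.minidom.parse("./content_new_2.xml")
print(reparsed.documentElement.tagName)  # prints the root element name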