When I pass a payload to a method that converts an object to JSON, the namespace prefixes are removed from the elements. I want to retain the namespaces in the serialized JSON object.

INPUT HTML FILE

<?xml version="1.0" encoding="UTF-8"?><html lang="en">
<head>
<title>jahaahahjjajajajajjajaja</title>
</head>
<body id="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem"><a name="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem"><!-- --></a>
<main role="main"><article role="article" aria-labelledby="ariaid-title1">
    <h1 class="title topictitle1" id="ariaid-title1">jahaahahjjajajajajjajaja</h1>

    
    <content class="body conbody"><p class="shortdesc">Overview of the full tool chain for jahaahahjjajajajajjajaja UA content development. Describes the
        purpose of each tool and its intended end user.</p>

        <p class="p">The jahaahahjjajajajajjajaja User Assistance ecosystem is being updated to employ modern tools for
            structured content development, management, and delivery. The new tool chain combines
            several tools that enable the jahaahahjjajajajajjajaja information developer to create, publish, and
            maintain jahaahahjjajajajajjajaja UA content. </p>

        <p class="p">The new tools are grouped by function, enabling you to  <a class="xref" href="#c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_gqw_vkq_lgb">develop,</a>
            <a class="xref" href="#c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_btp_xkq_lgb">review,</a>
            <a class="xref" href="#c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_evf_zkq_lgb">manage,</a> and <a class="xref" href="#c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_bmm_1lq_lgb">deliver</a> consistent, accurate, and personalized UA content to
            jahaahahjjajajajajjajaja customers.</p>

        <p class="p">The new tools are shown in the diagram below, and explained more thoroughly in the
            Writer's Toolbox documentation.</p>

        <figure class="fig fignone" id="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__fig_j4y_qby_lgb"><a name="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__fig_j4y_qby_lgb"><!-- --></a>
            <a name="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__image_pl2_pc4_kgb"><!-- --></a>
            <ac:image xmlns:ac="urn:ac" xmlns:ri="urn:ri" xmlns:mf="urn:mf" id="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__image_pl2_pc4_kgb"><ri:attachment ri:filename="g_tool_chain.jpg"/></ac:image>
        </figure>

        <section class="section" id="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_gqw_vkq_lgb"><a name="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_gqw_vkq_lgb"><!-- --></a><h2 class="title sectiontitle">Content Development</h2>
            
            <p class="p">jahaahahjjajajajajjajaja is authoring content in the Darwin Information Typing Architecture (jahaahahjjajajajajjajaja), a
                technical communications XML standard, and thus requires a jahaahahjjajajajajjajaja-compliant XML
                Editor. jahaahahjjajajajajjajaja has chosen the jahaahahjjajajajajjajaja tool set for   to creating its UA content in
                jahaahahjjajajajajjajaja XML.</p>

            <dl class="dl">
                
                    <dt class="dt dlterm">jahaahahjjajajajajjajaja Editor</dt>

                    <dd class="dd"> jahaahahjjajajajajjajaja Editor is a desktop editor that should be used by any information
                        developer whose main job is to create UA content.</dd>

                
                
                    <dt class="dt dlterm">jahaahahjjajajajajjajaja Web Author</dt>

                    <dd class="dd"> jahaahahjjajajajajjajaja Web Author is a browser-based editor that should be used by any
                        content contributor, such as a Subject Matter Expert (SME), who does not
                        write full-time and does not typically have the need nor desire to learn
                        jahaahahjjajajajajjajaja XML.</dd>

                
            </dl>

        </section>

        <section class="section" id="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_btp_xkq_lgb"><a name="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_btp_xkq_lgb"><!-- --></a><h2 class="title sectiontitle">Content Review</h2>
            
            <p class="p">Because jahaahahjjajajajajjajaja is a topic-based architecture, jahaahahjjajajajajjajaja needs a review platform that is
                both lightweight and allows for topic-based reviews, as opposed to reviews of full
                books or chapters. jahaahahjjajajajajjajaja's jahaahahjjajajajajjajaja platform meets these requirements and
                will be the main platform for reviewing UA content.</p>

            <dl class="dl">
                
                    <dt class="dt dlterm">jahaahahjjajajajajjajaja</dt>

                    <dd class="dd">
                        <p class="p">The jahaahahjjajajajajjajaja platform has two components: an "add-on" that is part
                            of the jahaahahjjajajajajjajaja Editor desktop application, and a web interface where
                            reviewers can add their comments and even make changes.</p>

                        <p class="p">The add-on is used by content owners to put their topics into review, get
                            a URL, and share the URL with chosen content reviewers.</p>

                    </dd>

                
            </dl>

        </section>

        <section class="section" id="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_evf_zkq_lgb"><a name="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_evf_zkq_lgb"><!-- --></a><h2 class="title sectiontitle">Content Management</h2>
            
            <p class="p">jahaahahjjajajajajjajaja UA content will be stored centrally in a Git repository, Bitbucket, and
                managed locally with the SourceTree client application. Working copies of content
                will reside on client (local) machines and be pushed to the shared repository when
                ready to be shared. </p>

            <dl class="dl">
                
                    <dt class="dt dlterm">Bitbucket</dt>

                    <dd class="dd">Bitbucket is a Git repository that provides jahaahahjjajajajajjajaja UA a central, shared
                        repository for content. Its main interface is a browser-based web interface,
                        although it can also be accessed via command line and desktop applications
                        such as SourceTree. jahaahahjjajajajajjajaja authors will use Bitbucket web client to
                        collaborate with one another on the shared repository. </dd>

                
                
                    <dt class="dt dlterm">SourceTree</dt>

                    <dd class="dd">SourceTree is a client application that connects to Git repositories.
                        jahaahahjjajajajajjajaja authors will use SourceTree to manage both remote and local versions
                        of their content. Because it is a client application, SourceTree has the
                        advantage of being able to track activity at the local level. </dd>

                
                
                    <dt class="dt dlterm">File Explorer</dt>

                    <dd class="dd">Windows Explorer (Windows) or Finder (Mac) will be used by jahaahahjjajajajajjajaja authors
                        to store and organize local versions of their content before pushing to the
                        shared repository.</dd>

                
            </dl>

        </section>

        <section class="section" id="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_bmm_1lq_lgb"><a name="c_jahaahahjjajajajajjajaja_ua_tools_ecosystem__section_bmm_1lq_lgb"><!-- --></a><h2 class="title sectiontitle">Content Delivery</h2>
            
            <p class="p">jahaahahjjajajajajjajaja's jahaahahjjajajajajjajaja content will be published through the open source jahaahahjjajajajajjajaja Open Toolkit
                (jahaahahjjajajajajjajaja-OT). The jahaahahjjajajajajjajaja-OT will be kicked off via the jahaahahjjajajajajjajaja Editor interface.</p>

            <dl class="dl">
                
                    <dt class="dt dlterm">jahaahahjjajajajajjajaja Open Toolkit</dt>

                    <dd class="dd">The jahaahahjjajajajajjajaja-OT transforms jahaahahjjajajajajjajaja XML to different formats for consumption by a
                        customer. jahaahahjjajajajajjajaja will use the jahaahahjjajajajajjajaja-OT to produce PDF, WebHelp, Word, and
                        CHM formats.</dd>
                
            </dl>

        </section>

    </content>

</article></main></body>
</html>

Python code that reads the HTML file and retrieves the content element. It then creates a JSON string.

import json
import xml.etree.ElementTree as ET
class Page:
    def __init__(self, type, title, space, body):
        self.type = type
        self.title = title
        self.space = space
        self.body = body
        
    def getPageTitle(self):
        return self.title

    def getType(self):
        return self.type

    def getContent(self):
        # The constructor stores the page content in self.body.
        return self.body

    def getJSONObject(self):
        jsonobj = json.dumps(self.__dict__)
        return jsonobj

class childPage(Page):
    def __init__(self, type, title, ancestors, space, body):
        self.type = type
        self.title = title
        self.ancestors = ancestors
        self.space = space
        self.body = body


def getContent(file):
        
        tree=ET.parse(file)
        root=tree.getroot()
        title2 = findTitle(root)
        body2 = findContent(root)
        print(body2)
        return title2, body2

def findTitle(root):
    for e in root.findall('head'):
        title3 = e.find('title').text
        return title3

def findContent(root):
    for e in root.findall('body'):
        body3 = e.find('main/article/content')
        return ET.tostring(body3).decode("utf-8")

title, value = getContent("test.html")
space = {"key": "TOOL"}
ancestors = [{"id":245}]
body = {"storage":{"value":value, "representation":"storage"}}
pageob = childPage("page", title, ancestors, space, body)
print (pageob.getJSONObject())

This code runs. But when the element is serialized with ET.tostring and the bytes are decoded, the namespace prefixes are replaced with generated ones: for example, ac:image comes out as ns0:image.
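The behavior can be reproduced without the full file. The following is a minimal sketch, assuming a trimmed-down stand-in for the input above (SAMPLE and serialize are hypothetical names, not part of the original code). It parses a small fragment and serializes it the same way findContent does.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the HTML input file above.
SAMPLE = (
    '<content>'
    '<ac:image xmlns:ac="urn:ac" xmlns:ri="urn:ri">'
    '<ri:attachment ri:filename="g_tool_chain.jpg"/>'
    '</ac:image>'
    '</content>'
)


def serialize(xml_text):
    """Parse the text and serialize it again, as findContent does."""
    root = ET.fromstring(xml_text)
    return ET.tostring(root).decode("utf-8")


output = serialize(SAMPLE)
# In a fresh interpreter the ac: and ri: prefixes are not preserved;
# ElementTree writes generated prefixes such as ns0: and ns1: instead.
print(output)
```

The namespace URIs (urn:ac, urn:ri) survive the round trip; only the prefixes change, because ElementTree stores tags by URI and does not remember the prefix used in the source.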

I am not a professional developer. Please forgive any mistakes in the code. Could you please help me fix this? Thank you in advance.

Antony
  • Could you show a longer code snippet, from input to output? json.dumps alone may not cause this. – waynelpu Apr 04 '19 at 05:32
  • Your createChildPage code still can't reproduce the problem for me. I guess the real problem is that you share the dictionary object in the code and replace the content in other places. – waynelpu Apr 04 '19 at 07:08
  • @waynelpu do you want me to post more details? I copied only snippets. Basically, if the value of the key "value" includes a tag with the ac:name space (ac:image for example), it replaces ac with ns0. Only that portion of the code is not working. I am getting the desired output otherwise in terms of structure. – Antony Apr 04 '19 at 07:11
  • Yeah, the smallest code that reproduces the problem is better – waynelpu Apr 04 '19 at 07:18
  • @waynelpu sorry for the trouble. I have posted a working piece of code. You should be able run this code and reproduce the issue. – Antony Apr 04 '19 at 07:59
  • The code output (body -> storage -> value) contains exactly the value that is in the input. So, as I guessed above, the problem may not be in json.dumps but in other pieces of code. You can run the code and check that the namespace in the output is not modified – waynelpu Apr 04 '19 at 08:13
  • @waynelpu, the output is not the same as the input in some cases. The output in some cases contain If I delete some parts of the input string, I get the desired output. Something in the string. But unable to pin point what exactly is causing the issue. – Antony Apr 04 '19 at 08:24
  • Could you give some input that causes the problem? The input in the current post does not cause any problem now – waynelpu Apr 04 '19 at 08:32
  • I think the problem occurs much before the serialization. Actually, the input data is retrieved from HTML pages. When I decode the data from a node to string, it is causing the issue. I have posted the remaining part. – Antony Apr 04 '19 at 08:49
  • @waynelpu I edited the post again - included the html content as well. – Antony Apr 04 '19 at 09:04
  • I used "assert value in json.loads(pageob.getJSONObject())['body']['storage']['value']" at the end of the code to check whether value is the same as the output, and it passes. It means the code above still does not reproduce the problem. – waynelpu Apr 04 '19 at 09:29
  • @waynelpu, are you reading the html file? I guess the problem is here. ```def findContent(root): for e in root.findall('body'): body3 = e.find('main/article/content') return ET.tostring(body3).decode("utf-8")``` Please consider the HTML content as the input. In the HTML content, there are elements with the name space ac:. If the element is retrieved, the namespaces are replaced. So the problem is not with the JSON object. – Antony Apr 04 '19 at 09:36
  • @waynelpu thank you so much for your time. I found a solution to the problem here https://stackoverflow.com/questions/54439309/how-to-preserve-namespaces-when-parsing-xml-via-elementtree-in-python We need to register namespaces when using Elementtree. – Antony Apr 04 '19 at 10:03

1 Answer

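When I register the namespaces with ElementTree (ET.register_namespace) before serializing, the problem goes away: ET.tostring then keeps the original prefixes instead of generating ones such as ns0. I found the answer here: How to preserve namespaces when parsing xml via ElementTree in Python (https://stackoverflow.com/questions/54439309/how-to-preserve-namespaces-when-parsing-xml-via-elementtree-in-python)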
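As a rough sketch of that fix, the snippet below collects every prefix declared in the document with ET.iterparse and the "start-ns" event, then registers each one with ET.register_namespace before serializing. SAMPLE, register_document_namespaces, and serialize_with_prefixes are hypothetical names chosen for illustration, and the fragment is a trimmed-down stand-in for the real input file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the HTML input file.
SAMPLE = (
    '<content>'
    '<ac:image xmlns:ac="urn:ac" xmlns:ri="urn:ri">'
    '<ri:attachment ri:filename="g_tool_chain.jpg"/>'
    '</ac:image>'
    '</content>'
)


def register_document_namespaces(source):
    """Register every prefix declared in the document.

    `source` is a filename or a binary file-like object. After this call,
    ET.tostring reuses the registered prefixes for these namespace URIs.
    """
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=["start-ns"]):
        namespaces[prefix] = uri
        ET.register_namespace(prefix, uri)
    return namespaces


def serialize_with_prefixes(xml_text):
    """Register the document's prefixes, then parse and serialize it."""
    register_document_namespaces(io.BytesIO(xml_text.encode("utf-8")))
    root = ET.fromstring(xml_text)
    return ET.tostring(root).decode("utf-8")


output = serialize_with_prefixes(SAMPLE)
print(output)  # the ac: and ri: prefixes are kept
```

If the prefixes are known in advance, calling ET.register_namespace("ac", "urn:ac") and ET.register_namespace("ri", "urn:ri") once before ET.tostring should be enough. Note that the registry is global to the process, and that the xmlns declarations are written on the outermost serialized element rather than where they appeared in the source.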

Antony