2

Input: the HTML

<SPAN id=idxSpan><OBJECT id=IndiDocX codeBase="/IndiDocX.CAB#version=4,5,0,132" classid=clsid:43B180A2-396A-45CE-86D1-9680E4A9952C width=500 height=201 VIEWASTEXT><PARAM NAME="_ExtentX" VALUE="13229"><PARAM NAME="_ExtentY" VALUE="5318"><PARAM NAME="BackColor" VALUE="0"><PARAM NAME="ForeColor" VALUE="0"><PARAM NAME="Enabled" VALUE="True"><PARAM NAME="BackStyle" VALUE="0"><PARAM NAME="BorderStyle" VALUE="0"><PARAM NAME="iWidth" VALUE="800"><PARAM NAME="iHeight" VALUE="200"><PARAM NAME="MainDocUNID" VALUE="AAB092D735084A064825852D00372312"><PARAM NAME="ServerIP" VALUE="zboa3.sinopec.com"><PARAM NAME="DbPath" VALUE="/sinopec4/dep4809/swgl_4809.nsf"><PARAM NAME="DocForm" VALUE="frmIndiDocs"><PARAM NAME="FileInfos" VALUE="<!1!>6E09605B0382AFBA482585290033D844<file_unid>QFGG0L8JG5XF4C389PW5</file_unid><file_name>关于印发《党内关怀帮扶实施细则》的通〔2019〕67号 ).sep</file_name><file_size>42315</file_size><file_create>2020-3-12 17:34:23</file_create><file_update>2020-3-12 17:34:23</file_update><file_editmodel>0</file_editmodel><doc_unid>4825795A000CAA904825852D0001DA87</doc_unid></!1!><!2!>6E09605B0382AFBA482585290033D844<file_unid>NM6NEGOCXG5PSBMGFVMQ</file_unid><file_name>公司党内关怀帮扶实施标准.docx</file_name><file_size>20581</file_size><file_create>2020-3-12 17:34:26</file_create><file_update>2020-3-12 17:34:26</file_update><file_editmodel>0</file_editmodel><doc_unid>4825795A000CAA904825852D0001DAB0</doc_unid></!2!>
<!3!>6E09605B0382AFBA482585290033D844<file_unid>6M0ZGTE3H0FH4PN9QBT0</file_unid><file_name>公司发〔2020〕19号关于转发《关于印发〈党内关怀帮扶实施细则〉的通知》的通知.pdf</file_name><file_size>95471</file_size><file_create>2020-3-16 18:6:48</file_create><file_update>2020-3-16 18:6:48</file_update><file_editmodel>0</file_editmodel><doc_unid>4825795A000CAA904825852D0036ECE1</doc_unid></!3!>"
><PARAM NAME="Editable" VALUE="True"><PARAM NAME="WordTrack" VALUE="True"><PARAM NAME="WordLock" VALUE="True"><PARAM NAME="UpdInfoDocID" VALUE="4825795A000CAA9048258529003409BE"><PARAM NAME="SessionID" VALUE="554C571D3511CF5D338DC83F58767F21"><PARAM NAME="FileNum" VALUE="0"><PARAM NAME="FileNames" VALUE=""><PARAM NAME="FileSelNames" VALUE=""><PARAM NAME="LockForm" VALUE="True"><PARAM NAME="IsShowTrack" VALUE="True"><PARAM NAME="MenuValue" VALUE="11110000"><PARAM NAME="CanUseHandMark" VALUE="1"><PARAM NAME="CanHandMarkFile" VALUE="1"><PARAM NAME="CanClearHandMarkFile" VALUE="1"><PARAM NAME="HandMarkFileWidth" VALUE="6"><PARAM NAME="CanChangeHandMarkFile" VALUE="1"><PARAM NAME="Version" VALUE="V12"><PARAM NAME="WebServerVersion" VALUE="379"><PARAM NAME="EngFileName" VALUE="true"></OBJECT></SPAN>

I need a function that produces output in this format:

[{'file_name':**value of <file_name>**,'url':http://server.com/**value of <doc_unid>**/$file/**value of <file_unid>****value of <file_name> ext part**}]

I think my code is bad, and it doesn't give the result I want. I used bs4 like this:

soup = BeautifulSoup(string_html, 'lxml', exclude_encodings='utf-8')
data = soup.find('param', attrs={'name': 'FileInfos'})['value']
soup_data = BeautifulSoup(data, 'lxml', exclude_encodings='utf-8')
for n in soup_data.find_all(name=['doc_unid','file_unid','file_name']):
        print(n.doc_unid)

Why doesn't this work?

html4 = re.sub(r'(\<)(/?)\!(\d+\!)', r'<\g<2>li', html)
soup = BeautifulSoup(html4, 'lxml')
data = soup.find('param', attrs={'name': 'FileInfos'})['value']
data1 = '<ul>' + data + '</ul>'
soup_data = BeautifulSoup(data1, 'lxml')
for n in soup_data.children:
    print(n.doc_unid.string)

Why do I get only one item?

JasonYun
  • 3
    A regex fundamentally can’t parse HTML, and is the wrong tool here. Use an HTML parser (e.g. BeautifulSoup). – Konrad Rudolph Mar 16 '20 at 12:35
  • This is only part of the HTML, and the value I need is in PARAM NAME="FileInfos" VALUE= – JasonYun Mar 16 '20 at 12:37
  • Nevertheless, using an HTML parser vastly simplifies this problem. Even HTML fragments are in general not describable using regex; and when they are, the corresponding regex tends to be complex and brittle. – Konrad Rudolph Mar 16 '20 at 12:45
  • I have no idea how to do that. Can you show me the way? – JasonYun Mar 16 '20 at 12:46
  • Citing the first comment: *"Use an HTML parser (e.g. BeautifulSoup)"*. There is almost infinite material and examples and documentation on how to set this up and use it, if you look for it. – Tomalak Mar 16 '20 at 12:52
  • Does this answer your question? [Parsing HTML using Python](https://stackoverflow.com/questions/11709079/parsing-html-using-python) – Toto Mar 16 '20 at 14:10

2 Answers

1

I don't know how useful or relevant this is but you could give it a try:

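import os

def get_info(text, section):
    # Return the text between the first <section> and </section> tags.
    start = text.index("<" + section + ">") + len(section) + 2
    end = text.index("</" + section + ">", start)
    return text[start:end]

def format_info(html_code):
    # Only the first file entry found in html_code is handled.
    file_name = get_info(html_code, "file_name")
    ext = os.path.splitext(file_name)[1]  # e.g. ".docx"
    return {
        "file_name": file_name,
        "url": "http://server.com/" + get_info(html_code, "doc_unid")
               + "/$file/" + get_info(html_code, "file_unid") + ext,
    }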
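
The lookup helper above stops at the first match, while the FileInfos value holds several entries. A minimal sketch of walking every occurrence with plain string searches might look like this; `get_all` is a hypothetical name, and it assumes the tags are never nested inside themselves.

```python
def get_all(text, section):
    """Return the contents of every <section>...</section> pair, in order."""
    open_tag = "<" + section + ">"
    close_tag = "</" + section + ">"
    values = []
    pos = 0
    while True:
        start = text.find(open_tag, pos)
        if start == -1:
            break  # no further opening tag
        start += len(open_tag)
        end = text.find(close_tag, start)
        if end == -1:
            break  # unterminated tag: stop instead of guessing
        values.append(text[start:end])
        pos = end + len(close_tag)
    return values


sample = "<file_name>a.docx</file_name><file_name>b.pdf</file_name>"
print(get_all(sample, "file_name"))
```

Calling it once per field (file_name, file_unid, doc_unid) gives three parallel lists that can be zipped together, provided every entry carries all three fields.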
Skalex
0

Part of your problem is that the value string in FileInfos is not actually parseable XML. See the parts starting with <!? Those are Markup Declaration Open (MDO) delimiters, which effectively "comment out" that part. It would appear that each entry under VALUE is wrapped in this numbered-tag notation.

If you replace the opening and closing <!n!> tags with the corresponding <li> tags and wrap the whole block in <ul></ul>, then you can parse that section as normal XML. You can use a regex like </?!\d+!> to locate your MDO tags.

Here's an example python snippet:

import re

example = '<!1!>stuff</!1!><!2!>things</!2!>'
print(re.sub(r'(\<)(/?)\!(\d+\!)', r'<\g<2>li', example))
# <li>stuff</li><li>things</li>

You'll need to add the ul tags but this should get you most of the way there.
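
As a rough sketch of the remaining step, the converted string can be wrapped in <ul> and handed to an XML parser. This version swaps in the standard library's `xml.etree.ElementTree` for BeautifulSoup so that it has no third-party dependency; `file_infos_to_xml` and the sample values are made up for illustration, and it assumes the file names contain no raw `&` or `<` characters.

```python
import re
import xml.etree.ElementTree as ET


def file_infos_to_xml(value):
    """Turn <!n!>...</!n!> wrappers into <li>...</li> and parse the result."""
    converted = re.sub(r'<(/?)!\d+!>', r'<\1li>', value)
    # A single root element is required for well-formed XML.
    return ET.fromstring('<ul>' + converted + '</ul>')


# Shortened, made-up stand-in for the real FileInfos value.
sample = (
    '<!1!>AAA<file_unid>U1</file_unid><file_name>a.docx</file_name>'
    '<doc_unid>D1</doc_unid></!1!>\n'
    '<!2!>AAA<file_unid>U2</file_unid><file_name>b.pdf</file_name>'
    '<doc_unid>D2</doc_unid></!2!>'
)

root = file_infos_to_xml(sample)
for li in root.findall('li'):
    print(li.findtext('file_name'), li.findtext('doc_unid'))
```

Iterating over the <li> elements, rather than over the top-level children of the parsed document, is what yields one result per file.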

Here's what the entire output should look like once you've processed it:

<ul>
<li>6E09605B0382AFBA482585290033D844
    <file_unid>QFGG0L8JG5XF4C389PW5</file_unid>
    <file_name>关于印发《党内关怀帮扶实施细则》的通〔2019〕67号 ).sep</file_name>
    <file_size>42315</file_size>
    <file_create>2020-3-12 17:34:23</file_create>
    <file_update>2020-3-12 17:34:23</file_update>
    <file_editmodel>0</file_editmodel>
    <doc_unid>4825795A000CAA904825852D0001DA87</doc_unid>
</li>
<li>6E09605B0382AFBA482585290033D844
    <file_unid>NM6NEGOCXG5PSBMGFVMQ</file_unid>
    <file_name>公司党内关怀帮扶实施标准.docx</file_name>
    <file_size>20581</file_size>
    <file_create>2020-3-12 17:34:26</file_create>
    <file_update>2020-3-12 17:34:26</file_update>
    <file_editmodel>0</file_editmodel>
    <doc_unid>4825795A000CAA904825852D0001DAB0</doc_unid>
</li>
<li>6E09605B0382AFBA482585290033D844
    <file_unid>6M0ZGTE3H0FH4PN9QBT0</file_unid>
    <file_name>公司发〔2020〕19号关于转发《关于印发〈党内关怀帮扶实施细则〉的通知》的通知.pdf</file_name>
    <file_size>95471</file_size>
    <file_create>2020-3-16 18:6:48</file_create>
    <file_update>2020-3-16 18:6:48</file_update>
    <file_editmodel>0</file_editmodel>
    <doc_unid>4825795A000CAA904825852D0036ECE1</doc_unid>
</li>
</ul>
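
Putting the pieces together, one possible end-to-end sketch of the function the question asks for is shown below. It uses only the standard library (`html.parser` to read the PARAM attributes, `xml.etree.ElementTree` for the converted list) in place of BeautifulSoup. `ParamCollector`, `list_files` and the `base_url` parameter are hypothetical names, `http://server.com` is the placeholder from the question, and file names are assumed to contain no raw `&` or `<`.

```python
import os
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ParamCollector(HTMLParser):
    """Collect NAME -> VALUE pairs from <PARAM> tags."""

    def __init__(self):
        super().__init__()
        self.params = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        if tag == 'param':
            attr = dict(attrs)
            if 'name' in attr:
                self.params[attr['name']] = attr.get('value') or ''


def list_files(html, base_url='http://server.com'):
    """Return [{'file_name': ..., 'url': ...}] for each FileInfos entry."""
    collector = ParamCollector()
    collector.feed(html)
    collector.close()
    value = collector.params.get('FileInfos', '')
    # <!n!> ... </!n!>  ->  <li> ... </li>, then wrap in one root element.
    converted = re.sub(r'<(/?)!\d+!>', r'<\1li>', value)
    root = ET.fromstring('<ul>' + converted + '</ul>')
    files = []
    for li in root.findall('li'):
        name = li.findtext('file_name', default='')
        ext = os.path.splitext(name)[1]  # extension including the dot
        url = '{}/{}/$file/{}{}'.format(
            base_url, li.findtext('doc_unid'), li.findtext('file_unid'), ext)
        files.append({'file_name': name, 'url': url})
    return files


# Made-up miniature of the OBJECT markup from the question.
demo = ('<OBJECT id=IndiDocX><PARAM NAME="FileInfos" VALUE="'
        '<!1!>AAA<file_unid>U1</file_unid><file_name>a.docx</file_name>'
        '<doc_unid>D1</doc_unid></!1!>"></OBJECT>')
print(list_files(demo))
```

Reading the attribute with an HTML parser first keeps the regex confined to the small, regular <!n!> wrapper format rather than the surrounding HTML.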

DeusXMachina