
I am looking for matches in the following string of text:

'<html xmlns:msdt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:mso="urn:schemas-microsoft-com:office:office">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   SN G2250-010\n  </title>\n  <!--[if gte mso 9]><xml>\n<mso:CustomDocumentProperties>\r\n<mso:Service_x0020_Note msdt:dt="string">SN</mso:Service_x0020_Note>\r\n<mso:Order msdt:dt="string">1493700.00000000</mso:Order>\r\n<mso:ContentType msdt:dt="string">Document</mso:ContentType>\r\n</mso:CustomDocumentProperties>\n</xml><![endif]-->\n </head>\n <link href="..\\..\\_format.css" rel="stylesheet" type="text/css"/>\n <body>\n  <table>\n   <tr>\n    <td>\n     <img border="0" src="SN_G2250_010//r1_logo1.gif"/>\n    </td>\n    <td align="left" width="178">\n     <img border="0" src="SN_G2250_010//r1_logo2.gif"/>\n    </td>\n    <td>\n     <div class="subtitle2">\n      <b>\n       <font color="red">\n        Life Sciences and Chemical Analysis Service Note\n       </font>\n      </b>\n     </div>\n    </td>\n   </tr>\n  </table>\n  <h2>\n   SERVICE NOTE G2250-010\n  </h2>\n  <pre>Supersedes: None\r\n \r\nINB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\nSerial Numbers:\r\nUS00000000 - US99999999\r\n\r\nThe CCMode software is in general compatible with Windows 2000 and \r\nChemStation Revision A.9.01. Please see required settings!\r\n\r\nTo Be Performed By:\r\nAgilent-Qualified Personnel\r\n\r\nParts Required:\r\n\r\nNone\r\n\r\nSituation:\r\nChanges of operating software to Windows 2000 and implementation\r\nof ChemStation Rev. A.9.01 required some testing of the CCMode \r\n\r\nsoftware INB22000 / INB22002 / INB22003 and INB22004 Rev. A.03.02.\r\n\r\nSolution/Action:\r\nBefore using the Micro-plate Sampling Software INB22000 / INB22002 \r\n/ INB22003 or INB22004 Rev. A.03.02 (CCMode)  on a PC with \r\nWindows 2000 a minor change in the "Control panel" must be made. \r\nIf this change is not made some icons in the user interface will \r\nnot be represented correctly. 
The functionality itself is not \r\ninfluenced:\r\n\r\nOpen "Settings", "Control Panel", "Display", "Appearance".\r\n\r\nGo to the "Scheme" and select the choice "Windows Classic". \r\nPress "OK" and close the "Control Panel" window.Required "Regional \r\nSettings" for both WIN NT and WIN2000\r\n\r\nIn order to run and edit parameters within CC-Mode your \r\nPC must be setup in this way:\r\n\r\n- Regional settings: English (United States)\r\n- Number format (default for English (United States)) \r\n  Decimal symbol  \'.\'\r\n- Number format (default for English (United States)) \r\n  Digit grouping symbol  \',\'\r\n\r\nNotes about using WIN2000:\r\n\r\n1. The installation and operation of CCMode (A.03.0x) and \r\nPurify SW (A.01.01) on the same PC is not recommended and \r\nnot supported.\r\n\r\n2. CCMode A.03.01 has not been tested. Customers owning \r\nthis version must upgrade to A.03.02 even if the additional \r\nfeatures for preparative analysis are not needed.\r\n\r\n3. The combination CCmode A.03.0x, ChemStation A.08.0x and \r\nWindows 2000 has not been tested and is not supported.\r\n\r\n\r\n\r\nDate:\r\n3/11/02\r\n******************************************************************************\r\n\r\n*                              Information Only                             
*\r\n******************************************************************************\r\n*             Author/Entity: AG/B404                                         *\r\n*  Additional Information: None                                          
*\r\n******************************************************************************\r\n</pre>\n </body>\n</html>\n'

I define the regex pattern as a raw string in Python 3.6.4:

r = r'Supersedes:?[\\r\\n ]+[\w\-\s]+[\\r\\n ]+(.*)[\\r\\n ]+Serial Numbers?:?[ \\r\\n]+.*?[ \\n\\r]\*+[\\n\\r ]+\*([A-Za-z ]+)[ \\n\\r]\*+[\\n\\r]+.*?\*+[ \\n\\r]+.*?\*\s+(?:Author[:\w\/]+ ([\.\w\/\s�]+))'

I then use it to search the text:

a = re.search(r, raw_string, re.M|re.S)

This returns no matches:

a[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not subscriptable

However, the exact same string and regex do match on regex101:

https://regex101.com/r/qgJMbO/1

Can anyone tell me what the problem could be?

Edit:

The expected outcome is:

a[1] 'INB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\n'

a[2] ' Information Only '

a[3] 'AG/B404 '

David J.
  • Don't parse HTML using regex – anubhava Jul 12 '18 at 13:41
  • First things first, using `r""` you shouldn't double escape metacharacters. – revo Jul 12 '18 at 13:42
  • And besides, you are using a different string in the regex tester. [This is your string](http://rextester.com/DXYAM2715). – Wiktor Stribiżew Jul 12 '18 at 13:42
  • What is the expected outcome? – johnashu Jul 12 '18 at 13:43
  • Off-topic, but an extra tip for free: avoid one-letter variable names and aim for a minimum of 3 letters. – Chuk Ultima Jul 12 '18 at 13:44
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) –  Jul 12 '18 at 13:47
  • That question isn't the same, but it has the answer you're looking for: "don't parse HTML with regex!!" Use an HTML parser. –  Jul 12 '18 at 13:49
  • @anubhava I'm only interested in what's between the `<pre>` tags. That's not an xml/html string. – David J. Jul 12 '18 at 13:53
  • @johnashu I added the expected outcome to the question. – David J. Jul 12 '18 at 13:56
  • @JoshDetwiler How do you propose I parse the contents within the `<pre>` tags using an html parser? That won't work here. – David J. Jul 12 '18 at 13:59
  • @revo I changed it to use single quotes. Getting the same result. – David J. Jul 12 '18 at 14:01
  • @JoshDetwiler How is this a duplicate? Everything I'm interested in parsing is contained within the `<pre>` tags and is not written in xml. An xml parser won't help. – David J. Jul 12 '18 at 14:03
  • Did you read the comment I gave to it? It's not a duplicate question, but it has the answer you're looking for. HTML and XML are both DOM type structures. It's a famous answer, in fact. Give it a read and a chuckle. –  Jul 12 '18 at 14:04
  • @JoshDetwiler What's inside of the `<pre>` tag is neither xml nor html. It's just a string. I won't write out the entire contents, but basically it just goes on like `"Supersedes: None\r\n \r\nINB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\nSerial Numbers:\r\nUS00000000 - US99999999\r\n\r\nThe CCMode software is in general compatible with Windows 2000 and \r\nChemStation Revision A.9.01. Please see required settings!\r\n\r\nTo Be Performed By:\r\nAgilent-Qualified Personnel\r\n\r\nParts Required:\r\n\r\nNo." ` etc. How will an xml parser help me deal with this? – David J. Jul 12 '18 at 14:09
  • @JoshDetwiler Please explain how this is a duplicate of your link, where someone is actually trying to parse an html string without a parser. Otherwise, please remove the duplicate tag. – David J. Jul 12 '18 at 14:10
  • One, it's not marked duplicate. It's a flag. A moderator will decide whether to close this or not. Two, I marked it a duplicate because it's essentially asking for the same you want. You want to parse HTML for the contents of a particular tag and then further parse it. This question's answer is simply...don't use regex for the HTML part. *That* is a duplicate. Third, your princess is in another castle! See my answer below a few minutes ago...I explained why you *can* use regex after using an HTML parser. @johnashu's answer is the one you want... –  Jul 12 '18 at 14:14
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174889/discussion-between-josh-detwiler-and-david-j). –  Jul 12 '18 at 14:16
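
The double-escaping point raised in the comments can be checked in isolation. This is a minimal sketch, assuming a shortened sample string (`sample` is illustrative, not the full document). Inside a raw string, `[\\r\\n ]` is a character class holding a literal backslash, the letters `r` and `n`, and a space, so it does not match real carriage returns or line feeds. That would also be consistent with the pattern matching on regex101 if the text pasted there contains the two-character sequences `\r` and `\n` rather than actual line breaks.

```python
import re

# Shortened, illustrative sample with real CR/LF characters (not the full document).
sample = 'Supersedes: None\r\n \r\nINB22000 compatibility\r\n\r\nSerial Numbers:\r\nUS00000000'

# Raw string with doubled backslashes: the class holds '\', 'r', 'n' and ' '.
double_escaped = re.compile(r'compatibility[\\r\\n ]+Serial')
# Raw string with single backslashes: the class holds CR, LF and ' '.
single_escaped = re.compile(r'compatibility[\r\n ]+Serial')

match = double_escaped.search(sample)
print(match)  # no match object: real CR/LF are not in the class

match = single_escaped.search(sample)
print(match is not None)

# The same text with visible escapes, i.e. a backslash followed by a letter.
pasted = sample.replace('\r', '\\r').replace('\n', '\\n')
print(double_escaped.search(pasted) is not None)
```

Since `re.search` returns `None` when nothing matches, checking the result before indexing it avoids the `TypeError` shown in the question.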

1 Answer


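Here is a solution that uses both BeautifulSoup and `re`: BeautifulSoup extracts the text of the `<pre>` tag, and the regex is then applied to that text only.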

from bs4 import BeautifulSoup as bs4
import re

docstring = '<html xmlns:msdt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:mso="urn:schemas-microsoft-com:office:office">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   SN G2250-010\n  </title>\n  <!--[if gte mso 9]><xml>\n<mso:CustomDocumentProperties>\r\n<mso:Service_x0020_Note msdt:dt="string">SN</mso:Service_x0020_Note>\r\n<mso:Order msdt:dt="string">1493700.00000000</mso:Order>\r\n<mso:ContentType msdt:dt="string">Document</mso:ContentType>\r\n</mso:CustomDocumentProperties>\n</xml><![endif]-->\n </head>\n <link href="..\\..\\_format.css" rel="stylesheet" type="text/css"/>\n <body>\n  <table>\n   <tr>\n    <td>\n     <img border="0" src="SN_G2250_010//r1_logo1.gif"/>\n    </td>\n    <td align="left" width="178">\n     <img border="0" src="SN_G2250_010//r1_logo2.gif"/>\n    </td>\n    <td>\n     <div class="subtitle2">\n      <b>\n       <font color="red">\n        Life Sciences and Chemical Analysis Service Note\n       </font>\n      </b>\n     </div>\n    </td>\n   </tr>\n  </table>\n  <h2>\n   SERVICE NOTE G2250-010\n  </h2>\n  <pre>Supersedes: None\r\n \r\nINB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\nSerial Numbers:\r\nUS00000000 - US99999999\r\n\r\nThe CCMode software is in general compatible with Windows 2000 and \r\nChemStation Revision A.9.01. Please see required settings!\r\n\r\nTo Be Performed By:\r\nAgilent-Qualified Personnel\r\n\r\nParts Required:\r\n\r\nNone\r\n\r\nSituation:\r\nChanges of operating software to Windows 2000 and implementation\r\nof ChemStation Rev. A.9.01 required some testing of the CCMode \r\n\r\nsoftware INB22000 / INB22002 / INB22003 and INB22004 Rev. A.03.02.\r\n\r\nSolution/Action:\r\nBefore using the Micro-plate Sampling Software INB22000 / INB22002 \r\n/ INB22003 or INB22004 Rev. A.03.02 (CCMode)  on a PC with \r\nWindows 2000 a minor change in the "Control panel" must be made. \r\nIf this change is not made some icons in the user interface will \r\nnot be represented correctly. 
The functionality itself is not \r\ninfluenced:\r\n\r\nOpen "Settings", "Control Panel", "Display", "Appearance".\r\n\r\nGo to the "Scheme" and select the choice "Windows Classic". \r\nPress "OK" and close the "Control Panel" window.Required "Regional \r\nSettings" for both WIN NT and WIN2000\r\n\r\nIn order to run and edit parameters within CC-Mode your \r\nPC must be setup in this way:\r\n\r\n- Regional settings: English (United States)\r\n- Number format (default for English (United States)) \r\n  Decimal symbol  \'.\'\r\n- Number format (default for English (United States)) \r\n  Digit grouping symbol  \',\'\r\n\r\nNotes about using WIN2000:\r\n\r\n1. The installation and operation of CCMode (A.03.0x) and \r\nPurify SW (A.01.01) on the same PC is not recommended and \r\nnot supported.\r\n\r\n2. CCMode A.03.01 has not been tested. Customers owning \r\nthis version must upgrade to A.03.02 even if the additional \r\nfeatures for preparative analysis are not needed.\r\n\r\n3. The combination CCmode A.03.0x, ChemStation A.08.0x and \r\nWindows 2000 has not been tested and is not supported.\r\n\r\n\r\n\r\nDate:\r\n3/11/02\r\n******************************************************************************\r\n\r\n*                              Information Only   *\r\n******************************************************************************\r\n*             Author/Entity: AG/B404                                         *\r\n*  Additional Information: None                                          *\r\n******************************************************************************\r\n</pre>\n </body>\n</html>\n'


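# Parse the document and isolate the <pre> element.
soup = bs4(docstring, 'lxml')

description_source = soup.find('pre')

# Plain text inside <pre>; this is the only part the regex needs to see.
s = description_source.text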

r = 'Supersedes:?[\\r\\n ]+[\w\-\s]+[\\r\\n ]+(.*)[\\r\\n ]+Serial Numbers?:?[ \\r\\n]+.*?[ \\n\\r]\*+[\\n\\r ]+\*([A-Za-z ]+)[ \\n\\r]\*+[\\n\\r]+.*?\*+[ \\n\\r]+.*?\*\s+(?:Author[:\w\/]+ ([\.\w\/\s�]+))'

a = re.search(r, s, re.M|re.S)

# The description is the third CRLF-separated line of the <pre> text.
lines = s.split('\r\n')

print(lines[2])
print(a[2])
print(a[3])

Returns:

INB22000 compatibility with Windows 2000 and ChemStation A.9.01
                          Information Only  
AG/B404                                         
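
If BeautifulSoup or lxml is not available, the same two-step idea (isolate the `<pre>` text first, then run regexes on it) can be sketched with the standard library's `html.parser`. This is only a sketch under assumptions: `PreTextExtractor`, `extract_fields` and the shortened `html_doc` are hypothetical names and data for illustration, and the three small patterns are a simplification of the single large regex, not a drop-in replacement for it.

```python
import re
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text found inside <pre> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'pre':
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == 'pre' and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def extract_fields(html_doc):
    """Return (description, banner, author); a field is None if not found."""
    parser = PreTextExtractor()
    parser.feed(html_doc)
    parser.close()
    text = ''.join(parser.chunks)
    # One small pattern per field instead of one large pattern.
    title = re.search(
        r'Supersedes:[^\r\n]*[\r\n ]+([^\r\n]+?)\s*[\r\n]+Serial Numbers?:', text)
    banner = re.search(r'^\*\s+([A-Za-z ]+?)\s+\*\s*$', text, re.M)
    author = re.search(r'Author/Entity:\s*(\S+)', text)
    return tuple(m.group(1) if m else None for m in (title, banner, author))


# Shortened, illustrative document (not the full service note).
html_doc = (
    '<html><body><h2>SERVICE NOTE G2250-010</h2>'
    '<pre>Supersedes: None\r\n \r\n'
    'INB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\n'
    'Serial Numbers:\r\nUS00000000 - US99999999\r\n\r\n'
    'Date:\r\n3/11/02\r\n'
    '**********\r\n\r\n'
    '*      Information Only      *\r\n'
    '**********\r\n'
    '*   Author/Entity: AG/B404     *\r\n'
    '**********\r\n</pre></body></html>'
)

print(extract_fields(html_doc))
```

Splitting the work into one pattern per field keeps each regex short and lets a missing field show up as `None` on its own instead of making the whole search fail.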
johnashu
  • I'm not trying to get the entire contents of the `<pre>` tag. That would be easy enough. I'm interested in extracting certain sections of it that are not written in xml/html. – David J. Jul 12 '18 at 13:55
  • @DavidJ. Then use an HTML parser to get to the whole `<pre>` tag as suggested above and *then* use regular expressions to parse that string. Just use regex on search spaces that are appropriate. –  Jul 12 '18 at 14:02