I'm building an RSS feed service and dealing with articles in an unusual format like the one below. I only want the content, not the XML or the styles and settings that come with it. I tried removing base64 images, stripping tags, and collapsing multiple spaces, but a lot of junk still remains. How do I sanitize the data so that I get just the plain text: "This is paragraph text long content, Another paragraph text long content"?

<p align="justify"><!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:View>Normal</w:View>
  <w:Zoom>0</w:Zoom>
  <w:TrackMoves></w:TrackMoves>
  <w:TrackFormatting></w:TrackFormatting>
  ...
  </xml><![endif]--><!--[if gte mso 9]><xml>
 <w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
  DefSemiHidden="true" DefQFormat="false" DefPriority="99"
  LatentStyleCount="267">
  <w:LsdException Locked="false" Priority="0" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Normal"></w:LsdException>
  <w:LsdException Locked="false" Priority="9" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="heading 1"></w:LsdException>
  <w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 2"></w:LsdException>
</xml><![endif]--><!--[if gte mso 10]>
<style>
 /* Style Definitions */
 table.MsoNormalTable
    {mso-style-name:"Table Normal";
    mso-tstyle-rowband-size:0;
    mso-tstyle-colband-size:0;
    mso-style-noshow:yes;
mso-bidi-theme-font:minor-bidi;}
</style>
<![endif]-->

<p class="MsoNormal" align="justify">**This is paragraph text long content**</p><p class="MsoNormal" align="justify"> </p><br>

<p class="MsoNormal" align="justify">**Another paragraph text long content**</p>
Angga Ari Wijaya
  • Hmm, I don't think so. I want to remove that XML and the unnecessary tags. I'm not fetching the data from XML; the data is messy because the article was produced by a WYSIWYG editor. I then want to build the summary by taking the first 160 characters of the article. – Angga Ari Wijaya Aug 15 '16 at 04:35
  • Oh, I found it: tools that can extract the content are covered in [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) – Angga Ari Wijaya Aug 15 '16 at 04:50

1 Answer

Part of my question was answered in [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php)

Messy, badly formed HTML content can be extracted with Simple HTML DOM Parser or a similar parsing tool.
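As a rough illustration of the parser-based approach, here is a minimal sketch that uses PHP's bundled DOM extension instead of Simple HTML DOM Parser. The function names `htmlToPlainText` and `summarize` are hypothetical, and it assumes the input is UTF-8 and that the `dom` and `mbstring` extensions are available. It drops comments (which is where the Word conditional blocks in the sample live), `<style>` and `<script>` elements, then collapses whitespace.

```php
<?php
declare(strict_types=1);

/**
 * Reduce Word-flavoured HTML to plain text (hypothetical helper).
 */
function htmlToPlainText(string $html): string
{
    $doc = new DOMDocument();
    // Messy markup triggers parser warnings; collect them instead of printing.
    $previous = libxml_use_internal_errors(true);
    // The meta tag tells libxml to read the input as UTF-8 (an assumption).
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Conditional blocks such as <!--[if gte mso 9]>...<![endif]--> are parsed
    // as ordinary comments, so removing comment nodes removes them too.
    foreach (iterator_to_array($xpath->query('//comment() | //style | //script')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Put a space after common block-level elements so that text from
    // adjacent paragraphs does not run together.
    foreach (iterator_to_array($xpath->query('//p | //br | //div | //li')) as $node) {
        $node->parentNode->insertBefore($doc->createTextNode(' '), $node->nextSibling);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body === null ? '' : $body->textContent;

    // Collapse runs of whitespace, including non-breaking spaces.
    $collapsed = preg_replace('/[\s\x{00A0}]+/u', ' ', $text);

    return trim($collapsed ?? $text);
}

/**
 * Take the first $limit characters (not bytes) as a summary.
 */
function summarize(string $text, int $limit = 160): string
{
    return mb_substr($text, 0, $limit, 'UTF-8');
}
```

Calling `summarize(htmlToPlainText($article))` would then give the 160-character summary mentioned in the comments. Note that the cut can land in the middle of a word; trimming back to the last space is left out to keep the sketch short.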

Thanks

Angga Ari Wijaya