Say I have an HTML file generated from a Word (DOCX) document with the soffice --headless
command. I then ran the tidy
command to clean up the HTML by removing the unnecessary HTML/CSS cosmetics left over from Word.
In the result I see something like:
<p lang="en-US" class="western c31"></p>
<p lang="en-US" class="western c31"></p>
<p lang="en-US" class="western c31"></p>
<p lang="en-US" class="western c31"></p>
<p lang="en-US" class="western c31"></p>
<p lang="en-US" class="western c31"></p>
<p lang="en-US" class="western c31"></p>
<p lang="en-US" class="western c31"></p>
<p lang="en-US" class="western c31"></p>
... repeated 15 times
I ran these commands:
sed -e 's/<(.*?)><\/(.?)>//g' > ./hasil.html
sed -e 's/<[a-z] lang="(.*) class="western (.*?)><\/[a-z]>//g' > ./hasil.html
Neither works as expected: <p lang="en-US" class="western c31"></p>
is not removed from the HTML file.
I tried this link and this link, but neither helped.
Any help would be appreciated. Thank you.
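
For reference, here is a minimal sketch of the kind of removal described above. It assumes GNU or BSD sed and that each empty paragraph sits on its own line, as in the sample. The file name input.html and its contents are hypothetical stand-ins for the tidy output. It differs from the two attempts in three ways: it names an input file instead of reading stdin, it uses -E so that grouping and repetition work without backslashes, and it uses [^>]* in place of .*? because sed has no non-greedy matching.

```shell
# Hypothetical sample input standing in for the tidy output.
cat > input.html <<'EOF'
<p lang="en-US" class="western c31"></p>
<p lang="en-US" class="western c31">Keep me</p>
<p lang="en-US" class="western c31"></p>
EOF

# -E switches sed to extended regular expressions.
# [^>]* matches any attributes without running past the end of the tag.
# Only <p> elements with a class starting with "western" and no content are removed.
sed -E 's/<p [^>]*class="western[^"]*"[^>]*><\/p>//g' input.html > hasil.html
```

This leaves a blank line wherever an empty paragraph was removed. Paragraphs that contain text are left untouched.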