3

I am writing a mail-parsing application that needs to convert HTML files to plain text. I have found some scripts that do this conversion, but I want to do the same thing in C++.

Please suggest any cross-platform, open-source C++ libraries for converting HTML to plain text.

Thanks in advance. Regards, Subbi

Subbi Reddy K

5 Answers

3

After trying out a few options, I think the easiest way to do this at scale is to use elinks.

On Ubuntu:

sudo apt-get install elinks
elinks -dump a.html > a.txt
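
Since the question asks for C++, here is a minimal sketch of calling that command from a program and capturing its output. It assumes a POSIX system (`popen`/`pclose`) with elinks on the PATH; `runAndCapture` and `htmlFileToText` are hypothetical helper names, not part of elinks.

```cpp
#include <stdio.h>    // popen, pclose (POSIX)
#include <stdexcept>
#include <string>

// Runs a shell command and returns everything it wrote to stdout.
// popen/pclose are POSIX; on Windows the equivalents are _popen/_pclose.
std::string runAndCapture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) throw std::runtime_error("popen failed");
    std::string output;
    char buffer[4096];
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, pipe)) > 0) {
        output.append(buffer, n);
    }
    pclose(pipe);
    return output;
}

// Hypothetical helper: dumps an HTML file as text via the elinks command above.
// The path is pasted into a shell command, so it must come from a trusted source.
std::string htmlFileToText(const std::string& path) {
    return runAndCapture("elinks -dump '" + path + "'");
}
```

Calling `htmlFileToText("a.html")` would then return the same text that the shell redirect writes to a.txt. This shells out to an external program rather than linking a library, so it is only as portable as elinks itself.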
formatjam
1

As obvious as it may sound, you can just keep all the text between > and <.
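
A minimal sketch of that idea, assuming every '<' opens a tag and the next '>' closes it. That assumption does not hold for comments, script or style bodies, or a '>' inside an attribute value, and entities are left untouched. `stripTags` is a hypothetical name.

```cpp
#include <string>

// Naive tag stripper: copies every character that lies outside <...>.
// Assumption: '<' always opens a tag and the next '>' closes it.
std::string stripTags(const std::string& html) {
    std::string text;
    text.reserve(html.size());
    bool inTag = false;
    for (char c : html) {
        if (c == '<') {
            inTag = true;            // start skipping
        } else if (c == '>' && inTag) {
            inTag = false;           // tag finished, resume copying
        } else if (!inTag) {
            text += c;               // ordinary text, including a stray '>'
        }
    }
    return text;
}
```

For example, `stripTags("<p>Hello <b>world</b></p>")` yields `Hello world`. An unclosed tag at the end of the input is dropped rather than copied.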

Eugen Constantin Dinca
  • I think you have misinterpreted my question. What I want is to convert HTML to text. – Subbi Reddy K Mar 10 '10 at 04:45
  • @subbi : the HTML tags are enclosed between < & > so stripping them will give you the text: everything between > & <. Of course I'm oversimplifying it, you'll need to take care of a few special tags (i.e. – Eugen Constantin Dinca Mar 10 '10 at 05:58
1

Here is a C++ version for Windows, ported from @Ben Anderson's C# solution. Note that the code isn't very robust yet. Also, all leading and trailing newlines are trimmed.

#include <regex>
#include <string>

// The trimming method comes from https://stackoverflow.com/a/1798170/1613961
// Strips leading and trailing characters found in `newline` (CR/LF by default).
std::wstring trim(const std::wstring& str, const std::wstring& newline = L"\r\n")
{
    const auto strBegin = str.find_first_not_of(newline);
    if (strBegin == std::wstring::npos)
        return L""; // no content

    const auto strEnd = str.find_last_not_of(newline);
    const auto strRange = strEnd - strBegin + 1;

    return str.substr(strBegin, strRange);
}

std::wstring HtmlToText(const std::wstring& htmlTxt) {

    // Match any characters between '<' and '>', even when the closing '>' is missing
    const std::wregex stripFormatting(L"<[^>]*(>|$)");

    std::wstring s1 = std::regex_replace(htmlTxt, stripFormatting, L"");
    std::wstring s2 = trim(s1);
    // Replace non-breaking space entities with a regular space
    std::wstring s3 = std::regex_replace(s2, std::wregex(L"&nbsp;"), L" ");
    return s3;
}
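
The function above replaces only `&nbsp;`. As a hedged sketch, a few more named entities could be decoded in one left-to-right pass so that already-decoded text is never decoded twice. `decodeEntities` and its five-entry table are assumptions for illustration; numeric references such as `&#160;` are not handled.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Replaces a handful of named HTML entities in a single left-to-right pass.
// Unknown or unterminated entities are copied through unchanged.
std::wstring decodeEntities(const std::wstring& in) {
    static const std::map<std::wstring, wchar_t> table = {
        {L"amp", L'&'}, {L"lt", L'<'}, {L"gt", L'>'},
        {L"quot", L'"'}, {L"nbsp", L' '},
    };
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == L'&') {
            const std::size_t semi = in.find(L';', i + 1);
            if (semi != std::wstring::npos) {
                // Look up the name between '&' and ';'
                const auto it = table.find(in.substr(i + 1, semi - i - 1));
                if (it != table.end()) {
                    out += it->second;
                    i = semi + 1;   // skip past the whole entity
                    continue;
                }
            }
        }
        out += in[i++];             // plain character or unrecognized '&'
    }
    return out;
}
```

It would be applied after the tags have been stripped, for example `decodeEntities(HtmlToText(html))`.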
Jeff T.
0

Try using a regular expression to strip the HTML tags, then save the result as a text file. It is not simple, though. You can use the DEELX regular expression engine as a helper class.

lsalamon
0

Take a look at html2text. It's a command-line tool rather than a pure library, but it contains code that strips and converts HTML, so you should be able to reuse it.

Martin Wickman