Convert HTML to Plain Text using C++

Question

I am writing a mail-parsing application that needs to convert HTML files to plain text. I have found some scripts that do this conversion, but I want to do the same thing in C++.

Please suggest any cross-platform, open-source C++ libraries for converting HTML to plain text.

Thanks in advance. Regards, Subbi

What do you mean by "convert"? HTML IS plain text! Do you want to strip the HTML tags? – neverlord, Mar 09 '10 at 15:04

score 3 · Answer 1 · answered Nov 27 '11 at 00:45

After trying out a few options, I think the easiest way to do this at large scale is to use elinks.

On Ubuntu:

sudo apt-get install elinks
elinks -dump a.html > a.txt
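
To call the same command from a C++ program, one option is to run it through a pipe and read its output. The following is only a minimal sketch: it assumes a POSIX system (for `popen`) and that `elinks` is on the `PATH`. The helper name `runAndCapture` is made up for illustration, and the command string goes to the shell unescaped, so the file name must come from a trusted source.

```cpp
#include <stdio.h>    // popen, pclose, fread (popen/pclose are POSIX, not standard C++)
#include <stdexcept>
#include <string>

// Hypothetical helper: runs `command` through the shell and returns everything
// it wrote to standard output. Throws if the pipe cannot be opened.
std::string runAndCapture(const std::string& command)
{
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe)
        throw std::runtime_error("popen failed for: " + command);

    std::string output;
    char buffer[4096];
    size_t n;
    // Read the child's stdout in chunks until end of stream.
    while ((n = fread(buffer, 1, sizeof buffer, pipe)) > 0)
        output.append(buffer, n);

    pclose(pipe);
    return output;
}
```

With this in place, `runAndCapture("elinks -dump a.html")` would return the same text that the shell command above redirects into a.txt.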

formatjam

score 1 · Answer 2 · answered Mar 09 '10 at 16:48

As obvious as it may sound, you can just keep all the text between > and <.
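
That idea can be sketched as a single pass that copies characters only while outside a tag. This is a minimal illustration, not a full HTML parser: the function name `stripTags` is made up here, and it assumes `<` and `>` never appear inside attribute values, comments, or script bodies.

```cpp
#include <string>

// Hypothetical helper: copies every character that lies outside <...>.
std::string stripTags(const std::string& html)
{
    std::string text;
    text.reserve(html.size());
    bool insideTag = false;
    for (char c : html) {
        if (c == '<')
            insideTag = true;            // start of a tag: stop copying
        else if (c == '>' && insideTag)
            insideTag = false;           // end of a tag: resume copying
        else if (!insideTag)
            text.push_back(c);           // ordinary text between tags
    }
    return text;
}
```

For example, `stripTags("<p>Hello <b>world</b></p>")` yields `Hello world`. An unterminated tag at the end of the input is simply dropped.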

Eugen Constantin Dinca

I think you have misinterpreted my question. My goal is to convert HTML to text. – Subbi Reddy K Mar 10 '10 at 04:45
@subbi: the HTML tags are enclosed between < and >, so stripping them will give you the text: everything between > and <. Of course I'm oversimplifying; you'll need to take care of a few special tags (i.e. – Eugen Constantin Dinca Mar 10 '10 at 05:58

Jeff T. · Answer 3 · 2018-01-31T07:35:22.507

Here is a C++ version for Windows, ported from @Ben Anderson's C# solution. Note that the code isn't very robust yet. Also, all leading and trailing newlines are trimmed.

#include <regex>
#include <string>

// The trimming method comes from https://stackoverflow.com/a/1798170/1613961
// Strips any leading and trailing characters found in `newline`.
std::wstring trim(const std::wstring& str, const std::wstring& newline = L"\r\n")
{
    const auto strBegin = str.find_first_not_of(newline);
    if (strBegin == std::wstring::npos)
        return L""; // no content

    const auto strEnd = str.find_last_not_of(newline);
    const auto strRange = strEnd - strBegin + 1;

    return str.substr(strBegin, strRange);
}

std::wstring HtmlToText(const std::wstring& htmlTxt)
{
    // Match everything from '<' up to the next '>', or to the end of input when the closing '>' is missing
    static const std::wregex stripFormatting(L"<[^>]*(>|$)");
    static const std::wregex nbsp(L"&nbsp;");

    const std::wstring s1 = std::regex_replace(htmlTxt, stripFormatting, L"");
    const std::wstring s2 = trim(s1);
    // Replace non-breaking-space entities with ordinary spaces
    return std::regex_replace(s2, nbsp, L" ");
}

Interesting, but it should also include HTML entity conversion (you know, stuff like `&lt;`) – xryl669, Apr 07 '20 at 16:30
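
The entity conversion this comment asks for could look roughly like the sketch below. It is an illustration under stated assumptions, not part of the answer's code: the function name `decodeEntities` is hypothetical, it works on narrow `std::string` input, it knows only six named entities, and it decodes numeric references only in the ASCII range. Everything else is copied through unchanged.

```cpp
#include <cstddef>
#include <cstdlib>   // std::strtoul
#include <map>
#include <string>

// Hypothetical helper: decodes a handful of named entities plus ASCII-range
// numeric references (&#65; or &#x41;). Anything else is copied through unchanged.
std::string decodeEntities(const std::string& in)
{
    static const std::map<std::string, char> named = {
        {"lt", '<'}, {"gt", '>'}, {"amp", '&'},
        {"quot", '"'}, {"apos", '\''}, {"nbsp", ' '},
    };

    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::size_t semi =
            (in[i] == '&') ? in.find(';', i + 1) : std::string::npos;
        // Entities are short; ignore a ';' that is far away or missing.
        if (semi != std::string::npos && semi - i <= 10) {
            const std::string name = in.substr(i + 1, semi - i - 1);
            const auto it = named.find(name);
            if (it != named.end()) {
                out.push_back(it->second);
                i = semi;            // skip past the ';'
                continue;
            }
            if (name.size() > 1 && name[0] == '#') {
                const bool hex = (name[1] == 'x' || name[1] == 'X');
                const char* digits = name.c_str() + (hex ? 2 : 1);
                char* end = nullptr;
                const unsigned long code = std::strtoul(digits, &end, hex ? 16 : 10);
                // Accept only fully parsed, non-empty, ASCII-range values.
                if (end != digits && *end == '\0' && code > 0 && code < 128) {
                    out.push_back(static_cast<char>(code));
                    i = semi;
                    continue;
                }
            }
        }
        out.push_back(in[i]);        // not a recognised entity: keep as-is
    }
    return out;
}
```

Decoding should run after the tags have been stripped, so that a decoded `<` is not mistaken for the start of a tag.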

score 0 · Accepted Answer · answered Mar 09 '10 at 15:19

Try using a regular expression to strip the HTML tags, then save the result as a text file. It is not simple, though. The DEELX regular expression engine is a helper class you can use for this.

lsalamon

Thank you for the link to DEELX. I don't always like having to deal with including Boost. – mfperzel Mar 09 '10 at 16:51

score 0 · Answer 5 · answered Mar 09 '10 at 15:26

Take a look at html2text. It is a command-line tool rather than a pure library, but it contains code that strips and converts HTML, so you should be able to reuse it.

Martin Wickman
