How to read website content in C#?

Question

I want to read the text of a website without the HTML tags and headers. I just need the text that is displayed in the web browser.

I don't need it like this:

<html>
<body>
bla bla </td><td>
bla bla 
<body>
<html>

I just need the text "bla bla bla bla".

I have used the WebClient and HttpWebRequest methods to get the HTML content and then tried to split the received data, but that approach is not workable: if I switch to a different website, the tags may change.

So is there any way to get only the displayed text of a website programmatically?

I think you'll need an HTML parser and if you have control of page source, to add an id to the element you want to get, so to get it with a method like getElementById of the parser. — alfoks, May 14 '12 at 07:51

score 5 · Answer 1 · edited May 23 '17 at 12:02

You need to use a dedicated HTML parser. That is the only reliable way to extract content from a non-regular language such as HTML.

See: What is the best way to parse html in C#?

edited May 23 '17 at 12:02

Community


answered May 14 '12 at 07:48

Tigran

But this is one way, you can get what you are asking! – Writwick May 14 '12 at 08:04
@azeemAkram: using [HtmlAgilityPack](http://htmlagilitypack.codeplex.com/) you can get the values you're interested in. At the end this is a Parser. – Tigran May 14 '12 at 08:23

yamen · Accepted Answer · 2012-05-14T08:16:47.440

Here is how you would do it using the HtmlAgilityPack.

First your sample HTML:

var html = "<html>\r\n<body>\r\nbla bla </td><td>\r\nbla bla \r\n<body>\r\n<html>";

Load it up (as a string in this case):

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

If you are getting it from the web, it is similar (url is the address of the page):

var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load(url);

Now select only the text nodes that contain non-whitespace text, and trim them (the query needs using System.Linq;).

var text = doc.DocumentNode.Descendants()
              .Where(x => x.NodeType == HtmlNodeType.Text && x.InnerText.Trim().Length > 0)
              .Select(x => x.InnerText.Trim());

You can get this as a single joined string if you like:

String.Join(" ", text)

Of course this will only work for simple web pages. On anything more complex it will also return text nodes you clearly don't want, such as the contents of script and style elements (JavaScript functions, CSS rules, etc.).
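As a rough sketch of how that limitation could be reduced (not part of the original answer), the query can skip text nodes whose parent is a script or style element and decode HTML entities with HtmlEntity.DeEntitize. The class name VisibleTextDemo, the method name GetVisibleText, and the sample HTML are made up for illustration, and the HtmlAgilityPack NuGet package is assumed to be installed. Text hidden by CSS, comments, and similar cases are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be installed

public static class VisibleTextDemo
{
    // Hypothetical helper: returns the trimmed, non-empty text nodes of the
    // document, leaving out anything inside <script> or <style>.
    public static IEnumerable<string> GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // Skip text that belongs to <script> or <style> elements.
            .Where(n => n.ParentNode == null ||
                        (n.ParentNode.Name != "script" && n.ParentNode.Name != "style"))
            // Decode entities such as &amp; into their characters, then trim.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(s => s.Length > 0);
    }

    public static void Main()
    {
        // Hypothetical sample input with a style block and a script block.
        var html = "<html><head><style>p { color: red; }</style></head>" +
                   "<body><p>Tom &amp; Jerry</p><script>var x = 1;</script></body></html>";
        Console.WriteLine(string.Join(" ", GetVisibleText(html)));
    }
}
```

Filtering on the parent element keeps the query close to the one in the answer; a stricter version would probably walk up all ancestors rather than just the direct parent.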

:: how can i access the text index by index as i do with string array in loop like this 'for(i=0;i — Azeem Akram, May 14 '12 at 10:14
You can do it against `text` directly: `foreach (var index in text) { // do something with index }`. Alternatively, you can do a `text.ToArray();` and deal with it as an array. — yamen, May 14 '12 at 10:20
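A minimal sketch of the ToArray() route mentioned in the comment above, for anyone who wants index-based access. The array literal is only a stand-in for the `text` sequence from the answer, and IndexedTextDemo and WithIndexes are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IndexedTextDemo
{
    // Hypothetical helper: formats each item together with its index.
    public static List<string> WithIndexes(IEnumerable<string> text)
    {
        // ToArray() runs the query once and gives index-based access.
        string[] parts = text.ToArray();
        var lines = new List<string>();
        for (int i = 0; i < parts.Length; i++)
        {
            lines.Add(i + ": " + parts[i]);
        }
        return lines;
    }

    public static void Main()
    {
        // Stand-in for the `text` sequence built in the answer.
        IEnumerable<string> text = new[] { "bla bla", "bla bla" };
        foreach (string line in WithIndexes(text))
        {
            Console.WriteLine(line);
        }
    }
}
```

Calling ToArray() once also avoids re-running the LINQ query every time the sequence is enumerated.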

score 0 · Answer 3 · answered Jan 04 '14 at 15:40

// Requires: using System.Net;
public string GetwebContent(string urlForGet)
{
    // Create the WebClient and dispose of it when done
    using (var client = new WebClient())
    {
        // Download the page; note that this is the raw HTML, tags included
        return client.DownloadString(urlForGet);
    }
}

answered Jan 04 '14 at 15:40

user3059036

ductran · Answer 4 · 2012-05-14T14:08:33.810

-1

I think this link can help you.

/// <summary>
/// Remove HTML tags from string using char array.
/// </summary>
public static string StripTagsCharArray(string source)
{
    char[] array = new char[source.Length];
    int arrayIndex = 0;
    bool inside = false;

    for (int i = 0; i < source.Length; i++)
    {
        char let = source[i];
        if (let == '<')
        {
            inside = true;
            continue;
        }
        if (let == '>')
        {
            inside = false;
            continue;
        }
        if (!inside)
        {
            array[arrayIndex] = let;
            arrayIndex++;
        }
    }
    return new string(array, 0, arrayIndex);
}

edited May 14 '12 at 14:08

answered May 14 '12 at 08:09

ductran

regular expressions should not be used to parse HTML – crdx May 14 '12 at 11:37
The author gives you 3 methods. The last one (StripTagsCharArray) is the recommended one – ductran May 14 '12 at 14:07
How do you think this method will manage if it encounters an if statement within some embedded JavaScript like 'if x < 4'? The answer is: not very well. The correct answer is the one that suggests HtmlAgilityPack. – crdx May 14 '12 at 17:23
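To make that objection concrete, here is a small sketch that runs the same character-scanning logic as StripTagsCharArray (condensed, behavior unchanged) on a hypothetical page containing a script with a bare '<'. The class name StripTagsDemo and the sample markup are invented for illustration.

```csharp
using System;

public static class StripTagsDemo
{
    // Same algorithm as StripTagsCharArray in the answer, condensed.
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char let in source)
        {
            if (let == '<') { inside = true; continue; }
            if (let == '>') { inside = false; continue; }
            if (!inside) array[arrayIndex++] = let;
        }
        return new string(array, 0, arrayIndex);
    }

    public static void Main()
    {
        // Hypothetical input: a script whose code contains a bare '<'.
        string html = "<p>before</p><script>if (x < 4) { go(); }</script><p>after</p>";
        // Prints: beforeif (x after
        Console.WriteLine(StripTagsCharArray(html));
    }
}
```

Part of the script leaks into the result, and everything from the bare '<' up to the next '>' is dropped as if it were a tag, which is the kind of failure a real HTML parser avoids.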

score -2 · Answer 5 · answered May 14 '12 at 07:47

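// Reading web page content in a C# program
// Requires: using System; using System.IO; using System.Net;
// Specify the web page to read
WebRequest request = WebRequest.Create("http://aspspider.info/snallathambi/default.aspx");
// Get the response
WebResponse response = request.GetResponse();
// Read the stream from the response
StreamReader reader = new StreamReader(response.GetResponseStream());
// Read the first line, then up to 200 more. ReadLine() strips the line breaks
// and returns null at the end of the stream, so stop there.
// (reader.ReadToEnd() would read the whole response regardless of its length.)
string str = reader.ReadLine();
for (int i = 0; i < 200; i++)
{
    string line = reader.ReadLine();
    if (line == null) break;
    str += line;
}

Console.Write(str);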

answered May 14 '12 at 07:47

Jaiff

You cannot treat HTML like simple text or process it with regular expressions; it's **not** plain text or a regular language. – Tigran May 14 '12 at 07:49
@jaiff :: could you please elaborate on the last loop: why are you reading only up to 200 indexes? – Azeem Akram May 14 '12 at 08:03
