Download HTML file and convert it to TXT

Question

I am writing a program in C#. Is there a way to open a site's URL and look for keywords in its text? For example, if my program gets the URL http://www.google.com and the keyword "gmail", it should return true. In short, I need a way to go to a URL, download the HTML file, and convert it to text so that I can search it for my keyword.

score 2 · Answer 1 · edited May 23 '17 at 12:16

It sounds like you want to remove all the HTML tags and then search the resulting text.

My first reaction was to use a Regular Expression:

String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);

I shamelessly took this from the question "Using C# regular expressions to remove HTML tags", which in turn suggests the HTML Agility Pack. That sounds like exactly what you're looking for.
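
As a rough sketch of the regex approach, assuming the page's HTML is already in a string: the helper below strips tags with the same pattern and then runs a case-insensitive search. The names TagStripper and ContainsKeyword are made up for illustration. It does not handle script or style content, comments, or HTML entities, so treat it as a demonstration rather than a robust converter.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagStripper
{
    // Hypothetical helper: removes anything that looks like a tag,
    // then searches the remaining text without regard to case.
    public static bool ContainsKeyword(string html, string keyword)
    {
        string text = Regex.Replace(html, @"<[^>]*>", String.Empty);
        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}

static class Program
{
    static void Main()
    {
        // Sample markup standing in for a downloaded page.
        string html = "<html><body><a href=\"/mail\">Gmail</a> Images</body></html>";

        Console.WriteLine(TagStripper.ContainsKeyword(html, "gmail")); // True
        Console.WriteLine(TagStripper.ContainsKeyword(html, "href"));  // False: attribute text is stripped with the tag
    }
}
```

Searching the stripped text rather than the raw markup avoids false positives from tag and attribute names.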

answered Aug 18 '11 at 17:33 by userx · edited May 23 '17 at 12:16 by Community

I want to know if there is a way to download an HTML file and convert it to a .txt file – yoni2 Aug 18 '11 at 17:37

AndyD273 · Answer 2 · 2011-08-19T15:12:13.000

In Visual Basic, this works:

Imports System
Imports System.IO
Imports System.Net

Function MakeRequest(ByVal url As String) As String
    Dim request As WebRequest = WebRequest.Create(url)
    ' If required by the server, set the credentials.
    request.Credentials = CredentialCache.DefaultCredentials
    ' Get the response. The Using blocks release the connection and streams when done.
    Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        ' Get the stream containing content returned by the server.
        Using dataStream As Stream = response.GetResponseStream()
            ' Open the stream using a StreamReader for easy access.
            Using reader As New StreamReader(dataStream)
                ' Read the whole response body and return it as a string.
                Return reader.ReadToEnd()
            End Using
        End Using
    End Using
End Function

Edit: For future reference, for others who find this page: you pass in a URL, and this function requests the page, reads all of its HTML, and returns it as a string. Then all you have to do is search that string for your text, or you could use a StreamWriter to save it to a text or HTML file if you wanted to.
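
Since the question asks about C#, here is a minimal sketch of the same flow in that language: download the page as a string, check it for a keyword, and optionally save a copy. It uses WebClient.DownloadString rather than WebRequest, which is a swapped-in alternative and not what the function above does. The class name PageSearch, the default URL and keyword, and the file name page.txt are placeholders. Note that WebClient is marked obsolete in recent .NET versions, and that this searches the raw markup, tags included.

```csharp
using System;
using System.IO;
using System.Net;

static class PageSearch
{
    // Downloads the raw HTML of a page as a string (requires network access).
    public static string Download(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Case-insensitive keyword check on the downloaded markup.
    public static bool Contains(string html, string keyword)
    {
        return html.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}

static class Program
{
    static void Main(string[] args)
    {
        // Placeholder defaults; pass a URL and keyword on the command line to override.
        string url = args.Length > 0 ? args[0] : "http://example.com";
        string keyword = args.Length > 1 ? args[1] : "example";

        string html = PageSearch.Download(url);
        Console.WriteLine(PageSearch.Contains(html, keyword));

        // Optionally keep a copy on disk with a StreamWriter.
        using (var writer = new StreamWriter("page.txt"))
        {
            writer.Write(html);
        }
    }
}
```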

score 1 · Accepted Answer · answered Aug 18 '11 at 17:35

You should be able to open the HTML file as-is. HTML files are plaintext, meaning that FileStream and StreamReader should be sufficient to read the file.

If you really want the file to be a .txt, you can simply save the file as filename.txt instead of filename.html when you download it.
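
A small sketch of that idea, assuming the page has already been saved to disk: read it back with a StreamReader and search the contents. SavedPage and FileContains are hypothetical names, and the temporary file only stands in for a downloaded page so the example runs offline.

```csharp
using System;
using System.IO;

static class SavedPage
{
    // Reads a saved page from disk and reports whether it mentions the keyword.
    public static bool FileContains(string path, string keyword)
    {
        using (var reader = new StreamReader(path))
        {
            string contents = reader.ReadToEnd();
            return contents.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
        }
    }
}

static class Program
{
    static void Main()
    {
        // Write a stand-in page to a temporary file; the extension does not matter.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "<html><body>Sign in to Gmail</body></html>");

        Console.WriteLine(SavedPage.FileContains(path, "gmail")); // True

        File.Delete(path);
    }
}
```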

answered Aug 18 '11 at 17:35 by asfallows

I think his problem is actually downloading the page, not converting it to text. Is there a function that does that? – atoMerz Aug 18 '11 at 17:36
And how can I download the HTML using a URL? – yoni2 Aug 18 '11 at 17:39
@yoni2: Take a look at this: http://stackoverflow.com/questions/599275/how-can-i-download-html-source-in-c – asfallows Aug 18 '11 at 19:54

score 0 · Answer 4 · edited Jan 24 '17 at 09:37

using (WebClient client = new WebClient()) 
{
   client.DownloadFile("http://example.com", @"D:\filename.txt");
}

answered Jan 24 '17 at 08:31 by Gokul · edited Jan 24 '17 at 09:37 by Izzy

score 0 · Answer 5 · edited May 23 '17 at 11:52

Do not use regular expressions to parse HTML; HTML is too complex for regular expressions to handle reliably. See the long discussion on SO about this:

RegEx match open tags except XHTML self-contained tags

Instead, use an existing HTML parser built for this purpose.

Here is another SO discussion where you can find the links you need:

Looking for C# HTML parser

It is also worth searching the web yourself.
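
As one illustration of the parser route, here is a minimal sketch using the HTML Agility Pack mentioned in another answer. It assumes the HtmlAgilityPack NuGet package is referenced; ParsedPage and ContainsKeyword are hypothetical names. InnerText returns the concatenated text of the document's nodes, which may include script or style content, so a real program might want to remove those nodes first.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class ParsedPage
{
    // Hypothetical helper: parses the markup, extracts its text, and searches it.
    public static bool ContainsKeyword(string html, string keyword)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Decode entities such as &amp; so the search sees plain text.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}

static class Program
{
    static void Main()
    {
        // Sample markup standing in for a downloaded page.
        string html = "<html><body><a href=\"/mail\">Gmail</a> Images</body></html>";
        Console.WriteLine(ParsedPage.ContainsKeyword(html, "gmail"));
    }
}
```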
