I am writing a program in C#. I need to know if there is a way to open the URL of a site and look for keywords in its text. For example, if my program gets the URL http://www.google.com and the keyword "gmail", it should return true. In short, I need to know if there is a way to go to a URL, download the HTML file, and convert it to text so I can look for my keyword.
5 Answers
It sounds like you want to remove all the HTML tags and then search the resulting text.
My first reaction was to use a Regular Expression:
String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);
Shamelessly stolen from: Using C# regular expressions to remove HTML tags
That question also suggests the HTML Agility Pack, which sounds like exactly what you're looking for.
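To tie the regex back to the original question, here is a minimal sketch of stripping the tags and then searching what is left. The class and method names (HtmlText, StripTags, ContainsKeyword) are made up for illustration, and the regex is the same naive one as above, so it will trip over things like script bodies or a ">" inside an attribute value.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Naive tag stripper: replaces anything between '<' and '>' with a space,
    // then decodes entities such as &amp; into plain characters.
    public static string StripTags(string html)
    {
        string withoutTags = Regex.Replace(html, @"<[^>]*>", " ");
        return WebUtility.HtmlDecode(withoutTags);
    }

    // Case-insensitive search of the visible text only (tags and attributes are gone).
    public static bool ContainsKeyword(string html, string keyword)
    {
        return StripTags(html).IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Because the tags are removed before the search, a keyword that only appears inside an attribute (an href, say) is not counted as a match.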
I am looking to know if there is a way to download an HTML file and convert it to a .txt file – yoni2 Aug 18 '11 at 17:37
In Visual Basic this works:
Imports System
Imports System.IO
Imports System.Net
Function MakeRequest(ByVal url As String) As String
    Dim request As WebRequest = WebRequest.Create(url)
    ' If required by the server, set the credentials.
    request.Credentials = CredentialCache.DefaultCredentials
    ' Get the response; the Using blocks close it and the reader when done.
    Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        ' Get the stream containing the content returned by the server.
        Using dataStream As Stream = response.GetResponseStream()
            ' Open the stream using a StreamReader for easy access.
            Using reader As New StreamReader(dataStream)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Using
End Function
Edit: For future reference for others who find this page: you pass in a URL, and this function goes to the page, reads all the HTML, and returns it as a string. Then all you have to do is search it for your text, or you could use a StreamWriter to save it to a text or HTML file if you wanted to.
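Since the question asks for C#, here is a rough equivalent as a sketch. It uses HttpClient rather than WebRequest, which is a swap on my part and not what the VB code above does; the names PageDownloader, DownloadHtmlAsync, DownloadToFileAsync and ContainsKeyword are made up for illustration.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageDownloader
{
    // One shared client for the whole program.
    private static readonly HttpClient Client = new HttpClient();

    // Downloads the page and returns its HTML source as a string.
    public static Task<string> DownloadHtmlAsync(string url)
    {
        return Client.GetStringAsync(url);
    }

    // Downloads the page and saves the source to a file, e.g. "page.txt".
    public static async Task DownloadToFileAsync(string url, string path)
    {
        string html = await DownloadHtmlAsync(url);
        File.WriteAllText(path, html);
    }

    // Case-insensitive search of the raw HTML (tags and attributes included).
    public static bool ContainsKeyword(string html, string keyword)
    {
        return html.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

From an async method it could be called like this:

```csharp
string html = await PageDownloader.DownloadHtmlAsync("http://www.google.com");
bool found = PageDownloader.ContainsKeyword(html, "gmail");
```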

You should be able to open the HTML file as-is. HTML files are plaintext, meaning that FileStream and StreamReader should be sufficient to read the file.
If you really want the file to be a .txt, you can simply save the file as filename.txt instead of filename.html when you download it.
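As a small sketch of that, assuming the page has already been saved to disk (the class name SavedPage and the method FileContains are made up for illustration):

```csharp
using System;
using System.IO;

public static class SavedPage
{
    // Reads a saved .html or .txt file as plain text and searches it for the keyword.
    public static bool FileContains(string path, string keyword)
    {
        using (var reader = new StreamReader(path))
        {
            string html = reader.ReadToEnd();
            return html.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
        }
    }
}
```

The file extension makes no difference here; the reader treats either one as text.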

I think his problem is actually downloading the page, not converting it to text. Is there a function to do that? – atoMerz Aug 18 '11 at 17:36
@yoni2: Take a look at this: http://stackoverflow.com/questions/599275/how-can-i-download-html-source-in-c – asfallows Aug 18 '11 at 19:54
Do not use regular expressions for parsing HTML, as HTML is too complex for regular expressions. Check out the long discussion on SO about this:
RegEx match open tags except XHTML self-contained tags
Use an existing HTML parser for this purpose instead.
Here is another discussion on SO where you can find the links you need
Also search the internet yourself.
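For illustration, here is a sketch using the HTML Agility Pack, one such parser (it comes from the HtmlAgilityPack NuGet package, which has to be installed separately). The class and method names (ParsedPage, ExtractText, ContainsKeyword) are made up, and dropping script and style elements is my own assumption about what should count as page text.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ParsedPage
{
    // Parses the HTML and returns the text content, without script/style bodies.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against that.
        var hidden = doc.DocumentNode.SelectNodes("//script|//style");
        if (hidden != null)
        {
            foreach (var node in hidden.ToList())
            {
                node.Remove();
            }
        }

        // InnerText still contains entities such as &amp;, so decode them.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static bool ContainsKeyword(string html, string keyword)
    {
        return ExtractText(html).IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Unlike a tag-stripping regex, the parser knows which parts of the document are elements, so JavaScript source and CSS do not end up in the text that gets searched.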