First, the IDE I am using is Visual C# with the .NET Framework.

I have about 20,000 HTML documents containing information that I need to extract and sort into date order.

The date in each file is stored within this HTML tag:

<td valign="top" class="createdate">
        Tuesday, 03 April 2012 20:39    
</td>

Note: all of the dates are in that format within each HTML file.

I want to extract the date, then automatically read through each HTML document and count the occurrences of a phrase or word.

I am not asking anyone to write the entire program for me, but if you could provide as much detail as possible on how I could go through these 20,000 HTML files, extract the date and the number of occurrences of a word or phrase, and then export that information to Word or Excel, I would be very grateful.

I am using the data as research for my dissertation. I know how to do string manipulation and am familiar with the string methods, such as finding the occurrences of a word.

The problem I am having is: how do I get the HTML data, or maybe just the content, and then sort it into a usable format? Thank you.

ree
  • HTML Agility Pack is great for parsing HTML: http://htmlagilitypack.codeplex.com/ – greg84 Sep 15 '12 at 12:58
  • Are the documents XHTML documents? If so, you could interpret the files as XML files and use an XQuery to extract the date. It could then be used to rename the file to contain the date or something. If the documents are not well-formed XML, you could build a DOM from the documents and query that. – Jost Sep 15 '12 at 12:59
  • Try breaking the problem up into parts, write some code for each part, then ask again here if you get stuck, with a code example and what you have tried. You will get much better answers this way; your question is too broad for the format of SO. You have already got a couple of answers that hint at the individual parts of a solution. – driis Sep 15 '12 at 13:05
  • Context is OK, but you need to have a specific question. Lead with the last paragraph and have one question with a question mark. I don't think this link should be your primary approach, but it might be a tool: http://stackoverflow.com/questions/10334984/regex-for-all-strings-that-pass-net-datetime-parse-culture-en-us – paparazzo Sep 15 '12 at 13:40

1 Answer

Are you sure that all the HTML documents have that exact format? If so, the string containing the date can be extracted with simple string operations or with a regular expression. (Side note: in general, regular expressions are not suited to parsing HTML, but for this use case, keeping it simple sounds like the way to go.) If you need to do heavier parsing, consider HtmlAgilityPack.
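A minimal sketch of the regex route, assuming every file really does contain the `<td ... class="createdate">` cell shown in the question, with plain text and no nested tags inside it. The names `DateExtractor` and `ExtractDateText` are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DateExtractor
{
    // Matches the createdate cell from the question and captures its inner text.
    // Assumption: the cell holds plain text only (no nested tags).
    private static readonly Regex CreateDateCell = new Regex(
        @"<td[^>]*\bclass=""createdate""[^>]*>(?<date>[^<]*)</td>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the trimmed date text, or null when the cell is not found.
    public static string ExtractDateText(string html)
    {
        Match match = CreateDateCell.Match(html);
        return match.Success ? match.Groups["date"].Value.Trim() : null;
    }
}
```

You would call it as `DateExtractor.ExtractDateText(File.ReadAllText(path))`. If some files turn out to deviate from the pattern, the `null` result tells you which ones need a closer look.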

Then use DateTime.TryParse to convert the date string into a DateTime object.
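To tie the pieces together, here is a rough sketch under a few assumptions. It uses DateTime.TryParseExact instead of plain TryParse, so that the format from the question is pinned down regardless of the machine's culture settings. It counts occurrences in the raw file text (tags included), and it writes a CSV file, which Excel opens directly, rather than a native Word or Excel document. `HtmlReport`, its method names, and the `extractDateText` parameter are all hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public static class HtmlReport
{
    // Format of "Tuesday, 03 April 2012 20:39" as shown in the question.
    private const string DateFormat = "dddd, dd MMMM yyyy HH:mm";

    // Returns null when the text does not match the expected format.
    public static DateTime? ParseCreateDate(string dateText)
    {
        DateTime parsed;
        bool ok = DateTime.TryParseExact(
            dateText.Trim(), DateFormat, CultureInfo.InvariantCulture,
            DateTimeStyles.None, out parsed);
        return ok ? parsed : (DateTime?)null;
    }

    // Counts non-overlapping, case-insensitive occurrences of phrase in text.
    public static int CountOccurrences(string text, string phrase)
    {
        if (string.IsNullOrEmpty(phrase)) return 0;
        int count = 0;
        int index = 0;
        while ((index = text.IndexOf(phrase, index, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            index += phrase.Length;
        }
        return count;
    }

    // Writes "date,count" rows in date order to a CSV file.
    // extractDateText is whatever routine pulls the date string out of the HTML.
    public static void WriteCsv(string folder, string phrase, string csvPath,
                                Func<string, string> extractDateText)
    {
        var rows = new List<KeyValuePair<DateTime, int>>();
        foreach (string file in Directory.EnumerateFiles(folder, "*.html"))
        {
            string html = File.ReadAllText(file);
            DateTime? date = ParseCreateDate(extractDateText(html) ?? "");
            if (date == null) continue; // skip files without a usable date
            rows.Add(new KeyValuePair<DateTime, int>(
                date.Value, CountOccurrences(html, phrase)));
        }

        var lines = new List<string> { "date,count" };
        lines.AddRange(rows
            .OrderBy(r => r.Key)
            .Select(r => r.Key.ToString("yyyy-MM-dd HH:mm", CultureInfo.InvariantCulture)
                         + "," + r.Value));
        File.WriteAllLines(csvPath, lines);
    }
}
```

Reading 20,000 small files one at a time like this should be fine for a one-off research run. If the tags skew your counts, strip the markup (for example with HtmlAgilityPack) before calling `CountOccurrences`.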

driis