I'm trying to use the Indy HTTP server to find keywords within a web page for a proxy filter. I've set up a proxy and the HTTP server, which works with web browsers, but I'm struggling to find a keyword within the web page.

I've been trying to convert a memory stream to a string and search for a keyword within it, but maybe this is the wrong way to do it. I have limited experience with Delphi, so I'm slightly stuck.

If anyone could give me any pointers, that would be great.

Thanks.

EDIT: OK, I have added a function here, where 'Stream' is the memory stream from the HTTP server and 'What' is the keyword I'm searching for. It doesn't seem to work, though:

function FindInMemStream(Stream: TMemoryStream; What: String): Integer;
var
  bufBuffer, bufBuffer2: array[0..254] of Char;
  i: Integer;
begin
  filter.Form2.ListBox1.Items.Add('finding');
  What := 'train';
  Result := 0;
  i := 0;
  FillChar(bufBuffer, 255, #0);
  FillChar(bufBuffer2, 255, #0);
  StrPCopy(@bufBuffer2, What);
  Stream.Position := 0;
  while Stream.Position <> Stream.Size do
  begin
    Stream.Read(bufBuffer[0], Length(What));
    if CompareMem(@bufBuffer, @bufBuffer2, Length(What)) then
    begin
      filter.Form2.ListBox1.Items.Add(IntToStr(Stream.Position - Length(What)));
      Result := Stream.Position - Length(What);  // not 0 : it's found keyphrase
      Exit;
    end;
    i := i + 1;
    // filter.Form2.ListBox1.Items.Add(IntToStr(i));
    Stream.Seek(i, 0);
  end;
end;
Arioch 'The
user1365875
  • What is the *keyword* you're talking about? Is it part of a response header, or of the content? Could you describe it more in your question? – TLama Apr 16 '13 at 10:28
  • Just a certain keyword within the html; a search term. 'banking' for example. – user1365875 Apr 16 '13 at 10:30
  • Well, then you can treat that content as an HTML document, parse it with MSHTML for instance, and check whether that keyword is the value of a certain HTML tag (if that is the case). That's all you can do with that content (but even that is much safer than just checking whether the string is part of the content you received). – TLama Apr 16 '13 at 10:38
  • Ok, great thanks. Do you know of any tutorials/examples for parsing from http server? Thanks – user1365875 Apr 16 '13 at 10:53
  • I don't, since there are plenty of them. From a quick search on StackOverflow you can take a look e.g. at [`this example`](http://stackoverflow.com/a/14349613/960757). If you were more specific, as I asked before, you might get a more precise answer (not a comment ;-) – TLama Apr 16 '13 at 10:59
  • Ok, I've tried to add some context by adding a function above. Not sure if this will make sense? – user1365875 Apr 16 '13 at 12:17
  • When comparing text you should ensure that the stream and the buffer both have the same text encoding (i.e. the same byte representation). You assume your stream contains UTF16LE bytes, but the content may be in UTF8, Windows-1251 etc., so you will not be able to find a match even if the "match" visually exists. – iPath ツ Apr 16 '13 at 14:32
  • Also I don't see this line to be very useful: [filter.Form2.ListBox1.Items.Add('finding');] - the text "finding" will become visible AFTER your search routine has finished – iPath ツ Apr 16 '13 at 14:35
  • By chopping the stream into chunks of bytes you may lose an occurrence of the text. For example, you look for the string "12345678", and one read fills bufBuffer with "....12345" while the next fills it with "678....": you find nothing and miss it! Moreover, since you restricted the parameter to TMemoryStream, you have no need for a buffer at all; you can address the internal buffer of TMemoryStream directly. But that still would not help you, because of different charsets and possible compression (zipping) of pages. So use ready-made filtering proxy software like Proxomitron. Or at least a ready-made parser. – Arioch 'The Apr 16 '13 at 15:39
  • or at very least use Knuth–Morris–Pratt – Arioch 'The Apr 16 '13 at 15:39
  • Oh, and if you find your keyword, what will you do then? Maybe there is a 2nd occurrence of the keyword later in the page, and a 3rd one near the end. Will you stop searching after the 1st one? Or would you restart searching from the beginning of the stream, ending up with O(x^2) efficiency? – Arioch 'The Apr 16 '13 at 15:42
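
To illustrate the encoding point raised in the comments, here is a minimal sketch that decodes the whole stream once and then searches the resulting string, instead of comparing raw bytes chunk by chunk. It assumes Delphi 2009 or later (for `TEncoding`), an uncompressed response body, and that the caller already knows the charset (UTF-8 here, purely as an assumption; a real proxy would take it from the `Content-Type` header). `StreamToText` is a hypothetical helper name, not part of Indy or the RTL.

```pascal
program DecodeStreamDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: decodes the entire stream into a string.
// The encoding is an assumption supplied by the caller; it is not detected.
function StreamToText(Stream: TStream; Encoding: TEncoding): string;
var
  Bytes: TBytes;
begin
  Result := '';
  if Stream.Size = 0 then
    Exit;
  SetLength(Bytes, Stream.Size);
  Stream.Position := 0;
  // ReadBuffer raises an exception if fewer bytes than requested are read.
  Stream.ReadBuffer(Bytes[0], Length(Bytes));
  Result := Encoding.GetString(Bytes);
end;

var
  Page: TMemoryStream;
  Raw: TBytes;
begin
  Page := TMemoryStream.Create;
  try
    // Simulate a UTF-8 encoded, uncompressed response body.
    Raw := TEncoding.UTF8.GetBytes('<p>The train is late</p>');
    Page.WriteBuffer(Raw[0], Length(Raw));
    if Pos('train', StreamToText(Page, TEncoding.UTF8)) > 0 then
      Writeln('keyword found')
    else
      Writeln('keyword not found');
  finally
    Page.Free;
  end;
end.
```

Because the search runs on one decoded string, a keyword can no longer be split across two reads. This still matches text inside tags and attributes, which is why a real HTML parser remains the safer choice.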

1 Answer

There are libraries which can be used for HTML parsing, for example the (commercial) DIHtmlParser.

DIHtmlParser reads, extracts information from, and writes HTML, XHTML, and XML.

From its feature list:

  • Full Unicode support (UnicodeString or WideString, depending on Delphi version).
  • Reads and writes over 70 character sets natively (independent of the OS).
  • Operates on TStreams, memory buffers or strings.
  • Returns a single piece of HTML to the application at a time.

With such a library, the HTML content (visible text) can be extracted easily from the HTML response, and the remaining task to find the search term would become trivial.

I would not try to write my own HTML parser, but rather use an existing library.
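
Once a parser has produced the visible text as a string, the remaining search could look roughly like the sketch below. It does not use the DIHtmlParser API; the input string simply stands in for whatever text the parser returns. `CountOccurrences` is a hypothetical helper, and `PosEx` comes from the standard `StrUtils` unit. The comparison is case-insensitive and counts non-overlapping matches in a single pass.

```pascal
program FindAllDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils;

// Hypothetical helper: counts non-overlapping, case-insensitive
// occurrences of Keyword in Text in one left-to-right pass.
function CountOccurrences(const Text, Keyword: string): Integer;
var
  LowerText, LowerKey: string;
  Offset: Integer;
begin
  Result := 0;
  if Keyword = '' then
    Exit;
  LowerText := AnsiLowerCase(Text);
  LowerKey := AnsiLowerCase(Keyword);
  Offset := PosEx(LowerKey, LowerText, 1);
  while Offset > 0 do
  begin
    Inc(Result);
    // Continue after the current match rather than restarting from the top.
    Offset := PosEx(LowerKey, LowerText, Offset + Length(LowerKey));
  end;
end;

begin
  // The literal below stands in for text extracted by an HTML parser.
  Writeln(CountOccurrences('Banking news: online banking and BANKING apps',
    'banking'));
end.
```

Resuming the search just past each match keeps the scan linear in the length of the text, rather than restarting from the beginning after every hit.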

mjn