Pos() within utf8 string boundaries

Question

I'd like to have a Pos() adapted to be used specifying boundaries within the Source string, rather than have it perform the search in the entire data.

Let's say I have a string which is 100 chars long, I want to perform the Pos only between the 5th and 20th character of the (unicode/utf8) string.

The code should be adapted from the ASM fastcode implementation in delphi, and obviously avoid pre-copying the portion of the string to a temporal one, as the purpose is making it faster than that.

My scenario:

I have a string which is accessed many times, and each time, a portion of it is copied to another temporal string, then a Pos is performed on it. I want to avoid the intermediary copy every time, and rather perform the Pos within the boundaries I specify.

Edit: question edited after new one was deemed a duplicate.

I would still like a solution that expands on the current XE3 FastCode assembly implementation, as that would fit my goal here.

Are you tried the [`PosEx`](http://docwiki.embarcadero.com/Libraries/XE3/en/System.StrUtils.PosEx) function? — RRUZ, Jan 05 '13 at 05:43
Yes, PosEx offers specifying the starting position, but not the ending boundary. — hikari, Jan 05 '13 at 06:52
It should be easy for you to copy-paste the existing `Pos` function and alter it to suit your needs. Difficult for us to do because we would not be allowed to redistribute that source code! — Cosmin Prund, Jan 05 '13 at 08:30
If I would know asm yeah, then I wouldn't be asking here :) the current Pos in Delphi was just appropiated from FastCode (http://fastcode.sourceforge.net/) anyway as they often do (http://www.progdigy.com/?p=194), so it should be ok to redistribute. — hikari, Jan 05 '13 at 11:57
In your question it looks as you want both `Unicode` and `AnsiString` functions. If that is correct, note that the fastcode archives only has the `AnsiString` solution. Please clearify this in your question. I have an optimized purepascal solution for you if you want that. — LU RD, Jan 05 '13 at 14:56
I would need an ASM implementation. I need it to work mostly with unicode strings. — hikari, Jan 05 '13 at 15:25
Are you sure that only an asm implementation is good enough? My simple test show a performance gain of 20-40% compared to a copy/pos asm comparison. — LU RD, Jan 05 '13 at 15:42
Deleted the comment, but is it really protected code? Delphi just took it freely from FastCode, which license allowed redistribution. I can try your code if you'd like. I was testing PosStr from madshi's madStrings unit, (madshi.net), which offers boundaries, it's in purepascal and it was ~50% slower than using FC's ASM version while copying the string boundary for every call. — hikari, Jan 05 '13 at 16:01
Yes, the unicode part is protected. Rightfully or not I can't tell. — LU RD, Jan 05 '13 at 16:11
Can you tell us a bit more about the searches you're performing? How does the searched string look like? Are you changing the long string between searches? Is the searched-pattern always the same? Maybe there's a smarter way of doing the same. If you could figure out an alternative algorithm for fixing the problem you might get much grater performance improvements. Even those I can read (and somewhat write ASM) I'm no advocate of ASM, as it's difficult to write and maintain. — Cosmin Prund, Jan 05 '13 at 16:42
Added a comment below your answer with a somewhat similar scenario to what happens in my app. — hikari, Jan 05 '13 at 16:53
Mostly it goes like this: 1) My app scans a very long string (can be ~10MB), and defines certain start/end boundaries in an array of Integers. 2) I loop through all the array entries, each time loading a portion of that source string, whose boundaries are defined by the array of Integers. 3) I perform various search/copy operations on that copied portion of the string 4) Continue to next piece of string (copy portion by defined boundaries, do searches, repeat). — hikari, Jan 05 '13 at 16:57
I just looked over your PasteBin. You're simulating looking for over 1000 sub-strings in each string segment. If that's so, you should implement a LEXER, and it'll be 1000 times faster (ie: a speed factor equal to the number of strings you're looking for). Write a bit about this in the question and i'd even provide an answer. — Cosmin Prund, Jan 05 '13 at 17:03
A Lexer.. never heard about it, tell me more :) Basically, you could say the part in my program I'm trying to improve with this is a custom XML parser. I define a node name, f.e , scan through the entire string looking for and , and store each position in an array of record (with 2 integer values) that define the boundaries for each node. This first part is very fast, as I go through the string with PosEx updating the offset as it goes. Then I have a number of found Nodes; for each node I copy the corresponding substr, and then perform various Pos/Copy operations — hikari, Jan 05 '13 at 17:11
within that portion of the string I copied, corresponding to the requested node. Then next node is requested, I copy the portion defined by the next boundaries, pos/copy (subnodes). So my idea is that if I could skip copying the portion of the string for every LoadNode iteration and perform the Pos searches within those boundaries over the original string, it would speed things up. — hikari, Jan 05 '13 at 17:13
A properly designed XML parser would only go over the input string *once*, there's no need for a recursive implementation (the word `iteration` hints you've got a recursive implementation). Parsing time should be linear to the size of the string! If you hope to implement an application-specific fast XML parser you should look into parser technology: Lexers and Parsers. There are automated parser generators (I *really* like `GOLD Parser Generator`, take a look!). Make sure you look into existing implementations, only write your own if it's *application specific* and you desperately need it. — Cosmin Prund, Jan 05 '13 at 17:25
I don't think you fully understood how my parser works, but I'd need to expose the entire code I guess for it. I tried other parsers before making my own, they were all much slower. If you can point me to something you think will be faster please link me so I can check :) — hikari, Jan 05 '13 at 17:33
I really want to stress this a bit more: There's a lot of weirdness in the XML format, *only* write your own if you really need to. Don't do it for 40% speed improvement, it's not worth it; Next year's computers generators will be 40% faster and you'd be stuck with a non-standard XML parser and no other new features. — Cosmin Prund, Jan 05 '13 at 17:33
My parser is used with very specific data, it's not a generic thing that should be future-proof etc, and for this specific data it's a LOT faster than anything else I tried. By specific, I mean very simple in the sort of datablabla..... — hikari, Jan 05 '13 at 17:34
Can you link a real 10mb XML file? What application produces it? — Cosmin Prund, Jan 05 '13 at 17:47
Well I could a demo file for the testing, I'd rather not disclose the kind of data I'm parsing (not generated by me). brb — hikari, Jan 05 '13 at 17:48
I've looked at it. Can't say much about it... I already mentioned, you don't fix this kind of performance problem by finding a faster ASM implementation of your existing algorithm, you fix the algorithm and you get significant improvements that way. If you specifically ask a question about parsing *that* XML fast, you'll probably get some useful answers and comments. There's only so much I can do with a 450 char comment, I obviously can't post an answer with code here, because you're asking for a different thing here. — Cosmin Prund, Jan 05 '13 at 18:39
Ok, well I'll accept here and ask a separate question, I'd appreciate if you can contribute there as well, thanks! :) I'll link the new question here. — hikari, Jan 05 '13 at 18:44
http://stackoverflow.com/questions/14175188/delphi-faster-xml-parser http://stackoverflow.com/questions/14175173/delphi-pos-with-boundaries — hikari, Jan 05 '13 at 19:01

LU RD · Answer 1 · 2013-01-05T19:44:19.627

1

Here is an alternative that is not based on asm. It will also work on a 64-bit application.

function PosExUBound(const SubStr, Str: UnicodeString; Offset,EndPos: Integer): Integer; overload;
var
  I, LIterCnt, L, J: NativeInt;
  PSubStr, PS: PWideChar;
begin
  L := Length(SubStr);
  if (EndPos > Length(Str)) then
    EndPos := Length(Str);
  { Calculate the number of possible iterations. Not valid if Offset < 1. }

  LIterCnt := EndPos - Offset - L + 1;

  {- Only continue if the number of iterations is positive or zero (there is space to check) }
  if (Offset > 0) and (LIterCnt >= 0) and (L > 0) then
  begin
    PSubStr := PWideChar(SubStr);
    PS := PWideChar(Str);
    Inc(PS, Offset - 1);

    Dec(L);
    I := 0;
    J := L;
    repeat
      if PS[I + J] <> PSubStr[J] then
      begin
        Inc(I);
        J := L;
        Dec(LIterCnt);
        if (LIterCnt < 0)
          then Exit(0);
      end
      else
      if (J > 0) then
        Dec(J)
      else
        Exit(I + Offset);
    until false;
  end;

  Result := 0;
end;

I will leave it as an excercise to implement an AnsiString overloaded version.

BTW, the purepascal parts of the Pos() functions in XE3 are to put it mildly poorly written. See QC111103 Inefficient loop in Pos() for purepascal. Give it a vote if you like.

edited Jan 05 '13 at 19:44

answered Jan 05 '13 at 16:07

LU RD

34,438
5
88
296

Tried your function, it takes around ~260ms in my tests, while the ASM+copy way using Pos takes 180-185. – hikari Jan 05 '13 at 16:16
Are you sure you compiled the Pascal code with optimization on and debugging off? I really doubt the COPY+ASM implementation could possibly be 40% faster then this. And be aware that the ASM version is not portable: you'd need different versions for 32bit and 64bit. – Cosmin Prund Jan 05 '13 at 16:24
I've put together a little code to better simulate that part of code in my full application: http://pastebin.com/Xm8J1psh Test1 takes ~275ms in my computer, test2 takes around 400. Ignore the Perf lines, they link to another unit for time measurement with perf. counters. – hikari Jan 05 '13 at 16:49
1

Here's a PasteBin of mine: Compare's LU RD's algorithm against Pos-over-substring, and LURD's algoritm almost twice faster: http://pastebin.com/LXmJBn2Y – Cosmin Prund Jan 05 '13 at 16:58
In your PasteBin example you're chopping your 32 char sub-string and then you're simulating searching for 1000 different sub-strings in it. If that's true, stop looking for a `PosExBound` function because it will not be more effective: The creation of the substring is a tiny portion of the algorithm, it's not the bottleneck. You're not seeing improvements because you haven't correctly identified the bottleneck. – Cosmin Prund Jan 05 '13 at 17:06
Replied in the comments above, with a more detailed description of my code. – hikari Jan 05 '13 at 17:16
Back to your code, the 2nd test is faster, but that's because the substring is copied for each iteration. In my code, this doesn't happen all the time but just once per node, so the following Pos operations don't have to rebuild it until a different portion of the string is needed for more searches, which results in using Pos still being the faster option. – hikari Jan 05 '13 at 17:45
@hikari, not my code, LU RD's code; Only the test code is mine. The second iteration does what you're asking for in the question, and it's faster. It doesn't look that way in your code because you're testing for different things. Normally the COPYing of the code should provide a huge performance hit, but since you're COPYing the substring once and then you're doing 1000 searches on it, the penalty is distributed over 1000 iterations. It makes the copying almost irrelevant, since your bottleneck is the `Pos` call over the substring, not the creation of the substring itself. – Cosmin Prund Jan 05 '13 at 17:50
Still, performing a Pos directly on the original string without performing the previous copy (over thousands of times) would speed things up. I'd need only a modification of the ASM Pos code to treat the given Source string with a pre-defined length you give it, rather than using the full length of the string. – hikari Jan 05 '13 at 17:54
@hikari - Well, your question did not mention the part that the copied substrings could be reused. So in that case `Copy/Pos` will show a speed advantage. Maybe a good asm coder can help you, but I think he would have to be hired to do the job. – LU RD Jan 05 '13 at 17:54
1

My test-code shows that LU RD's answer is *perfect* for the *question you asked* because it's twice faster if you only search for one substring, and that's what you asked for (the `[...]then a Pos is performed[...]` part of your question. You're asking for ONE Pos, not 1000 Pos'es; If you asked for 1000 Pos'es I'm sure LU RD wouldn't have bothered to answer, because there's nothing to answer. The copy-ing of the substring has minimal overall impact, so even if you had the ASM function you're asking for, it's overall performance improvement would be minimal. – Cosmin Prund Jan 05 '13 at 17:56
I think you should accept this answer and ask a new, improved question. Don't forget to mention you're parsing XML, don't forget to provide a sample XML file (link it!) and explain why you think a generic parser is not what you need. – Cosmin Prund Jan 05 '13 at 17:58
The copy operations do have a significant impact. Consider the following code: http://pastebin.com/zATXAU6d the copy operations add around 70ms to the entire parsing process, in a very fast computer (i5 4Ghz). while 70ms isn't much, in a real scenario the data being handled is much larger and happens more often. I can't fully describe my application, but I do have a need to improve this process. If an operation takes 300ms every time, where it could take 200ms, it all sums up after a few times. – hikari Jan 05 '13 at 18:24
1

Let's do some math. You need to search for 1000 tiny strings into a 32 char substring. Approach one: You create the 32 char substring, then you use `Pos` 1000 times. Approach two: You use `PosExUBound` 1000 times. Total time for the first variant: `Time(Copy)+1000*Time(Pos)`; Total time for second variant: `1000*Time(PosExUBound)`; If you want the second variant to be faster then your first, then `Time(PosExUBound)<(Time(Pos)+Time(Copy)/1000)`. Of course it's possible, but it will be *marginal*. I bet a parser will be a 1000 times faster: it'll walk the 10Mb string *once*. It's your choice... – Cosmin Prund Jan 05 '13 at 18:48

Pos() within utf8 string boundaries

1 Answers1

Linked