I have a web server based on TIdHTTPServer, built in Delphi Sydney. From a web page, I receive the following multipart/form-data POST stream:

-----------------------------16857441221270830881532229640 
Content-Disposition: form-data; name="d"

83AAAFUaVVs4Q07z
-----------------------------16857441221270830881532229640 
Content-Disposition: form-data; name="dir"

Upload
-----------------------------16857441221270830881532229640 
Content-Disposition: form-data; name="file_name"; filename="česká tečka.png"
Content-Type: image/png

PNG_DATA    
-----------------------------16857441221270830881532229640--

The problem is that the text parts are not received correctly. I read "Indy MIME decoding of Multipart/Form-Data Requests returns trailing CR/LF" and changed the transfer encoding to 8bit, which helps to receive the file correctly, but the received values are still wrong (dir should be Upload and the filename should be česká tečka.png):

d=83AAAFUaVVs4Q07z
dir=UploadW
??esk?? te??ka.png 75

To demonstrate the issue, I reduced my code to a console app (note that the MIME.txt file contains the same data as the POST stream above):

program MIMEMultiPartTest;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.Classes, System.SysUtils,
  IdGlobal, IdCoder, IdMessage, IdMessageCoder, IdGlobalProtocols, IdCoderMIME, IdMessageCoderMIME,
  IdCoderQuotedPrintable, IdCoderBinHex4;


procedure ProcessAttachmentPart(var Decoder: TIdMessageDecoder; var MsgEnd: Boolean);
var
  MS: TMemoryStream;
  Name: string;
  Value: string;
  NewDecoder: TIdMessageDecoder;
begin
  MS := TMemoryStream.Create;
  try
    // http://stackoverflow.com/questions/27257577/indy-mime-decoding-of-multipart-form-data-requests-returns-trailing-cr-lf
    TIdMessageDecoderMIME(Decoder).Headers.Values['Content-Transfer-Encoding'] := '8bit';
    TIdMessageDecoderMIME(Decoder).BodyEncoded := False;
    NewDecoder := Decoder.ReadBody(MS, MsgEnd);
    MS.Position := 0; // necessary?
    if Decoder.Filename <> EmptyStr then // it is an attachment
    begin
      try
        Writeln(Decoder.Filename + ' ' + IntToStr(MS.Size));
      except
        FreeAndNil(NewDecoder);
        Writeln('Error processing MIME');
      end;
    end
    else // it is a parameter
    begin
      Name := ExtractHeaderSubItem(Decoder.Headers.Text, 'name', QuoteHTTP);
      if Name <> EmptyStr then
      begin
        Value := string(PAnsiChar(MS.Memory));
        try
          Writeln(Name + '=' + Value);
        except
          FreeAndNil(NewDecoder);
          Writeln('Error processing MIME');
        end;
      end;
    end;
    Decoder.Free;
    Decoder := NewDecoder;
  finally
    MS.Free;
  end;
end;

function ProcessMultiPart(const ContentType: string; Stream: TStream): Boolean;
var
  Boundary: string;
  BoundaryStart: string;
  BoundaryEnd: string;
  Decoder: TIdMessageDecoder;
  Line: string;
  BoundaryFound: Boolean;
  IsStartBoundary: Boolean;
  MsgEnd: Boolean;
begin
  Result := False;
  Boundary := ExtractHeaderSubItem('multipart/form-data; boundary=---------------------------16857441221270830881532229640', 'boundary', QuoteHTTP);
  if Boundary <> EmptyStr then
  begin
    BoundaryStart := '--' + Boundary;
    BoundaryEnd := BoundaryStart + '--';
    Decoder := TIdMessageDecoderMIME.Create(nil);
    try
      TIdMessageDecoderMIME(Decoder).MIMEBoundary := Boundary;
      Decoder.SourceStream := Stream;
      Decoder.FreeSourceStream := False;
      BoundaryFound := False;
      IsStartBoundary := False;
      repeat
        Line := ReadLnFromStream(Stream, -1, True);
        if Line = BoundaryStart then
        begin
          BoundaryFound := True;
          IsStartBoundary := True;
        end
        else
        begin
          if Line = BoundaryEnd then
            BoundaryFound := True;
        end;
      until BoundaryFound;
      if BoundaryFound and IsStartBoundary then
      begin
        MsgEnd := False;
        repeat
          TIdMessageDecoderMIME(Decoder).MIMEBoundary := Boundary;
          Decoder.SourceStream := Stream;
          Decoder.FreeSourceStream := False;
          Decoder.ReadHeader;
          case Decoder.PartType of
            mcptText,
            mcptAttachment:
              begin
                ProcessAttachmentPart(Decoder, MsgEnd);
              end;
            mcptIgnore:
              begin
                Decoder.Free;
                Decoder := TIdMessageDecoderMIME.Create(nil);
              end;
            mcptEOF:
              begin
                Decoder.Free;
                MsgEnd := True;
              end;
          end;
        until (Decoder = nil) or MsgEnd;
        Result := True;
      end
    finally
      Decoder.Free;
    end;
  end;
end;

var
  Stream: TMemoryStream;
begin
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile('MIME.txt');
    ProcessMultiPart('multipart/form-data; boundary=---------------------------16857441221270830881532229640', Stream);
  finally
    Stream.Free;
  end;
  Readln;
end.

Could someone help me find what is wrong with my code? Thank you.

stepand76

2 Answers

2

Your call to ExtractHeaderSubItem() in ProcessMultiPart() is wrong: it needs to pass in the ContentType string parameter, not a hard-coded string literal.

Your call to ExtractHeaderSubItem() in ProcessAttachmentPart() is also wrong: it needs to pass in only the content of the Content-Disposition header, not the entire Headers.Text. ExtractHeaderSubItem() is designed to operate on only one header at a time.
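
A minimal sketch of that second fix, assuming Decoder.Headers.Values[] looks up a header by name the same way the question's own code uses it for Content-Transfer-Encoding. GetPartFieldName is a hypothetical helper name, not part of Indy:

```delphi
uses
  IdGlobalProtocols, IdMessageCoder;

// Hypothetical helper: returns the form field name of the current MIME part.
function GetPartFieldName(Decoder: TIdMessageDecoder): string;
begin
  // Pass only the value of the Content-Disposition header,
  // not the whole Headers.Text block.
  Result := ExtractHeaderSubItem(
    Decoder.Headers.Values['Content-Disposition'], 'name', QuoteHTTP);
end;
```

In ProcessAttachmentPart() this would replace the ExtractHeaderSubItem(Decoder.Headers.Text, ...) call, and in ProcessMultiPart() the boundary lookup would become ExtractHeaderSubItem(ContentType, 'boundary', QuoteHTTP).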

Regarding the dir MIME part, the reason the body data ends up as 'UploadW' instead of 'Upload' is that you are not taking MS.Size into account when assigning MS.Memory to your Value string. The TMemoryStream data is NOT null-terminated! So, you will need to use SetString() instead of the := operator, e.g.:

var
  Value: AnsiString;
...
SetString(Value, PAnsiChar(MS.Memory), MS.Size);

Regarding Decoder.FileName, that value is not affected by the Content-Transfer-Encoding header at all. MIME headers simply do not allow unencoded Unicode characters. Currently, Indy's MIME decoder supports RFC 2047-style encodings for Unicode characters in headers, per RFC 7578 Section 5.1.3, but your stream data is not using that format. It looks like your data is using raw UTF-8 octets [1] (which 5.1.3 also mentions as a possible encoding, but which the decoder does not currently look for). So, you may have to manually extract and decode the original filename yourself as needed.

If you know the filename will always be encoded as UTF-8, you could try setting Indy's global IdGlobal.GIdDefaultTextEncoding variable to encUTF8 (it defaults to encASCII), and then Decoder.FileName should be accurate. But that is a global setting, so it may have unwanted side effects elsewhere in Indy, depending on context and data.

So, I would suggest setting GIdDefaultTextEncoding to enc8Bit instead, so that unwanted side effects are minimized and Decoder.FileName contains the original raw bytes as-is (just extended to 16-bit chars). That way, you can recover the original filename bytes by passing Decoder.FileName as-is to IndyTextEncoding_8Bit.GetBytes(), and then decode them as needed (such as with IndyTextEncoding_UTF8.GetString(), after validating that the bytes are valid UTF-8).

1: However, ÄŤeská teÄŤka.png is not the correct UTF-8 form of česká tečka.png. It looks like that data may have been double-encoded, i.e. česká tečka.png was UTF-8 encoded, and then the resulting bytes were UTF-8 encoded again.
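
The enc8Bit workaround described above might look roughly like this. It is a sketch only: DecodeRawUtf8FileName is a hypothetical helper name, it assumes GIdDefaultTextEncoding was set to enc8Bit before the headers were read, and it skips the UTF-8 validation step that a real server should perform first:

```delphi
uses
  IdGlobal;

// Hypothetical helper: RawName is Decoder.FileName as read while
// GIdDefaultTextEncoding = enc8Bit, so each Char holds one raw header byte.
function DecodeRawUtf8FileName(const RawName: string): string;
var
  RawBytes: TIdBytes;
begin
  // Narrow the 16-bit chars back to the original bytes...
  RawBytes := IndyTextEncoding_8Bit.GetBytes(RawName);
  // ...and interpret those bytes as UTF-8 (assumed, not validated here).
  Result := IndyTextEncoding_UTF8.GetString(RawBytes);
end;

// Usage, assuming this ran once at startup before any decoding:
//   GIdDefaultTextEncoding := enc8Bit;
// and later, after Decoder.ReadHeader:
//   FileName := DecodeRawUtf8FileName(Decoder.FileName);
```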

Remy Lebeau
  • The stream data got malformed while I was editing my post here on SO. Sorry for that. I fixed it in the original post. The post stream comes from RequestInfo.PostStream.SaveToFile. Calling ExtractHeaderSubItem() in ProcessMultiPart() is just for demonstration. – stepand76 Nov 30 '21 at 21:31
  • When I use TIdHTTP + TIdMultiPartFormDataStream to upload the same file, it works, because it adds an encoding specification to all MIME parts and encodes them. But I also need to get it working when the request comes from a web browser. How do I handle it on the server? Or how do I change the website form to be compatible with the server? – stepand76 Nov 30 '21 at 21:49
  • I have updated my answer regarding the `dir` part data. There is a bug in your code, and I explain the fix. As for the filename, if a browser sends the filename in a format that Indy does not support yet (and I can't fathom why a web browser would double-UTF-encode the filename as you have shown), then you will just have to decode the filename manually, as I stated in my answer. – Remy Lebeau Dec 01 '21 at 00:29
  • Remy, thank you for fixing the dir value. As for the filename, in my opinion the data is UTF-8 encoded only once, not twice as you wrote. I can decode it, but it is quite complicated because I need to exclude the binary part (PNG_DATA in my case) from UTF-8 decoding, as it is not encoded. So I'm very close to processing the whole message manually without Indy classes... – stepand76 Dec 01 '21 at 08:05
  • I've updated my answer regarding the `FileName` decoding. – Remy Lebeau Dec 01 '21 at 16:50
  • Remy, thank you again. I can mark this as the answer even though it is a workaround. It would be great to have better MIME processing support in Indy, because this (uploading a form from a web page) is a standard case and many others may run into this issue. Thank you. – stepand76 Dec 01 '21 at 20:13
  • There is already a feature request in Indy's issue tracker for that: [#138 Update TIdHTTPServer to support "multipart/form-data" posts](https://github.com/IndySockets/Indy/issues/138) – Remy Lebeau Dec 01 '21 at 20:16
-1

Nowadays the filename parameter should only be added as a fallback, while filename* should be added to state clearly which text encoding the filename uses. Otherwise each client only guesses, which may go wrong.

  • RFC 5987 §3.2 defines the format of that filename* parameter:

    charset ' [ language ] ' value-chars

    ...whereas:

    charset can be UTF-8 or ISO-8859-1 or any MIME-charset

    ...and the language is optional.

  • RFC 6266 §4.3 defines that filename* should be used and comes up with examples in §5:

    Content-Disposition: attachment; filename="EURO rates"; filename*=utf-8''%e2%82%ac%20rates
    

Do you spot the asterisk *? Do you spot the text encoding utf-8? Do you spot the two apostrophes '', designating no further specified language (see RFC 5646 § 2.1)? And then come the octets according to the specified text encoding: either percent-encoded, or (if allowed) in plain ASCII.
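
As a rough illustration of that format, here is a sketch of decoding such a value in Delphi. DecodeExtValue is a hypothetical helper name, it handles only the UTF-8 charset, it ignores the language tag, and it does not check which unescaped characters are allowed:

```delphi
uses
  System.SysUtils;

// Hypothetical helper: decodes a value such as utf-8''%e2%82%ac%20rates.
// Only the UTF-8 charset is supported in this sketch.
function DecodeExtValue(const ExtValue: string): string;
var
  P1, P2, I, Count: Integer;
  Charset, Encoded: string;
  Bytes: TBytes;
begin
  // Layout: charset ' [ language ] ' value-chars
  P1 := Pos('''', ExtValue);
  P2 := Pos('''', ExtValue, P1 + 1);
  if (P1 = 0) or (P2 = 0) then
    raise EConvertError.Create('Missing apostrophe delimiters');
  Charset := Copy(ExtValue, 1, P1 - 1);
  if not SameText(Charset, 'UTF-8') then
    raise EConvertError.Create('Unsupported charset: ' + Charset);
  Encoded := Copy(ExtValue, P2 + 1, MaxInt);

  // Turn %XX escapes and plain ASCII characters into raw octets.
  SetLength(Bytes, Length(Encoded));
  Count := 0;
  I := 1;
  while I <= Length(Encoded) do
  begin
    if (Encoded[I] = '%') and (I + 2 <= Length(Encoded)) then
    begin
      Bytes[Count] := Byte(StrToInt('$' + Copy(Encoded, I + 1, 2)));
      Inc(I, 3);
    end
    else
    begin
      Bytes[Count] := Byte(Ord(Encoded[I]));
      Inc(I);
    end;
    Inc(Count);
  end;
  SetLength(Bytes, Count);

  // Interpret the octets in the declared charset.
  Result := TEncoding.UTF8.GetString(Bytes);
end;
```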

Other examples:

  • Content-Disposition: attachment; filename="green.jpg"; filename*=UTF-8''%e3%82%b0%e3%83%aa%e3%83%bc%e3%83%b3.jpg
    

    will present "green.jpg" on older web browsers and "グリーン.jpg" on compliant web browsers.

  • Content-Disposition: attachment; filename="Gruesse.txt"; filename*=ISO-8859-1''Gr%fc%dfe.txt
    

    will present "Gruesse.txt" on older web browsers and "Grüße.txt" on compliant web browsers.

  • Content-Disposition: attachment; filename="Hello.png"; filename*=Shift_JIS'en-US'Howdy.png; filename*=EUC-KR'de'Hallo.png
    

    will present "Hello.png" on older web browsers, and "Howdy.png" on compliant web browsers where the preferred language is set to American English, and "Hallo.png" on compliant ones with a preferred language of German (Deutsch). Note that the different text encodings are unbound to percent encoding as long as the octets are within the allowed range (and latin letters are, along with the dot).

In my experience, nobody cares about this nice feature: everybody just shoves UTF-8 into filename, which still violates the standard, no matter how many clients silently support it. Related: How to encode the filename parameter of Content-Disposition header in HTTP? and PHP: RFC-2231 How to encode UTF-8 String as Content-Disposition filename.

AmigoJack
  • [RFC 7578](https://datatracker.ietf.org/doc/html/rfc7578) forbids the use of `filename*` in `Content-Disposition` headers of `multipart/form-data`: "*NOTE: The encoding method described in [RFC5987], which would add a "filename\*" parameter to the Content-Disposition header field, MUST NOT be used.*" And RFC 6266 applies only to HTTP `Content-Disposition` headers, not to MIME headers. It even says so: "*Note: This document does not apply to Content-Disposition header fields appearing in payload bodies transmitted over HTTP, such as when using the media type "multipart/form-data" ([RFC2388]).*" – Remy Lebeau Dec 01 '21 at 00:38
  • That is news to me, thanks for pointing it out. What I describe was legal for a time span of ~5 years, though. Should I delete my answer? – AmigoJack Dec 01 '21 at 01:35