
I have a Content-Disposition header as such:

Content-Disposition: attachment; filename="övrigt.xlsx"; filename*=utf-8''%C3%B6vrigt.xlsx

According to the specs, the header can carry filename="filename.extension", filename*=charencoding''filename.extension, or both. When filename* is present, it should be used over filename.

So I want to catch the filename and the character encoding in the filename* attribute over the filename attribute when present. I ended up with this regex:

filename\*?=(?:([^'"]*)''|("))([^;]+)\2(?:[;`\n]|$)

It works fine; the only problem is that it matches whichever comes first, filename* or filename:

  1. attachment; filename*=utf-8''%C3%B6vrigt.xlsx; filename="övrigt.xlsx"

Matches:

Match 1
Full match  12-45   filename*=utf-8''%C3%B6vrigt.xlsx;
Group 1.    n/a     utf-8
Group 3.    n/a     %C3%B6vrigt.xlsx

  2. attachment; filename="övrigt.xlsx"; filename*=utf-8''%C3%B6vrigt.xlsx

Matches:

Match 1
Full match  12-35   filename="övrigt.xlsx";
Group 2.    n/a     "
Group 3.    n/a     övrigt.xlsx

Group 1 always matches character encoding when present.
Group 3 always matches the filename.

So I can now use the filename and decode it when Group 1 is not empty.
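
For illustration, here is a minimal sketch of how the groups described above could be consumed in JavaScript. The function name `parseFilename` is hypothetical, and the sketch assumes the declared charset is UTF-8, since `decodeURIComponent` only decodes UTF-8 percent-escapes. It uses the regex exactly as posted, so it still picks whichever attribute comes first.

```javascript
// Regex from the question: Group 1 = charset (filename* only), Group 3 = filename.
const pattern = /filename\*?=(?:([^'"]*)''|("))([^;]+)\2(?:[;`\n]|$)/;

// Hypothetical helper: returns { charset, filename } or null when nothing matches.
function parseFilename(header) {
  const match = pattern.exec(header);
  if (!match) return null;
  const charset = match[1]; // undefined when the plain filename= form matched
  const rawName = match[3];
  if (charset) {
    // Assumption: the charset is UTF-8; other charsets would need a real decoder.
    return { charset, filename: decodeURIComponent(rawName) };
  }
  return { charset: null, filename: rawName };
}

console.log(parseFilename(`attachment; filename*=utf-8''%C3%B6vrigt.xlsx; filename="övrigt.xlsx"`));
```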


So to get to the question:

As I understand it, the \*? should greedily try to match filename with the * (see reference here):

The question mark is the first metacharacter introduced by this tutorial that is greedy. The question mark gives the regex engine two choices: try to match the part the question mark applies to, or do not try to match it. The engine always tries to match that part. Only if this causes the entire regular expression to fail, will the engine try ignoring the part the question mark applies to.

Why does it not work as expected? What am I doing wrong? How can I match filename*= over filename= when it is present?

Wilt
  • Try using `(?:.*filename\*|filename)=` instead of `filename\*?=`. I assume you will always have a single match, be it `filename` or `filename*`. Is it JavaScript? – Wiktor Stribiżew Dec 03 '20 at 13:18
  • @WiktorStribiżew Yep Javascript, so no lookaheads :( – Wilt Dec 03 '20 at 13:20
  • Check https://regex101.com/r/l5zTvl/1/. BTW, lookaheads have been supported for years in JS regex. Lookbehinds are now supported in the majority of JS environments. – Wiktor Stribiżew Dec 03 '20 at 13:28
  • @Wilt Lookbehinds are the ones that are not universally supported. – MonkeyZeus Dec 03 '20 at 13:30
  • Oops, my bad... didn't want to start a discussion about that. – Wilt Dec 03 '20 at 14:25
  • So, does ``(?:.*filename\*|filename)=(?:([^'"]*)''|("))([^;]+)\2(?:[;`\n]|$)`` solve the issue? – Wiktor Stribiżew Dec 03 '20 at 14:48
  • I haven't had time to check thoroughly yet, I will get back to you. Thank you for your time and effort! – Wilt Dec 03 '20 at 16:08
  • @WiktorStribiżew Highly appreciated, and yes it does. But I am not sure why exactly it does work. How does this force `filename*` over `filename`, is it because of the order in the group with options? Would you be so kind to shortly explain what the logical thought behind this solution is in an answer. Then I can accept it also and close the question. – Wilt Dec 03 '20 at 19:06

1 Answer

Assuming only a single match is expected, and that it should be the last filename* occurrence when there is one, you can use

(?:.*filename\*|filename)=(?:([^'"]*)''|("))([^;]+)\2(?:[;`\n]|$)

See the regex demo.

The part I modified is the one before =. The part after = might also need adjusting, but that is not the point here.

The (?:.*filename\*|filename) non-capturing group contains two alternatives:

  • .*filename\* - any zero or more chars other than line break chars, as many as possible, and then filename* substring
  • | - or
  • filename - just a filename substring.

Why it works:

  • The regex engine starts parsing the string from left to right
  • The non-capturing group pattern is triggered and the first alternative is tried
  • .*filename\* will match if there is filename* anywhere to the right of the current location
  • If there is no filename*, the second alternative, filename, is tried at each location in the string and gets matched once found. Otherwise, there are no matches at all.
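
The steps above can be checked with a short JavaScript sketch that runs the original and the modified pattern against both attribute orders. The helper name `extract` is made up for this illustration. It also hints at why `filename\*?=` did not help: a regex search returns the leftmost match, and the greediness of `?` only decides what happens at a given starting position, so a plain filename= that appears earlier in the string wins.

```javascript
const original = /filename\*?=(?:([^'"]*)''|("))([^;]+)\2(?:[;`\n]|$)/;
const modified = /(?:.*filename\*|filename)=(?:([^'"]*)''|("))([^;]+)\2(?:[;`\n]|$)/;

const headers = [
  `attachment; filename*=utf-8''%C3%B6vrigt.xlsx; filename="övrigt.xlsx"`,
  `attachment; filename="övrigt.xlsx"; filename*=utf-8''%C3%B6vrigt.xlsx`,
  `attachment; filename="övrigt.xlsx"`,
];

// Hypothetical helper: returns [charset, rawFilename] for the first match, or null.
function extract(regex, header) {
  const match = regex.exec(header);
  return match ? [match[1], match[3]] : null;
}

for (const header of headers) {
  console.log(header);
  console.log('  original:', extract(original, header));
  console.log('  modified:', extract(modified, header));
}
```

With the second header, the original pattern stops at the quoted filename, while the modified one reaches filename* because `.*` first runs to the end of the line and then backtracks to the last filename* it can find.
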
Wiktor Stribiżew
  • After your "note that the part after = might also need adjusting" I have to ask of course how you would suggest improving that part. I will buy you a coffee :P – Wilt Dec 03 '20 at 20:13
  • @Wilt I will update the answer as soon as I can. I understand you are using JavaScript and you want to extract the values, but I am just not sure of what exact requirements you have for the part after `=`. I have already suggested [this regex](https://regex101.com/r/l5zTvl/1/) in the comments and you can see the `(?:([^\s"]*)|"([^"]*)")` part either captures any chars other than whitespace and `"` into Group 2 or any text inside double quotation marks into Group 3. I see you also want to handle the cases with single quotation marks, so that can be improved further. – Wiktor Stribiżew Dec 03 '20 at 21:02
  • Take your time :) I already appreciate that you will have another look. Just do it when you feel like it! – Wilt Dec 03 '20 at 21:05
  • @Wilt I would probably use [this](https://jsfiddle.net/wiktor_stribizew/ft6xu3mp/). The [regex](https://regex101.com/r/mmloYp/1) is a bit more linear here, and you do not have to rely on the ECMAScript-specific behavior to initialize all groups with empty strings when they do not participate in the match. – Wiktor Stribiżew Dec 03 '20 at 23:10
  • Thanks! I added some test cases in a spec file and they all pass nicely. Have a great weekend! – Wilt Dec 04 '20 at 12:36