I have a URL (https://example.com?&iframeLoad=true&firstName=&lastName=&email=&phone1=&address=&zipcode=07307&isAvailableReferral=true&isAvailableDirect=false)

I am trying to replace the firstName, lastName, email, phone1, and address fields, but not the others.

This is what I am currently doing using regex (&?(firstName|lastName|email|phone1|address)=?[^&]*)

This selects an optional "&", then one of firstName|lastName|email|phone1|address, then every character after the "=" up to the next "&". Note that the value match stops at the first "&" after the "=".

I am able to select every field correctly, but when a value contains a "&" after the "=", my solution fails because it selects the value only up to that "&" character.

A valid email address can contain a "&", so I need a regex that still selects the whole value when there is a "&" after the "=".

Example: &email=abc&xyz@.com. In this case, the regex selects only "&email=abc" and not the entire email.
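
The behavior described in the question can be reproduced with a short sketch. The global flag and the shortened test URL are assumptions added here for illustration; the pattern itself is the one from the question.

```javascript
// Pattern from the question; the "g" flag is an assumption so that
// String.prototype.match returns every match.
const original = /&?(firstName|lastName|email|phone1|address)=?[^&]*/g;

// Shortened, hypothetical URL whose email value contains a raw "&".
const url = "https://example.com?&iframeLoad=true&email=abc&xyz@.com&zipcode=07307";

const matches = url.match(original);
console.log(matches); // a single match: "&email=abc"
```

The value class `[^&]*` cannot cross an ampersand, so the rest of the address (`&xyz@.com`) is left out of the match.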

PrasadPatil

2 Answers

Depending on how the URLs are encoded, this task may be impossible to accomplish unambiguously. For it to be possible, the URLs in the dataset must be standardized so that every parameter name is followed by an equal sign, and there must be no stray equal signs in the parameter values. If both of these conditions hold, the following will work:

The regular expressions

&(firstName|lastName|email|phone1|address)=([^&]*(?:&[^&=]+(?=&|$))*)

Also note that this regular expression does not cover the case where one of the desired parameters is the first parameter. Because JavaScript regex is limited, and this is a special case anyway (the parameter follows ? instead of &), it needs to be handled separately, depending on what you want to do with the parameters. Matching the following and replacing it with ? is one way to remove the parameter:

\?(firstName|lastName|email|phone1|address)=([^&]*(?:&[^&=]+(?=&|$))*)(?:&|$)

If you aren't planning on completely removing the parameter, the (?:&|$) at the end of the expression can be removed for simplicity.
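
As a rough sketch of the removal case, the two expressions might be combined as follows. The helper name `removeParams`, the order of the two passes, and the test URL are assumptions made for illustration, not part of the original answer. The & form runs first, so that at most one leading target parameter is left for the ? form.

```javascript
// Shared pieces of the two expressions from the answer.
const names = "firstName|lastName|email|phone1|address";
const value = "([^&]*(?:&[^&=]+(?=&|$))*)";

// Parameters introduced by "&" (global, since there may be several).
const following = new RegExp(`&(${names})=${value}`, "g");
// A parameter introduced by "?", together with the separator after it.
const leading = new RegExp(`\\?(${names})=${value}(?:&|$)`);

// Hypothetical helper: removes the target parameters entirely.
function removeParams(url) {
  return url.replace(following, "").replace(leading, "?");
}

// Hypothetical URL where a target parameter comes first.
const cleaned = removeParams(
  "https://example.com?email=abc&xyz@.com&zipcode=07307&firstName=Bob"
);
console.log(cleaned); // https://example.com?zipcode=07307
```

If the only parameters in the URL are target parameters, this sketch leaves a trailing ? behind, which may or may not matter for your use case.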

Depending on what you plan on replacing the parameters with, you may find it useful to tweak the expressions, but these should generally give the desired output within the above rules.

How it works

The trick here is to have a separate non-capturing group (?:&[^&=]+(?=&|$))* that handles additional parts of the parameter string with raw ampersands but no equal sign. The character class [^&=]+ ensures that the subexpression doesn't have ampersands or equal signs, and the lookahead (?=&|$) ensures that the string is followed by another parameter or the end of the string, not an equal sign. The whole group has a quantifier *, since it can appear zero, one, or multiple times after the initial parameter.
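
To see how the continuation group behaves, here is a small sketch that runs the expression against a few hypothetical query fragments. The helper `valueOf` and the inputs are made up for illustration.

```javascript
const pattern = /&(firstName|lastName|email|phone1|address)=([^&]*(?:&[^&=]+(?=&|$))*)/;

// Hypothetical helper: returns capturing group 2 (the value), or null.
function valueOf(query) {
  const m = pattern.exec(query);
  return m ? m[2] : null;
}

// "&zipcode" is followed by "=", so it starts a new parameter.
console.log(valueOf("&email=abc&zipcode=07307"));          // abc
// "&xyz@.com" is followed by "&", so it stays part of the value.
console.log(valueOf("&email=abc&xyz@.com&zipcode=07307")); // abc&xyz@.com
// The group can repeat, and the last piece may end the string.
console.log(valueOf("&email=a&b&c"));                      // a&b&c
// A stray "=" in the value cuts the match short.
console.log(valueOf("&email=a&b=c"));                      // a
```

The last call shows the limitation stated at the top of the answer: an equal sign inside a value is indistinguishable from the start of a new parameter.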

Also note that, for convenience, the parameter name and value are stored in capturing groups 1 and 2 for easy access and parsing. If you aren't planning on using them, the capturing groups can be turned into non-capturing groups by adding ?: after the opening (.
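
One way the two capturing groups might be used for replacement is with a callback passed to `String.prototype.replace`. This is only a sketch: the helper `replaceParams`, the replacement values, and the choice to URL-encode the new values with `encodeURIComponent` are assumptions, not something the question specifies.

```javascript
const pattern = /&(firstName|lastName|email|phone1|address)=([^&]*(?:&[^&=]+(?=&|$))*)/g;

// Hypothetical replacement values, keyed by parameter name.
const replacements = { firstName: "Ann", email: "ann&lee@example.com" };

// Hypothetical helper: swaps in new values for the listed parameters.
function replaceParams(url, values) {
  return url.replace(pattern, (match, name, oldValue) => {
    // name is capturing group 1, oldValue is capturing group 2.
    if (!(name in values)) return match; // no new value: keep as-is
    // Assumption: encode the new value so a raw "&" cannot be misread later.
    return `&${name}=${encodeURIComponent(values[name])}`;
  });
}

const url = "https://example.com?&iframeLoad=true&firstName=&lastName=&email=abc&xyz@.com&zipcode=07307";
const updated = replaceParams(url, replacements);
console.log(updated);
// https://example.com?&iframeLoad=true&firstName=Ann&lastName=&email=ann%26lee%40example.com&zipcode=07307
```

Because the leading & is part of the match, the callback has to put it back in front of the new name=value pair.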

Disclaimer

If any parameters are missing the equal sign, there's no way to unambiguously distinguish a new URL parameter from part of the previous parameter's value. In the example https://example.com?&iframeLoad=true&email=abc&xyz@.com, the query could contain either one parameter named email with the value abc&xyz@.com, or two parameters named email and xyz@.com (unless both the list of parameter strings and the list of value strings are standardized, but down this road lies madness). Similarly, stray equal signs in values trick the parser. As @David Faber mentioned, a & character in a URL would typically be URL-encoded as %26, which prevents this ambiguity entirely.
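
To illustrate the last point, here is a minimal sketch of how encoding the value removes the ambiguity. It assumes you control how the URL is built; the email address and the shortened URL are hypothetical.

```javascript
// Hypothetical value containing a raw "&".
const email = "abc&xyz@example.com";
const encoded = encodeURIComponent(email);
console.log(encoded); // abc%26xyz%40example.com

const url = `https://example.com?&iframeLoad=true&email=${encoded}&zipcode=07307`;

// With encoded values, "everything up to the next &" is unambiguous.
const simple = /&(firstName|lastName|email|phone1|address)=([^&]*)/;
const match = simple.exec(url);
const decoded = decodeURIComponent(match[2]);
console.log(decoded); // abc&xyz@example.com

// The standard URL API parses the same query string without a regex.
const viaApi = new URL(url).searchParams.get("email");
console.log(viaApi); // abc&xyz@example.com
```

Once values are encoded this way, the continuation group from the main expression is no longer needed.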

Graham

You may want to consider something like this:

[&?]((?:firstName|lastName|phone1|address|zipcode)=|email=(?:.*@.*\.)?)[^&]*

The email parameter is handled as a special case here - we check for a local part followed by subdomain(s), while allowing for a TLD without an ampersand (I believe this is safe - I don't think a TLD can contain odd characters like that). All other parameters are handled normally. The matches will be returned as name=value pairs. See Regex 101 here.

David Faber
  • I would not recommend this approach. To begin with, [email address validation can be complicated](https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression/201378#201378). Even if you want to go with this approach, you should avoid the `.*` construction both for efficiency and because of [potential failure cases](https://regex101.com/r/0tJ6IU/2). This solution is also insufficient for the OP's requirement of *replacing* the text in the url: for a replacement, knowing the first character is vital, but there's no way to know it from this expression. – Graham Jun 09 '18 at 17:21
  • I am definitely not suggesting any kind of email validation here. I agree that `.*` is probably best avoided. – David Faber Jun 09 '18 at 17:24
  • Generally, I think *any type* of email validation is a bad approach here because it assumes the input is a legal email, and we can't necessarily assume that. However, if you *are* using a (basic) email validation approach, and you're allowing replacement, the regex you really wanted was [`([&?])((?:firstName|lastName|phone1|address|zipcode)=|email=(?:[^@]*@)?[^&]*)`](https://regex101.com/r/0tJ6IU/3). Additionally, I used two separate regexes in my solution because deletion handling will be different for the first parameter. – Graham Jun 09 '18 at 17:26
  • @Graham, thanks, that is better than my initial regex – David Faber Jun 10 '18 at 12:05
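
The difference discussed in these comments can be checked with a short sketch. The test URL is an assumption chosen so that a later parameter (a hypothetical `site`) contains a dot; both expressions are copied from the thread.

```javascript
// Expression from the second answer, and the revision from the comments.
const withDotStar = /[&?]((?:firstName|lastName|phone1|address|zipcode)=|email=(?:.*@.*\.)?)[^&]*/;
const revised = /([&?])((?:firstName|lastName|phone1|address|zipcode)=|email=(?:[^@]*@)?[^&]*)/;

// Assumed test URL: the "site" parameter after the email contains a dot.
const url = "https://example.com?&email=abc&xyz@.com&isAvailableDirect=false&site=example.org";

const greedy = withDotStar.exec(url)[0];
const bounded = revised.exec(url)[0];

console.log(greedy);  // &email=abc&xyz@.com&isAvailableDirect=false&site=example.org
console.log(bounded); // &email=abc&xyz@.com
```

The greedy `.*` runs on to the last dot in the string and swallows the following parameters, while `[^@]*@` followed by `[^&]*` stops at the first ampersand after the @.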