8

Obviously $data is the string and we are removing the characters that match the regular expression, but what characters are being specified by /[\xF0-\xF7].../ ?

 preg_replace('/[\xF0-\xF7].../', '', $data)

Also, what is the significance of these characters being replaced?

Edit for bounty: specifically, what exploit is this trying to prevent? The data is later used in MySQL queries (non-PDO), so I presume some kind of injection attack involving these characters is the concern. Or not? I am trying to understand the logic behind this line of code in a script I am reading.

Sam Dark
user1796995
  • Matches a range of characters from `xF0` to `xF7` – NullUserException Nov 30 '12 at 23:03
  • But what is the significance of these characters? – user1796995 Nov 30 '12 at 23:04
  • `ð ñ ò ó ô õ ö ÷` – NullUserException Nov 30 '12 at 23:04
  • @user1796995 Did you even try this? – Ruan Mendes Nov 30 '12 at 23:07
  • I mean, why would you want to escape these characters? are they unsafe? – user1796995 Nov 30 '12 at 23:08
  • @user1796995 They are difficult to type, and can be interpreted incorrectly if typed in. Using the escapes ensures that PHP is fed the *exact* characters that are intended. – Sammitch Nov 30 '12 at 23:13
  • @NullUserException Why don't you put a nice answer ? – HamZa Nov 21 '13 at 13:28
  • @user1796995, how is the data used when it is pulled out of the database? It may, simply, be a decision made by the programmer to not allow foreign characters. – Andy Jan 13 '14 at 21:12
  • @user1796995 It seems like you didn't read the comments with your recent edit. `[\xF0-\xF7]` is a range which will match one of the following characters `ð ñ ò ó ô õ ö ÷` like NullUserException said. So there is no "exploit" prevention or anything. It really seems useless but depending on the context it might actually have a goal. That goal, only you know it since you know the "context". – HamZa Jan 13 '14 at 21:16
  • I did read the edits. As I tried to explain, I want to know the *purpose* of escaping those characters. Whether it's against an exploit or something completely different. – user1796995 Jan 13 '14 at 21:18
  • @user1796995 It isn't escaping at all and if I asked you `Would removing some accented letters prevent some exploits ?` what would be your response ? It's very likely that it would be possible. Anyways there are edge cases that we could never imagine, see [this answer](http://stackoverflow.com/a/12118602). Finally, I don't really see your point. If you want to improve security, then just properly use prepared statements. One cannot guarantee anything from just removing few characters. – HamZa Jan 13 '14 at 21:25
  • Re-read my last comment. Cheers. – user1796995 Jan 13 '14 at 21:27
  • @user1796995 If you seriously want an answer, then you should probably add more information: Maybe a background story ? What encoding does your DB use ? What encoding(charset) does the mysql(i) connection use ? Reread those comments again. There is no escaping at all in that line of code. – HamZa Jan 13 '14 at 21:38
  • Clearly characters are being removed from data. What possible reasons this could be for is why I am asking the question. I have suggested it could be to escape data from being used in a query. Or it could be some other reason. *If I knew the reason I would not be asking the question*. Please stop cluttering up my comments section repeating yourself over and over. You've made your point, you don't like my question. – user1796995 Jan 13 '14 at 21:51
  • You can see the result here : http://regex101.com/r/lR0hS9 – mpgn Jan 19 '14 at 09:11
  • Great example of how comments are useful for explaining the "why" when the "what" is obvious. – Fuhrmanator Jan 20 '14 at 18:25

3 Answers

20

It removes 4-byte sequences from a UTF-8 string. In such a sequence the first byte is always in the range [\xF0-\xF7], and the three dots match the remaining 3 bytes.
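A minimal sketch of that behaviour, assuming $data holds UTF-8 bytes (the sample string is made up for illustration). Without the /u modifier PCRE works on raw bytes, so each dot consumes exactly one byte:

```php
<?php
// "café 😀 ok": é is 2 bytes (C3 A9), 😀 is 4 bytes (F0 9F 98 80).
$data = "caf\xC3\xA9 \xF0\x9F\x98\x80 ok";

// Same pattern as in the question: one lead byte in F0-F7 plus the next three bytes.
$clean = preg_replace('/[\xF0-\xF7].../', '', $data);

var_dump(strlen($data));                 // int(13)
var_dump(strlen($clean));                // int(9), four bytes shorter
var_dump($clean === "caf\xC3\xA9  ok");  // bool(true), the 2-byte é is untouched
```

If the input is not UTF-8 (for example Latin-1, where byte 0xF0 is "ð"), the same pattern would delete that letter and the three characters after it, which is probably where the confusion in the comments comes from.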

According to http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html:

The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.

MySQL with the utf8 character set selected may truncate text at the point where such a sequence appears, and if the SQL mode doesn't include STRICT_TRANS_TABLES it may do so silently instead of raising an error like SQLSTATE[HY000]: General error: 1366 Incorrect string value:.

See these for further reference:

Truncation can potentially lead to an exploit.

For example, suppose a website has a user named admin and allows anyone to register. By registering a name that gets truncated on insert, an attacker will probably be able to create a second admin row with a different email address, bypassing the uniqueness check. The attacker then suspends that account and uses the account restore procedure. It will issue a query like SELECT * FROM users WHERE name = 'admin', and since the original admin is the first matching record, the attacker ends up resetting the original admin's password.
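The collision can be sketched without a database. This assumes the truncation behaviour described above; simulate_utf8mb3_truncation() is a hypothetical stand-in for what a non-strict utf8 column would store, not a MySQL or PHP API:

```php
<?php
// Hypothetical stand-in for a non-strict MySQL utf8 (3-byte) column:
// keep everything before the first 4-byte lead byte and drop the rest.
function simulate_utf8mb3_truncation(string $value): string
{
    return preg_split('/[\xF0-\xF7]/', $value, 2)[0];
}

$existing  = 'admin';
$submitted = "admin\xF0\x9F\x98\x80x"; // "admin😀x"

// An application-level uniqueness check compares the untruncated value...
var_dump($submitted === $existing);                              // bool(false)
// ...but what ends up stored collides with the existing name.
var_dump(simulate_utf8mb3_truncation($submitted) === $existing); // bool(true)

// Stripping the sequences first means PHP and the database see the same string.
$sanitized = preg_replace('/[\xF0-\xF7].../', '', $submitted);
var_dump($sanitized);                                            // string(6) "adminx"
```

A UNIQUE index on the name column would reject the duplicate regardless, so the scenario presumably depends on uniqueness being enforced only in application code.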

Sam Dark
1

It's matching one of 8 byte values, plus any 3 characters following, and removing the block of 4 characters. That much you say you know already. Unfortunately, without more context, we can't tell you why these particular 8 bytes are significant. By themselves, they're harmless, regardless of what character glyph they stand for (character encoding). My best guess is that in the application this comes from there is some significance to these 8 characters as markers of some kind. The bytes 0xF0 through 0xF7 all have the bit pattern 11110xxx, which marks the first byte of a 32 bit (4 byte) UTF-8 character, so perhaps it is to remove all 32 bit UTF-8 characters? Are 16 and 24 bit characters (110xxxxx and 1110xxxx first byte) similarly treated?
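The lead-byte patterns mentioned above can be checked with a few bit masks. This is only an illustrative sketch; utf8_sequence_length() is a hypothetical helper, and it looks at bit patterns alone rather than validating UTF-8:

```php
<?php
// Hypothetical helper: length of a UTF-8 sequence judged by its lead byte alone.
function utf8_sequence_length(int $leadByte): int
{
    if (($leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx, ASCII
    if (($leadByte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($leadByte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($leadByte & 0xF8) === 0xF0) return 4; // 11110xxx, i.e. 0xF0-0xF7
    return 0; // continuation byte (10xxxxxx) or not a lead byte
}

foreach ([0x41, 0xC3, 0xE2, 0xF0, 0xF7, 0x9F] as $byte) {
    printf("0x%02X -> %d\n", $byte, utf8_sequence_length($byte));
}
// 0x41 -> 1, 0xC3 -> 2, 0xE2 -> 3, 0xF0 -> 4, 0xF7 -> 4, 0x9F -> 0
```

Only the last class is matched by [\xF0-\xF7], so 16 and 24 bit characters pass through the pattern in the question untouched. Well-formed UTF-8 only uses 0xF0 to 0xF4 as 4-byte lead bytes; the pattern simply covers the whole 11110xxx range.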

Phil Perry
  • Check Sam Dark's answer. You're along the same lines removing all 32 bit UTF-8 characters. He's nailed it I think. – user1796995 Jan 13 '14 at 23:37
  • Yeah, I ended up posting 13 seconds after he did. I was unaware of MySQL truncating 32 bit characters (apparently, not shorter ones). Of course, this usage is dependent on PHP actually treating the 32 bit UTF-8 character as 4 individual bytes (or at least, allowing such access). – Phil Perry Jan 13 '14 at 23:43
0

preg_replace('/[\xF0-\xF7].../', '', $data) replaces one byte in the range xF0 to xF7 plus the next three characters with an empty string ("plus" here means concatenation, not addition).

Manolo