Unicode security issues on PHP
For the last 3 days I put some other things aside to work on this: a couple of Unicode issues in PHP and Firefox. As you can see, securing web applications is also about knowing and understanding how data is encoded and converted. It was obvious to me that I had to find out how to cope with these problems inside the security library SSEQ-LIB.
What are these vulnerabilities about
Once again it is all about not checking or encoding user input, which we all know is evil. Let's learn something about overlong UTF-8:
Overlong UTF-8 (non-shortest form)
First go read what sirdarckcat wrote about overlong UTF-8!
So you're back? Let us understand how an attacker can craft such an overlong UTF-8 sequence. Take the apostrophe (') as an example. Converted to binary it is 00100111. (Actually we put the numeric char code of the apostrophe, which is 39 (see asciitable), into the converter.)
So now we have the binary string 00100111, which means 39, which in turn corresponds to the apostrophe. Now we are going to make it overlong. But first we must have a look at how UTF-8 is encoded. Look closely at the bit patterns in the columns Byte 1 to Byte 4: the leading ones and zeros are very important because they tell the UTF-8 decoder how long the entire character is and which bytes belong to it.
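To make the role of those leading bits concrete, here is a minimal sketch. The function name `utf8_sequence_length()` is made up for this post, and it only mirrors the bit patterns from the table; a strict decoder has to reject more than this.

```php
<?php
// Hypothetical helper: how many bytes does a UTF-8 sequence have,
// judging only by the leading bits of its first byte?
// Returns 0 for a continuation byte (10xxxxxx) or an unknown pattern.
function utf8_sequence_length(int $leadByte): int {
    if (($leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx
    if (($leadByte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($leadByte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($leadByte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0;
}

echo utf8_sequence_length(0x27), "\n"; // 1 (plain apostrophe, 00100111)
echo utf8_sequence_length(0xC0), "\n"; // 2 (11000000 announces a 2-byte char)
echo utf8_sequence_length(0xA7), "\n"; // 0 (10100111 is a continuation byte)
```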
Ok, back again? We want to enlarge a UTF-8 char by one more byte, so we look at the second row of the table from Wikipedia: the first byte has to start with "110" because that means this UTF-8 char is 2 bytes long. The rest we can fill with zeros: 11000000. So we have the first byte.
The second byte has to carry the original value 39, which is our apostrophe. We already know that 39 in binary is 00100111. Too bad that this bit string does not match the UTF-8 definition for second bytes: it has to start with "10". So we simply replace the first 2 bits with "10" and we're done! (This works for any char code below 64, because its first 2 bits are zero anyway.) Our second byte is: 10100111.
We put them together: 11000000 10100111
We convert each to hexadecimal: \xc0 \xa7 or url encoded: %c0%a7.
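The steps above can be sketched in a few lines of PHP. This is only an illustration: the function name `overlong_2byte()` is made up for this post and it handles ASCII char codes only. For char codes of 64 and above, the top bit of the value moves into the first byte instead of being thrown away.

```php
<?php
// Hypothetical helper: build the 2-byte overlong UTF-8 form of an ASCII char code.
function overlong_2byte(int $codepoint): string {
    if ($codepoint < 0 || $codepoint > 0x7F) {
        throw new InvalidArgumentException('only ASCII char codes (0-127) are handled here');
    }
    // First byte: marker 110, then the upper bits of the value (all zero below 64).
    $byte1 = 0xC0 | ($codepoint >> 6);
    // Second byte: marker 10, then the lower six bits of the value.
    $byte2 = 0x80 | ($codepoint & 0x3F);
    return chr($byte1) . chr($byte2);
}

$overlong = overlong_2byte(ord("'"));  // apostrophe, char code 39
echo bin2hex($overlong), "\n";         // c0a7
echo rawurlencode($overlong), "\n";    // %C0%A7
```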
What’s wrong with overlong UTF-8
It is well known that interpreting non-shortest form UTF-8 is a security issue. Unfortunately PHP does interpret overlong UTF-8. This is not a security breach by itself. The point is that other software like web application firewalls, vulnerability scanners and even functions like "addslashes()" do not interpret overlong UTF-8, so attack vectors or chars which should be escaped can pass unnoticed.
So when you escape database input like this:
<?php
$name = utf8_decode(addslashes($_GET['name']));
mysql_query("SELECT * FROM table WHERE name='$name';");
?>
Or when you rely on "magic_quotes" (and you should not!):
<?php
$name = utf8_decode($_GET['name']);
mysql_query("SELECT * FROM table WHERE name='$name';");
?>
Just hope that no one submits this as "name": %c0%a7%20OR%201%2F%2A, which would result in a query that asks for all users in the database:
SELECT * FROM table WHERE name='' OR 1/*';
To sum up: addslashes() and "magic_quotes" are not able to interpret overlong UTF-8, so it passes through without being escaped.
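A small sketch to see this for yourself. It only shows the addslashes() half of the problem, and urldecode() stands in for what PHP does to the request value. Whether utf8_decode() afterwards turns these bytes into an apostrophe depends on your PHP version: as far as I know, newer versions replace invalid sequences with a question mark, and the function is deprecated as of PHP 8.2.

```php
<?php
// Simulated request value; urldecode() stands in for PHP's handling of $_GET['name'].
$raw = urldecode('%c0%a7%20OR%201%2F%2A');

// addslashes() only looks for the literal bytes ', ", \ and NUL.
$escaped = addslashes($raw);

var_dump($escaped === $raw);            // bool(true): nothing was escaped
var_dump(strpos($raw, "'") === false);  // bool(true): no literal apostrophe byte in there
echo bin2hex($escaped), "\n";           // c0a7204f5220312f2a
```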
What can we do about it?
I spent some time figuring out how to check whether non-shortest UTF-8 data contains a potentially dangerous payload or not. In the end the most precise solution seems to me to be counting the special chars before and after "utf8_decode()". The reason why this works is that this kind of attack is based on smuggling in additional special chars which stay hidden until they are revealed by "utf8_decode()". So after decoding we should count more special chars than before.
When converting to an encoding that cannot represent every character, like from UTF-8 to iso-xxxxxx-x, some characters have to be replaced by a question mark (?). We must not count this question mark.
This function tells potentially dangerous overlong UTF-8 apart from harmless overlong UTF-8:
<?php
// Count the ASCII chars in the ranges 33-62, 91-96 and 123-126.
// The question mark (63) and the at sign (64) fall outside these ranges,
// so the "?" that utf8_decode() uses as a replacement char is never counted.
function seq_mb_count_symbols_($string_ = '') {
    $count = 0;
    $length = mb_strlen($string_, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        // ord() returns the first byte, so multi-byte chars are never counted.
        $code = ord(mb_substr($string_, $i, 1, 'UTF-8'));
        if (($code >= 33 && $code <= 62) ||
            ($code >= 91 && $code <= 96) ||
            ($code >= 123 && $code <= 126)) {
            $count++;
        }
    }
    return $count;
}

// True if utf8_decode() reveals special chars that were not visible before.
function seq_check_nonshortest_utf8_($string_ = '') {
    $count = seq_mb_count_symbols_(stripslashes($string_));
    $after_count = seq_mb_count_symbols_(utf8_decode(stripslashes($string_)));
    return $after_count > $count;
}
?>
Check if a string is dangerous:
<?php
if (seq_check_nonshortest_utf8_($string)) {
    // Yes, the string contains hidden special chars. Now take some appropriate action.
    return false;
}
?>
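A different and stricter option than counting chars would be to reject input that is not well-formed UTF-8 at all, since overlong sequences are invalid by definition. The sketch below is not what SSEQ-LIB does, and the function name `is_strict_utf8()` is made up for this post. It relies on PCRE's /u modifier, which validates the subject string and makes preg_match() fail on malformed UTF-8, including overlong forms.

```php
<?php
// Hypothetical helper: true only if $input is well-formed UTF-8.
// With the /u modifier, preg_match() returns false for invalid UTF-8.
function is_strict_utf8(string $input): bool {
    return preg_match('//u', $input) === 1;
}

var_dump(is_strict_utf8("O'Brien"));          // bool(true)
var_dump(is_strict_utf8("Z\xc3\xbcrich"));    // bool(true), a regular 2-byte char
var_dump(is_strict_utf8("\xc0\xa7 OR 1/*"));  // bool(false), the overlong apostrophe
```

The trade-off is that this throws away harmless overlong input too, while the counting approach above tries to tell the two apart.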
Tell me if it works for you too, especially when your OS has some special encoding.
This injection only works if you have something like utf8_decode(addslashes($my_var)). addslashes(utf8_decode($my_var)) doesn’t trigger the problem.
It has been known for a long time that addslashes doesn't protect against Unicode characters. The real solution is obviously to use prepared statements (but I don't know if you can do that in a WAF).
Salut Geoffroy,
thanks for summing up the buggy code. The problem with addslashes and Unicode may be well known, but I was missing an idea of how to find out whether the input has some "hidden" characters or not.
Besides, a WAF could warn you if it detects these "hidden" characters. Moreover it could check input data for allowed type and length and so act in part like prepared statements do. By the way, both things will come with the new version of SSEQ-LIB, the "in-code" web application firewall.
I think the Unicode website has a list of allowed characters that you could check. But that doesn't change the fact that addslashes doesn't support utf8, and sometimes will try to escape characters that don't need escaping.
I didn't do an exhaustive test with mysql_real_escape_string (which supports multiple encodings) and the MySQL encoding configuration, but I think a good solution could lie there (tweaking MySQL parameters so that mysql_real_escape_string will only escape the dangerous characters while decoding utf8).
+1