Types of Security Mechanisms

Web applications receive remotely supplied (and potentially malicious) user data via HTTP request parameters. In PHP applications, user input is always received as data of type string. If this input is processed in a security sensitive operation of the web application, a vulnerability can occur. For example, a dynamically built SQL query with embedded user input may lead to a SQL injection vulnerability. In order to prevent such a critical vulnerability, the user input has to be sanitized or validated beforehand. For this purpose, a security mechanism is applied between the user input (source) and the sensitive operation (sink) such that malicious data cannot reach the sensitive operation.

1. Generic Input Sanitization

Generally speaking, during input sanitization the data is transformed such that harmful characters are removed or defused. The advantage of this approach is that relevant parts of the data stay intact, while only certain characters are removed or replaced. This way, the application can proceed with the sanitized data without a request for resubmission. In the following, we present several ways to sanitize data generically against all types of injection flaws, such as cross-site scripting and SQL injection.

1.1. Explicit Type Casting

Numeric characters can be safely used in security-sensitive string operations. To ensure that only numeric characters are present, a string can be explicitly cast to a number by using the typecast operator or built-in functions.

$var = (int)$var;        //safe
$var = intval($var);     //safe
settype($var, 'int');    //safe

Listing 1: Examples for explicit type casting.

All three operations in Listing 1 ensure a secure use of the variable $var regarding injection flaws. PHP uses duck typing to determine the integer value of a string. An empty string, or a string that does not start with a number, is cast to the number 0. However, if the string starts with a number ("123abc"), that number is used as the result of the typecast (123). We will introduce pitfalls associated with duck typing later on.
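The casting rules described above can be checked with a short sketch. The input strings are hypothetical attacker-controlled values chosen for illustration.

```php
<?php
// Hypothetical inputs; an explicit cast keeps only a leading number.
$inputs = ["123abc", "abc", "", "42' OR '1'='1"];

$casted = [];
foreach ($inputs as $input) {
    $casted[] = (int)$input;   // intval($input) yields the same value
}

var_dump($casted);   // four integers: 123, 0, 0, 42
```

Whatever follows the leading digits is discarded, which is why the result can be embedded in a sensitive sink.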

1.2. Implicit Type Casting

Similar to an explicit typecast, an implicit typecast occurs automatically if data is used in mathematical operations. Listing 2 shows an addition in line 1 in which $var is safely cast to a number before a number is added.

$var = $var + 1;    // safe
$var++;             // unsafe

Listing 2: Examples for implicit type casting.

In contrast, the increment operator in line 2 performs no typecast and also works on strings. For example, the last character in the string aaa will be incremented to aab. Thus, $var can still contain malicious characters.
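The difference between addition and increment can be sketched as follows. The variable names are illustrative only, and newer PHP versions may emit a deprecation notice for some string increments.

```php
<?php
// Addition converts a numeric string to a number.
$a = "7";
$a = $a + 1;          // int(8)

// The increment operator keeps a non-numeric string a string.
$b = "aaa";
$b++;                 // string "aab"

// Post-increment returns the old value, so assigning it back
// discards the increment.
$c = "aaa";
$c = $c++;            // still "aaa"
```

In neither increment case is the value converted to a number, so any special characters in the string would survive.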

1.3. Formatting

Type casting is also performed by PHP’s built-in format string functions. Different specifiers (beginning with a percent sign) can be used in a format string; they determine the data type of the data they will be replaced with. An example is given in Listing 3 that uses the specifiers %s (string) and %d (integer).

$var = sprintf("%s %d", $var1, $var2);  // unsafe / safe

Listing 3: Sanitization with a format string function.

The argument $var1 is embedded unsafely in the string assigned to $var. In contrast, $var2 is safely cast to an integer before it is embedded in the string.
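A minimal sketch of this behavior, assuming two hypothetical request values that both carry a quote-based payload:

```php
<?php
// Hypothetical request values for illustration.
$name = "bob' --";
$id   = "5' OR '1'='1";

// %d casts its argument to an integer, %s copies it verbatim.
$sql = sprintf("SELECT * FROM user WHERE name = '%s' AND nr = %d", $name, $id);

echo $sql, "\n";
// SELECT * FROM user WHERE name = 'bob' --' AND nr = 5
```

The payload passed through %s reaches the query unchanged, while the one passed through %d is reduced to its leading number.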

1.4. Encoding

Exploitation of injection flaws almost always requires special characters. Thus, next to numbers, alphabetical letters can be considered to be a safe character set. By encoding data to an alphanumeric character set, the data is sanitized. Listing 4 provides a few encoding examples.

Although the base64 and URL encodings introduce a few special characters (+, /, =, or %), these are generally not sufficient to form a malicious payload, and the encodings can be considered safe when used in a sensitive sink. Other encodings, however, include the full set of ASCII characters in the transformed output and thus are unsafe to use in sinks. In particular, decoding back to the original data is unsafe because it restores the malicious characters.

$var = base64_encode($var); // safe
$var = urlencode($var); // safe
$var = zlib_encode($var, 15); // unsafe
$var = urldecode($var); // unsafe

Listing 4: Transforming data into different encodings.
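The restricted output alphabets, and the effect of decoding, can be checked with a small sketch. The payload string is a made-up example.

```php
<?php
$payload = "' OR 1=1 -- <script>";   // hypothetical malicious input

$b64 = base64_encode($payload);
$url = urlencode($payload);

// Both encodings emit only a small, fixed alphabet.
$b64IsRestricted = preg_match('/\A[A-Za-z0-9+\/=]*\z/', $b64) === 1;
$urlIsRestricted = preg_match('/\A[A-Za-z0-9%+._-]*\z/', $url) === 1;

// Decoding restores the original payload, including its special characters.
$roundTrip = urldecode($url) === $payload
          && base64_decode($b64) === $payload;
```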

1.5. Filtering

It is also possible to sanitize data by built-in filter functions. If the data passes a filter, it is returned unmodified. Otherwise, false is returned so that the function can also be used for input validation (see Section 3). Listing 5 demonstrates the usage of two filter functions.

$var = filter_var($var, FILTER_VALIDATE_INT); // safe
$var = filter_var($var, FILTER_VALIDATE_EMAIL); // unsafe
$vars = array_filter($vars, 'is_numeric'); // safe
$vars = array_filter($vars, 'is_file'); // unsafe

Listing 5: Sanitization with a filter.

While the filter for integer/numeric values is safe, filtering for valid email addresses or files is not necessarily safe, because the character sets of email addresses and file names allow special characters. For example, the SQL injection payload 1'or'1'-@abc.com can be a valid email address and file name.
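The following sketch exercises the filters discussed above. The input values are hypothetical, and the result of the email check is printed rather than asserted.

```php
<?php
$intOk  = filter_var("42", FILTER_VALIDATE_INT);       // int(42)
$intBad = filter_var("42abc", FILTER_VALIDATE_INT);    // false

// is_numeric keeps only numeric entries; array keys are preserved.
$numeric = array_filter(["1", "2x", "3.5", "' OR 1=1"], 'is_numeric');

// The email payload from the text, which the text describes as a valid address.
$mail = filter_var("1'or'1'-@abc.com", FILTER_VALIDATE_EMAIL);
var_dump($mail);
```

Because a failed filter returns false, the result should be compared with === false; a loose check would also reject the valid integer 0.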

2. Context-Sensitive Input Sanitization

In contrast to generic input sanitization, context-sensitive sanitization removes or transforms only a small set of special characters to prevent exploitation of a specific vulnerability type or a subset of vulnerabilities. Therefore, sanitized data may still cause a vulnerability when used in the wrong markup context or another type of sensitive sink. In the following, we again provide examples of security mechanisms and common pitfalls, inspired by real-world code we found.

2.1. Converting

A common method to distinguish between HTML markup characters and data is to convert markup characters within data to HTML entities. In Listing 6, the built-in function htmlentities() is applied to different HTML contexts.

$var = htmlentities($var);
echo '<a href="abc.php">' . $var . '</a>'; // safe
echo '<a href="abc.php?var=' . $var . '">link</a>'; // safe
echo "<a href='abc.php?var=" . $var . "'>link</a>"; // unsafe
echo "<a href=abc.php?var=" . $var . ">link</a>"; // unsafe
echo '<a href="' . $var . '">link</a>'; // unsafe

Listing 6: Converting meta characters to HTML entities.

With the flags that were the default before PHP 8.1 (ENT_COMPAT), the function htmlentities() converts the < and > characters to the entities &lt; and &gt;, as well as the double-quote character to &quot;. Thus, the data is safely used in line 2, where no new HTML tag can be opened with a < character, and in line 3, where no double quote can be used to break out of the href attribute. However, if single quotes (line 4) or no quotes (line 5) are used for the attribute, an attacker can inject event handlers to execute JavaScript code. Since PHP 8.1, the default flags include ENT_QUOTES, so single quotes are converted as well, whereas the unquoted attribute in line 5 remains exploitable. In line 6, double quotes are used and cannot be broken, but a javascript: protocol handler can be injected at the beginning of the URL attribute to craft a malicious link.
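One possible way to address the single-quote and protocol-handler pitfalls is sketched below. The helper safe_href() and its scheme allowlist are assumptions made for illustration, not part of the listing above.

```php
<?php
// Hypothetical helper: returns an escaped URL for use inside a quoted href
// attribute, or null if the scheme is not explicitly allowed.
function safe_href(string $url): ?string
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme)
        || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return null;   // rejects javascript:, data:, and relative URLs
    }
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

// Passing ENT_QUOTES explicitly converts single quotes on every PHP version.
$attr = htmlentities("' onmouseover='alert(1)", ENT_QUOTES, 'UTF-8');
```

Rejecting relative URLs is a deliberately strict choice in this sketch; an application that needs them would have to extend the check.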

2.2. Escaping

In SQL markup, string values are escaped in order to prevent breaking the quotes the value is embedded in. A prefixed backslash before a quote tells the SQL parser to interpret the next quote as data instead of syntax.

$var = addslashes($var);
$sql = "SELECT * FROM user WHERE nr = '" . $var . "'"; // safe
$sql = 'SELECT * FROM user WHERE nr = "' . $var . '"'; // safe
$sql = "SELECT * FROM user WHERE nr = " . $var; // unsafe

Listing 7: Escaping data for a SQL query.

In Listing 7, a value is escaped with the built-in function addslashes(). It prevents breaking a single- or double-quoted string value (line 2 and line 3). However, when no quotes are used in the SQL query (line 4), breaking quotes is irrelevant and an attacker can inject SQL syntax.

Furthermore, truncating a string after it was escaped introduces a security risk. If the string is truncated in the middle of an escape sequence, an unescaped backslash remains at the end of the string and escapes the closing quote that follows in the query.
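The truncation pitfall can be reproduced with a short sketch. The input and the length limit of five characters are arbitrary choices for illustration.

```php
<?php
$input   = "aaaa'";                  // hypothetical user input
$escaped = addslashes($input);       // aaaa\'
$cut     = substr($escaped, 0, 5);   // aaaa\  (the escaped quote is cut off)

// The trailing backslash now escapes the closing quote of the value.
$sql = "SELECT * FROM user WHERE nr = '" . $cut . "'";

// Truncating first and escaping afterwards avoids the problem.
$safe = addslashes(substr($input, 0, 5));   // aaaa\'
```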

2.3. Preparing

A safer way to separate data and SQL syntax is to use prepared statements (see Listing 8). Here, the SQL statement is prepared with placeholders for parameters. Data can then be bound to each placeholder and will be safely inserted at runtime, regardless of quoting or data type.

$stmt = $db->prepare("INSERT INTO " . $pfx . "user (id, name)
                    VALUES (?, ?)");
$stmt->bind_param('is', $id, $name); // safe: binds an integer and a string

Listing 8: Binding parameters to a prepared statement.

Note that if the SQL statement is built dynamically before it is prepared, it is still vulnerable to SQL injection. In line 1, SQL syntax can still be injected through the table prefix variable $pfx. Another pitfall to be aware of is that the name inserted into the table user can still cause a second-order vulnerability.
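Since identifiers such as a table prefix cannot be bound as parameters, one option is to validate them before the statement text is built. The helper build_insert_sql() and its character allowlist are hypothetical and serve only as a sketch.

```php
<?php
// Hypothetical helper: builds the statement text only if the table prefix
// consists of letters, digits, and underscores.
function build_insert_sql(string $pfx): string
{
    if (preg_match('/\A[A-Za-z0-9_]*\z/', $pfx) !== 1) {
        throw new InvalidArgumentException('invalid table prefix');
    }
    return "INSERT INTO " . $pfx . "user (id, name) VALUES (?, ?)";
}

echo build_insert_sql("app_"), "\n";
// INSERT INTO app_user (id, name) VALUES (?, ?)
```

The returned string can then be passed to prepare(), with the values still bound through placeholders.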

2.4. Replacing

Manual replacing of certain characters is error-prone in practice. In Listing 9, two ways of replacing single-quotes are shown that look safe at first sight.

$var = str_replace("'", "", $var); // unsafe
$var = str_replace("'", "\'", $var); // unsafe
$sql = "INSERT INTO user VALUES ('" . $var . "','" . $var . "')";

Listing 9: Two examples for manual escaping.

In line 1, single quotes are removed completely, and in line 2 they are escaped with a backslash. However, the backslash itself is forgotten in both replacements. Hence, a backslash can be injected to break the single quotes. The second replacement turns \' into \\', which escapes the backslash and leaves the single quote unescaped.
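The effect of the forgotten backslash can be traced with a two-character input. The last statement sketches one possible repair, under the assumption that only the backslash and the single quote need escaping.

```php
<?php
$input = "\\'";   // two characters: a backslash followed by a single quote

$removed = str_replace("'", "", $input);     // a lone backslash remains
$naive   = str_replace("'", "\\'", $input);  // \\' : the quote is unescaped again

// Escaping the backslash first closes the gap (search pairs are applied in order).
$fixed = str_replace(["\\", "'"], ["\\\\", "\\'"], $input);   // \\\'
```

For this input, the repaired replacement produces the same result as addslashes().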

2.5. Regex Replacing

Regular expressions (regex) can be used for string replacement and are error-prone if not specified carefully. For example, in Listing 10, all characters except for those specified in brackets are to be removed to ensure safe data output.

$var = preg_replace("/[^a-z0-9]/", "", $var); // safe
$var = preg_replace("/[^a-z.-_]/", "", $var); // unsafe
echo $var;

Listing 10: String replacement with regular expressions.

The first regular expression allows only lowercase letters and digits. The second regular expression is presumably intended to allow lowercase letters as well as the dot, minus, and underscore characters. However, because the minus sign between the dot and the underscore denotes a range, the full ASCII range between the dot and the underscore is allowed, including the characters < and >, which allow HTML injection.
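The accidental range can be demonstrated with a short sketch. Moving the minus sign to the end of the character class, as in the second pattern, is one common way to make it literal; the input string is a made-up example.

```php
<?php
$input = "<b>Hi_there.</b>";

// ".-_" is read as a range, so <, >, /, and uppercase letters survive.
$broken = preg_replace("/[^a-z.-_]/", "", $input);

// With the minus sign last, it is a literal character.
$fixed  = preg_replace("/[^a-z._-]/", "", $input);

echo $broken, "\n";   // <b>Hi_there.</b>
echo $fixed, "\n";    // bi_there.b
```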

3. Generic Input Validation

Next to input sanitization that transforms data into a safe character set, data can be simply refused if it does not hold a condition or fails a check. This input validation ensures that only data which already consists of a safe character set reaches a sensitive sink and data containing malicious characters is refused. In the following, we introduce generic conditions and checks, found empirically during our analysis, that validate data against all types of injection flaws.

3.1. Null Validation

The easiest way to validate that no malicious character is within a given string is to check if it is empty or not set (see Listing 11). However, this also implies that no data can be used. A null validation is commonly used in combination with a previous unset() operation. A static code analysis tool should be able to calculate the boolean logic behind a not operator and multiple else or elseif branches (line 4).

if (empty($var)) { } // safe
if (!isset($var)) { } // safe
if (!$var) { } // safe
if (empty($var)) { } else { } // unsafe

Listing 11: Validating a variable’s initialization.

3.2. Type Validation

Validation can also be performed by checking the data type. Listing 12 shows four examples that check for a numeric data type. In line 3, PHP’s duck typing is used when a string is provided for an integer typecast. According to its rules, a string that starts with a nonzero number yields a truthy typecast result and thus bypasses the validation, although it may contain malicious characters after the number. The same applies to the condition in line 4; there, however, $var is sanitized by overwriting it with the typecast result.

if (is_numeric($var)) { } // safe
if (is_int($var) === true) { } // safe
if ((int)$var) { } // unsafe
if ($var = (int)$var) { } // safe

Listing 12: Validating a variable’s type.
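The difference between these checks can be sketched with a hypothetical tainted value that starts with a digit:

```php
<?php
$input = "1<script>alert(1)</script>";   // hypothetical tainted value

$castCheck    = (bool)(int)$input;     // true: the cast result 1 is truthy
$numericCheck = is_numeric($input);    // false
$digitCheck   = ctype_digit($input);   // false

// Overwriting the variable with the cast result removes the payload.
$sanitized = (int)$input;              // int(1)
```

Only the cast-based condition accepts the value, and it is harmless only if the cast result replaces the original string.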

3.3. Format Validation

Next to the data type, a specific data format can be enforced. For example, the time and date format ensures that no malicious payload can be crafted with the given set of characters (see Listing 13). Other formats, however, might allow malicious characters, such as parts of the URL format.

if (checkdate($var)) { } // safe
if ($var = strtotime($var)) { } // safe
if ($vars = parse_url($var)) { } // unsafe

Listing 13: Validating a variable’s format.

3.4. Comparing

By comparing input against a specific non-malicious value, the data is implicitly limited to this value. In PHP, this can be done by the equal operator, the identical operator, or built-in functions (see Listing 14).

if ($var == 'abc') { } // safe
if ($var === 'abc') { } // safe
if (!strcmp($var, 'abc')) { } // safe
if ($var == 1) { } // unsafe
if ($var === 1) { } // safe

Listing 14: Validating a variable’s string content.

Care should be taken when using the equal operator (==, line 4). It performs a type-unsafe comparison by using duck typing on its operands. Therefore, in PHP versions before 8.0, any string starting with the number 1 is cast to the integer 1 when compared with an integer. Thus, malicious characters in this string bypass the comparison to 1. Since PHP 8.0, a non-numeric string is instead compared to the integer as a string, which rejects such input. A type-safe comparison is performed with the identical operator (===) in all versions.

3.5. Explicit Whitelisting

To compare input against a set of whitelisted values, an array can be used as a lookup table, as shown in Listing 15. The lookup can be performed either by array key (line 2 to line 4) or by array value (line 5 to line 7).

$whitelist = ['a' => true, 'b' => true, 'c' => true];
if (isset($whitelist[$var])) { } // safe
if ($whitelist[$var]) { } // safe
if (array_key_exists($var, $whitelist)) { } // safe
if (in_array($var, ['a', 'b', 'c'])) { } // safe
if (in_array($var, [1, 2, 3])) { } // unsafe
if (in_array($var, [1, 2, 3], true)) { } // safe

Listing 15: Using an explicit whitelist for validation.

Looking up a value in an array follows the same rules as comparing two values with the equal operator. Thus, the example in line 6 is unsafe because, in PHP versions before 8.0, the string 1abc is cast to 1 and found in the array. To avoid this, the strict parameter has to be set to true. Similar pitfalls occur when using the built-in function array_search().
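A brief sketch of strict lookups follows. Since request parameters arrive as strings, a strict whitelist has to contain strings to match legitimate input; the values are illustrative.

```php
<?php
$input = "1abc";   // hypothetical tainted value

$strictInts    = in_array($input, [1, 2, 3], true);         // false: types differ
$strictStrings = in_array($input, ['1', '2', '3'], true);   // false: no exact match
$exactMatch    = in_array('2', ['1', '2', '3'], true);      // true

// array_search() accepts the same strict flag as its third argument.
$key = array_search('2', ['1', '2', '3'], true);            // int(1)
```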

3.6. Implicit Whitelisting

Next to an array, a value can be compared against a fixed set of items. For example, method and property names are limited to an alphanumerical character set. If a value matches one of these (method_exists()), it implies that no malicious character is contained.
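As a minimal sketch of implicit whitelisting, the following dispatcher accepts a value only if it names an existing method. The class PageController and the function dispatch() are hypothetical and exist only for illustration.

```php
<?php
// Hypothetical controller class used only for illustration.
class PageController
{
    public function show(): string { return 'show page'; }
    public function edit(): string { return 'edit page'; }
}

// Dispatches only if $action names an existing method of the controller.
function dispatch(PageController $c, string $action): ?string
{
    if (!method_exists($c, $action)) {
        return null;   // value is not part of the implicit whitelist
    }
    return $c->$action();
}
```

Inside the guarded branch, $action can only hold one of the method names defined by the class.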

3.7. Second-order Validation

Similar to a whitelist, a value can be looked up in a resource, such as the file system or a database. Listing 16 shows an example where an email is looked up in the table user. Only if a user with the email address exists, the path is reached. Similarly, three additional examples show a check for the presence of a file name.

$var = addslashes($var);
$r = mysqli_query($db, "SELECT * FROM user WHERE mail='$var'");
if (mysqli_num_rows($r)) { }
if (file_exists($var)) { }
if (realpath($var)) { }
if (stat($var)) { }

Listing 16: Database and file name lookup.

The safety of the validation depends on the values present in the database or the available file names. If the application allows inserting arbitrary email addresses into the database or uploading files with arbitrary names, the validation is unsafe.

4. Context-Sensitive Input Validation

Input validation can also be performed context-sensitively: for a subset of vulnerability types, the data is validated against a safe character set or the absence of malicious characters regarding the vulnerability type and markup context in a specific code path. Another vulnerability type or another markup context within the same path may still be exploitable. In the following, we introduce examples for context-sensitive input validation we found in our study.

4.1. Searching

For a specific context, user input can be validated by proving the absence of a malicious character required for exploitation. For example, if no < character is found in the input, it can be considered safe regarding XSS in the context of an HTML tag. Two typical search examples are shown in Listing 17.

if (!strpos($var, '<')) { } // unsafe
if (strpos($var, '<') === FALSE) { } // safe

Listing 17: Searching for a specific malicious character.

The first example is unsafe, because strpos() returns the offset at which the character was found in the string. If the string starts with a < character, offset 0 is returned that evaluates to false in the if-condition. Thus, the first validation can be bypassed.
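The bypass can be traced with an input that starts with the searched character. The variable names are illustrative.

```php
<?php
$input = "<script>alert(1)</script>";   // hypothetical payload

$pos = strpos($input, '<');             // int(0): found at the first position

$looseSaysClean  = !$pos;               // true: offset 0 is treated like "not found"
$strictSaysClean = ($pos === false);    // false: the character was found
```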

4.2. Length Validation

Note that a length restriction may or may not prevent exploitation, depending on the vulnerability type and its markup context. For example, on MySQL, the SQL injection of the three characters '-' is equal to a 'or'1'='1 injection. For an XSS vulnerability, three characters are usually not enough for exploitation. Thus, a string length validation, as shown in Listing 18, is context-sensitive.

if (strlen($var) < 3) { }

Listing 18: Validating the length of a variable.

4.3. Regular Expressions

Regular expressions are a useful tool to perform very precise input validation. In Listing 19, three examples are shown that are intended to allow only alphanumerical characters in the subsequent path.

if (!preg_match('/[^\w]/', $var)) { } // safe
if (preg_match('/\w+/', $var)) { } // unsafe
if (preg_match('/^\w+$/', $var)) { } // safe

Listing 19: Validating the character set with regex.

The first example ensures that no characters are present except for word characters (\w: alphanumerical characters and the underscore). The second example checks that such characters are present. However, it fails to check the complete string due to the missing boundary anchors (compare to line 3). Hence, one matching character at any position of the string is enough to bypass the validation. More pitfalls regarding regular expressions can be found in Section 2.5.
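The effect of the anchors can be sketched as follows. The sketch also covers a related detail that the listing does not show: without the D modifier, $ also matches before a final newline, so the anchors \A and \z are used in the last pattern as a stricter alternative.

```php
<?php
$partial  = preg_match('/\w+/', "a<script>");     // 1: one word character suffices
$anchored = preg_match('/^\w+$/', "a<script>");   // 0: the whole string must match

$newline  = preg_match('/^\w+$/', "abc\n");       // 1: $ matches before a final newline
$strict   = preg_match('/\A\w+\z/', "abc\n");     // 0: \z matches only at the very end
```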

5. Path Sensitivity

A security mechanism can also be spread across multiple paths of the control flow. In this case, path-insensitive code analysis reports false positives when impossible path combinations are considered. In the following, we present examples for path-sensitive applications of security mechanisms and outline the challenges for static code analysis.

5.1. Path-sensitive Sanitization

In Listing 20, the variable $var is implicitly sanitized by first checking for a numerical data type. If this condition does not hold, the variable is sanitized. For example, in line 2, the variable is set to the integer 0 which effectively limits the variable to numerical characters for all paths after the if-block. Similarly, the variable could be unset (line 3) or context-sensitive sanitization could be applied (line 4).

if (!is_numeric($var))
    $var = 0;
    //unset($var);
    //$var = addslashes($var);

Listing 20: Path-sensitive sanitization.

Typically, static code analysis tools fail to recognize this type of input sanitization because all execution paths are considered separately. Thus, it is assumed that the variable is not modified when the if-path is not taken. However, not taking the if-path implies that the variable’s value is already numerical.

5.2. Path-sensitive Termination

A similar confusion of static analysis can occur when the program is terminated based on input validation. In Listing 21, the program execution is halted if $var is not numerical. Alternatively, a loop could be aborted (break) or the control-flow of a user-defined function ended (return).

if (!is_numeric($var))
    die('not numeric.');

Listing 21: Path-sensitive program termination.

A static analysis tool should not only be aware of the fact that there is no jump from the if-block to the following code, but also that the conditional termination of the program prevents any non-numerical characters after the if-block. Considering more complex code and the halting problem, which states that no program can decide for every program whether it terminates, it is evident that static code analysis cannot reason about all security mechanisms correctly.

5.3. Path-sensitive Validation

Another challenge for static analysis is path-sensitive usage of input validation. A typical example is given in Listing 22, where the variable $error is used to flag bad input.

$error = false;
if (!is_numeric($var))
    $error = true;
if (!$error) { }

Listing 22: Path-sensitive validation.

The variable $error is independent from the variable $var that is analyzed for tainted input. Thus, its relevance for input validation is likely missed by path-insensitive static analysis. In contrast, analyzing all variables in all conditions of an execution path for input validation is very expensive for long paths and inter-procedural data flow.