Assessing Static Application Security Testing Tools With Synthetic Applications
The challenges of SAST tool evaluation
As a technical specialist working at an Application Security vendor, I have helped hundreds of customers in their evaluations of Static Application Security Testing (SAST) tools. The application security market has matured to a point where the overwhelming majority of evaluations are run because of dissatisfaction with existing tools, rather than an outright new technology purchase. There are some very common tool requirements that my customers are looking for: ease of use for developers, integration into third-party tools and dashboards, analysis speed, programming language coverage, and, often the primary concern, accuracy, or as it is sometimes phrased (incompletely, as we will see below), ‘low false positive count’.
How to measure the effectiveness of a tool can be debated, but in most conversations I have had it comes down to how much risk reduction occurs as a result of using the tool. Risk reduction is impacted by tool accuracy in two ways. ‘Precision’ is what most people think of when they discuss accuracy: the proportion of reported alerts that are true, where a fraction closer to 1 is better (this is what a ‘false positive count’ is really describing). Accuracy also has a component of ‘recall’: the proportion of all real vulnerabilities that the tool actually detects, again with a fraction closer to 1 being better (the counterpart of the ‘false negative count’).
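To make the two fractions concrete, here is a minimal sketch that computes them from hypothetical triage counts (the numbers are purely illustrative, not data from any real evaluation):

# Minimal sketch: precision and recall from hypothetical triage counts.
# The counts below are illustrative only, not results from a real evaluation.
true_positives = 40   # real vulnerabilities the tool reported
false_positives = 10  # reported findings that turned out to be noise
false_negatives = 25  # real vulnerabilities the tool missed

precision = true_positives / (true_positives + false_positives)  # 0.80: how much of the reported work is real
recall = true_positives / (true_positives + false_negatives)     # ~0.62: how much of the real risk was found

print(f"precision: {precision:.2f}, recall: {recall:.2f}")

A tool can score well on one of these fractions while scoring poorly on the other, which is the tension the rest of this article explores.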
The impact of false positives vs. false negatives
Precision can be thought of as the amount of wasted work the tool generates for your developers. Low precision, i.e. a high false positive count, is a motivation and trust destroyer. Low recall, i.e. a high false negative count, doesn’t cause wasted work in quite the same way as false positives, but it does destroy trust. Unfortunately this can happen when a tool does not detect a weakness which is then exposed later on through penetration testing or, worst case, through exploitation. False negatives are most often visible during evaluations when comparing one tool against another.
There is a balancing act for any SAST tool vendor between these two elements of accuracy, and a third element, analysis speed. A tool could be built for high precision but this could come at the cost of recall as the vulnerability detection rules are overly specific. A tool could be built for high recall at the expense of precision as the rules are overly broad. To achieve both precision and recall at fractions close to 1 will almost certainly come at the expense of analysis speed for a typical compute environment.
There is also a balancing act for you as a SAST tool user. A low false positive count is crucial if your developers are going to be happy using the tool you choose; without it the tool will be rejected almost immediately. A low false negative count is usually needed to keep the security team happy. Here is where things get tricky, because it’s not the security team that actually fixes vulnerabilities in code, it’s your developers, and the trade-off the SAST tool vendor makes in their balancing act directly affects how well you can satisfy both your developers’ needs and your security team’s needs.
Ultimately this trade-off can play out as ‘we chose the tool that produced the most results, but we fixed nothing because our developers hated using the tool due to overwhelming false positives’. Choosing a tool based solely on the raw result count happens more often than you might imagine.
Testing the balance during evaluation
The last thing your security team wants is for your company to be breached via an application vulnerability, and to find out during their post-mortem that some other tool would have found the issue before it made it into production. They also don’t want to buy a tool and then have it immediately shot down for overloading your developers with work.
These are legitimate concerns but, given no SAST tool vendor is going to tell you their product isn’t accurate, the only viable method of evaluation is to test each product in your short list against some well known code base and look for false negatives and false positives.
Testing against a well known code base means asking an active developer on that code base to participate in your evaluation. That developer is likely working through a backlog of features, bug fixes and refactors that has already been through several rounds of prioritisation and business commitments, so when you do ask them to participate it’s reasonably likely the amount of time they can dedicate to your request is small.
This presents a problem, because without a proper assessment of the SAST tool results there is no chance you can stand up and confidently say ‘we did our best’ should a breach occur. What does it mean to ‘do your best’? Well, the SAST tool you choose has to be accurate as we have discussed so far, but it also has to be accurate in the context of your company’s code.
What usually happens as a result of developer time constraints is your security team look for alternative well known code bases, or code bases that are seemingly easy to understand quickly. This is where ‘synthetic apps’ come into the picture. There are a number of synthetic apps available in the application security world, most of them designed as an educational tool first. OWASP lists 102 deliberately vulnerable applications across a variety of programming languages and technologies.
It is tempting to see these applications as the perfect solution to the known code base problem. You will usually be able to find a vulnerable application that matches your main technology stack, the source code is highly likely to be available on GitHub, the application architecture is unlikely to be complex, and they already contain a bunch of vulnerabilities that your security team and developers are familiar with. Unfortunately, in the context of evaluating a SAST tool to address your requirements, this approach is mistaken. Let’s look at why.
Lack of real world code relevance
Synthetic apps are designed to be easy to run and easy to add vulnerabilities to. They exist either to demonstrate the way in which a particular vulnerability class can be created in a particular programming language, or as a black box application which learners can experiment with to educate themselves about application vulnerabilities. They absolutely do not need to replicate a large scale enterprise architecture, and they have none of your company’s specific design and implementation practices baked in. In fact they are usually written and maintained by a small number of individuals, which results in a limited variety of code patterns.
SAST tool vendors spend a large amount of time creating rules that work in the context of their paying customers’ applications and the typical enterprise software patterns and practices used there. Synthetic apps are rarely, if ever, able to appropriately test these rules to their full extent.
The crucial question to ask yourself during your evaluation is ‘does this SAST tool work on our code, and with our application design and coding standards?’ That means testing on your code and your code alone. It also means testing for known, existing vulnerabilities, perhaps identified by prior SAST analysis or by a penetration test. Testing anything but your own code proves nothing of use.
One additional comment here: deliberately adding vulnerabilities to your own code in order to test a SAST tool is only marginally better than using a synthetic app, because these hand-crafted changes typically display the same issues as the vulnerabilities in synthetic apps.
Trivial or non-exploitable vulnerabilities
Following swiftly on from real world code relevance is the notion of ‘real world representation’ of vulnerabilities. The typical vulnerability in a synthetic app is either trivially constructed, or potentially not exploitable because there is no code execution path from a user controlled input (AKA the source) to the risky function call (AKA the sink). The former often appears as a single function that both takes user input and makes a database request within the same function body. The latter could present as ‘user controlled input’ that is actually a hardcoded literal string in the application code; a SQL query built via string interpolation, which is bad practice but not necessarily exploitable. Both patterns are shown in the example below.
from flask import Flask, request

app = Flask(__name__)
# 'connection' is assumed to be an existing DB-API database connection

# A function that takes user input and executes a SQL query
@app.route("/hello")
def hello():
    name = request.args.get("user")
    with connection.cursor() as cursor:
        cursor.execute(f"SELECT * FROM users WHERE username = '{name}';")
        result = cursor.fetchone()
    return str(result)

# A function that has no user controlled input
@app.route("/goodbye")
def goodbye():
    name = "Jim"
    with connection.cursor() as cursor:
        cursor.execute(f"SELECT * FROM users WHERE username = '{name}';")
        result = cursor.fetchone()
    return str(result)
Tools that do a poor job of full control flow and data flow analysis (AKA taint tracking) can still do a good job of detecting vulnerabilities that exist within a single function body. The effort required to model intra-procedural control and data flow is much less than doing so across a full application, and in some cases a basic grep can identify weaknesses of this sort. Performing well on this type of issue using basic analysis methods can lead to a tool producing a lot of noise that ultimately frustrates your developers.
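To make the distinction concrete, here is a hedged sketch of the kind of pattern realistic code tends to contain (the route, helper, and table names are invented for illustration): the user controlled source and the SQL sink sit in different functions, so detecting the flaw requires tracking data flow across the function call rather than within a single body.

# Sketch of an inter-procedural flow: the source (a request argument) and the
# sink (cursor.execute) live in different functions, so connecting them requires
# cross-function taint tracking. Route, helper, and table names are illustrative.
from flask import Flask, request

app = Flask(__name__)
# 'connection' is assumed to be an existing DB-API database connection

def find_user(cursor, name):
    # The sink: 'name' is interpolated straight into the SQL statement
    cursor.execute(f"SELECT * FROM users WHERE username = '{name}';")
    return cursor.fetchone()

@app.route("/profile")
def profile():
    # The source: user controlled input from the query string
    name = request.args.get("user")
    with connection.cursor() as cursor:
        result = find_user(cursor, name)
    return str(result)

A simple grep for cursor.execute would flag this and the non-exploitable ‘goodbye’ example above equally, which is exactly the noise problem described. A safe version would bind the value as a query parameter instead, for example cursor.execute("SELECT * FROM users WHERE username = %s;", (name,)) with a driver that uses %s placeholders.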
For vulnerabilities that are non-exploitable, those using hardcoded ‘user input’ for example, there is often a conversation about why those ‘vulnerabilities’ were not detected. The process of evaluating a selection of SAST tools rarely lasts for more than a handful of weeks. Taking time to troubleshoot, understand, and explain why these issues should not be reported can significantly eat into the time available for the evaluation, and can seed unfounded doubt in the minds of evaluation participants who are not close to the technical side of things.
Vendor tuning
The continued use of synthetic applications to evaluate SAST tools, both by a vendor’s customers and by consultants and researchers, has led to vendors spending valuable engineering effort improving accuracy on these applications, merely to achieve a better score on what we now know to be unrepresentative code bases. As a customer evaluating these tools on synthetic applications, you can take no transferable measurements from your analysis results. SAST vendors are literally seeing the exam questions months before the exam.
In conclusion
Testing SAST tools well in a limited amount of time is hard: you need both detailed knowledge of your own code base and familiarity with the particular configurations, rules, and quirks of the tools you are testing. We all want some confidence that the tool choice will be the right one, based on a low false positive and false negative count (in my opinion with a bias towards the false positive count), but it’s only meaningful to perform this analysis in the context of your own code, as that is what you will be using the tool on day-to-day. Synthetic apps just won’t cut it, and they create a false sense of performance that can bite you later down the line. I hope this article gives you food for thought if you are considering the use of synthetic applications during your own SAST tool evaluation. These applications have their place, but it’s important to remember that their original design intention is usually education and experimentation, not replication of your application code.