Improving Cyber Defense Machine Learning Through “Vaccination”


Artificial intelligence (AI) and machine learning (ML) are exciting concepts in every field, but especially in cybersecurity. The cybersecurity industry faces a significant shortfall of trained personnel and a rapidly growing number of attacks. By leveraging AI and ML, cyber defenders have the potential to close that gap and defend more efficiently against new attack vectors.

 

Web applications represent a significant attack surface for organizations, and web application security is a promising application of ML-based cyber defense solutions. Attackers are always developing new attacks, and ML-based systems have the potential to detect and block them. However, developing an ML-based algorithm capable of operating effectively in an adversarial environment, with hackers actively attempting to defeat it, can be more difficult than it sounds. The old saying “garbage in, garbage out” goes double for ML and AI.

 

The Challenges of Machine Learning

 

Machine learning (ML) and artificial intelligence (AI) have the potential to dramatically change the state of many different industries. Traditional, human-written algorithms for data processing require that the developer understand the system and the data to be processed in order to write the code.

 

For example, if a developer wants to create a human-written algorithm for identifying pictures of blue cars, they would need to explicitly define the color blue and what a car looks like, and encode those definitions in the software. Then, when the software is presented with a picture of a blue car, it can properly identify it.

 

With ML and AI, a developer doesn’t need that same level of understanding or the ability to write explicit definitions into code. An ML-based algorithm is self-teaching. The developer’s responsibility is to present the algorithm with a training dataset of images labeled as containing a blue car or not. The algorithm then extracts features of the images for itself, learning its own definition of what a “blue car” really is.
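In practice, that workflow looks something like the sketch below: a minimal supervised-learning example using scikit-learn, where the file names, feature shapes, and the choice of logistic regression are illustrative assumptions rather than any specific product’s pipeline.

```python
# Minimal supervised-learning sketch: the developer supplies labeled examples,
# and the model learns its own notion of "blue car" from them.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical pre-extracted features: one row per image, y = 1 for "blue car", 0 otherwise.
X = np.load("image_features.npy")
y = np.load("labels.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# No explicit definition of "blue" or "car" is ever written into the code;
# the classifier infers the decision boundary from the labeled data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```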

 

However, creating a “good” dataset for training the algorithm can be more difficult than it sounds, especially in the field of cybersecurity. Unusual data points, or “outliers”, can distort a model’s training process. For example, in the rain a blue car may look gray, causing the model to learn that “gray” is also “blue”.

 

The same issue applies to ML applications for cybersecurity. Many businesses are building ML-based cyber defense products, but the data these algorithms are trained on can be messy. In many cases, developers are in a hurry and don’t properly “scrub” their datasets before training. As a result, the algorithm learns from data that labels benign events as malicious and vice versa, producing an incorrect model. These errors can create a false sense of security as the deployed system misses real cyberattacks.
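One common scrubbing step is to hunt for likely mislabeled samples before training. The sketch below is one hedged way to do that using out-of-fold predictions; the feature files, the random-forest model, and the 0.9 confidence cutoff are all illustrative assumptions.

```python
# Flag training samples whose label strongly disagrees with an out-of-fold
# prediction, then route them for manual review instead of training on them blindly.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X = np.load("http_request_features.npy")  # hypothetical feature matrix
y = np.load("labels.npy")                 # 1 = "malicious", 0 = "benign" (possibly noisy)

# Out-of-fold probability that each sample is malicious.
proba = cross_val_predict(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y, cv=5, method="predict_proba",
)[:, 1]

# Benign-labeled samples the model is confident are malicious (and vice versa)
# are the most likely labeling errors.
suspect = ((y == 0) & (proba > 0.9)) | ((y == 1) & (proba < 0.1))
print(f"{suspect.sum()} of {len(y)} samples flagged for manual review")
```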

 

Alternatively, developers may go too far in the other direction by removing all outliers from a dataset. A very homogeneous dataset is easy to model and makes it possible to publish exceptional detection statistics in the marketing materials, but it doesn’t hold up in the real world: any natural variation in the data presented to the system will cause incorrect classifications.
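That tension shows up even in a simple cleaning routine. The z-score filter sketched below is a generic illustration, not a recommended recipe; the threshold is the knob that decides whether you are removing genuine garbage or stripping out the natural variation the model needs to see.

```python
# Simple z-score outlier filter: drop rows whose features lie far from the mean.
import numpy as np

def filter_outliers(X, y, z_threshold=4.0):
    """Keep only rows within z_threshold standard deviations of the feature means."""
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-9            # avoid division by zero on constant features
    z = np.abs((X - mean) / std)
    keep = (z < z_threshold).all(axis=1)
    return X[keep], y[keep]

# A conservative threshold (e.g. 4.0) removes only extreme points; an aggressive
# one (e.g. 1.5) also discards legitimate variation and over-sanitizes the dataset.
```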

 

“Vaccinating” Algorithms

 

The risk of a bad training dataset creating a bad ML model exists even in non-adversarial environments, and it is only exacerbated in cyber defense applications. In cybersecurity, hackers are actively trying to break into the network protected by an ML-based solution. As a result, they commonly apply small amounts of “distortion” to the data presented to the algorithm in order to fool it.
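For a linear detector, that distortion can be as simple as nudging each feature slightly against the model’s weights, in the spirit of the fast gradient sign method. The toy weight vector and sample below are purely illustrative.

```python
# Toy example of an evasion perturbation against a linear detector (score = w . x):
# shift a malicious sample slightly against the weight vector so its score drops
# while the sample itself barely changes.
import numpy as np

def fgsm_like_perturbation(x, weights, epsilon=0.05):
    """Move x a small step in the direction that lowers a linear model's score."""
    return x - epsilon * np.sign(weights)

w = np.array([0.8, -0.2, 0.5, 0.1, 0.9])           # hypothetical detector weights
x_malicious = np.array([1.0, 0.3, 0.7, 0.2, 1.0])  # hypothetical malicious sample

print("original score: ", float(w @ x_malicious))
print("perturbed score:", float(w @ fgsm_like_perturbation(x_malicious, w)))
```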

 

Solving this problem is the focus of research by CSIRO’s Data61 team. These researchers have acknowledged the potential impact of an adversary corrupting the data sent to an ML-based model and have decided to beat attackers to the punch. The study explores how deliberately including modified data in the training dataset affects the algorithm’s ability to build a good detection model.

 

And the results of the research are promising. The researchers began by generating a model that classifies image data, including in the training dataset images designed to mimic worst-case attack scenarios. Despite these anomalous data points, the machine learning algorithms still manage to generate a model, and the result is much more robust to attack than one built from purely “clean” data.
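In spirit, “vaccination” is a form of adversarial training: augment the training set with deliberately perturbed copies of the samples so the model also learns from worst-case inputs. The sketch below is a generic illustration of that idea (reusing the FGSM-style perturbation from the earlier sketch), not the exact procedure used by the Data61 researchers.

```python
# Generic adversarial-training ("vaccination") sketch: train on clean data,
# perturb each sample against the model, then retrain on clean + perturbed data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_vaccinated(X, y, epsilon=0.05):
    # Baseline model fit on the clean training set.
    base = LogisticRegression(max_iter=1000).fit(X, y)

    # FGSM-style perturbation: push malicious samples (y=1) toward "benign"
    # and benign samples (y=0) toward "malicious", keeping the original labels.
    X_adv = X - epsilon * np.sign(base.coef_[0]) * (2 * y[:, None] - 1)

    # Retraining on the augmented set pushes the decision boundary past the
    # worst-case distortions an attacker might apply.
    X_aug = np.vstack([X, X_adv])
    y_aug = np.concatenate([y, y])
    return LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```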

 

Securing Web Applications Using ML

 

These new methods of improving ML algorithms through “vaccination” are an important first step in developing cyber defense solutions capable of detecting attacks against web applications. However, they’re not a perfect solution. If an attacker can corrupt the training data or learn the features that an ML-based algorithm uses to make detections, they can design malware specifically to avoid detection. A major concern for cyber defenders is that the future of cybersecurity will include AI vs. AI battles where attacking and defending AIs try to understand and overcome each other’s algorithms.

 

The potential for ML-based solutions to be defeated is why defense in depth is so important for protecting crucial systems like web applications. Using machine learning to protect the network is a great idea (and is incorporated into modern cyber defense products), but ML-based defenses work best as a supplement to existing solutions. A modern web application firewall (WAF) should incorporate both machine learning and signature-based detection mechanisms to provide comprehensive protection for an organization’s web applications.
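As a rough illustration of that layered approach, the sketch below combines a handful of hypothetical signature rules with a score from an ML model, blocking a request if either layer flags it. The signatures, the scoring model, and the 0.8 threshold are placeholders, not any particular WAF’s implementation.

```python
# Layered WAF check: block on a signature match OR a high ML anomaly score.
import re

SIGNATURES = [
    re.compile(r"(?i)union\s+select"),   # classic SQL injection pattern
    re.compile(r"(?i)<script\b"),        # reflected XSS attempt
    re.compile(r"\.\./\.\./"),           # path traversal
]

def is_malicious(request_body: str, ml_score: float, threshold: float = 0.8) -> bool:
    """Return True if a known signature matches or the ML score exceeds the threshold."""
    if any(sig.search(request_body) for sig in SIGNATURES):
        return True
    return ml_score >= threshold

# ml_score would come from a trained anomaly-detection or classification model.
print(is_malicious("id=1 UNION SELECT password FROM users", ml_score=0.1))  # True (signature)
print(is_malicious("name=O'Brien", ml_score=0.92))                          # True (ML layer)
```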

 
