Predicting Malicious Network Connections Using Splunk and AI

One of the most fundamental responsibilities of a blue team is detecting threats within the environment. It’s the reason tools like antivirus, EDR, and IPS/IDS exist in the first place. Without these tools, the bread-and-butter job of defending environments becomes a lot more difficult. Blue teams and red teams are in a constant game of cat and mouse; red teams develop new and creative approaches to attack environments, while blue teams come up with ways to detect (and ideally prevent) those attacks. As someone who’s played on both red and blue teams, I understand how difficult it is to detect all of these new kinds of attacks. Unfortunately, traditional methods such as signature detection (e.g. antivirus) are increasingly ineffective, as attackers can encrypt payloads or perform completely fileless attacks. Even EDR, arguably the best endpoint defense available today, can be bypassed with the right techniques.

However, there is one thing that is hard for attackers to mask in any kind of attack: the network connection itself. Whether the attacker is establishing command and control or executing an attack from an external source, some kind of connection to the environment has to exist. The external source can be a cloud IP, hidden behind a VPN, or routed through some other form of obfuscation, but the connection from the attacker has to exist. Therefore, one has to wonder whether it’s possible to identify malicious connections based solely on the properties of the network connection itself.

My previous posts have focused on anomaly detection within logs using machine learning. That is a great approach for defending an environment, but network connections work a little differently. Sure, you can develop an AI to detect anomalous network connections (which may prove useful under the right circumstances), but doing so will surface a lot of connections that are in fact benign and simply occur infrequently in the environment. Instead of anomaly detection, is there a way to detect malicious connections solely through the properties of the connection itself?

Machine learning is great at anomaly detection, but it also excels at prediction. Rather than letting the model decide on its own what is unusual, you can train it to predict a specific outcome from labeled examples. This is similar to anomaly detection, but anomaly detection is considered unsupervised learning, while prediction is considered supervised learning. In unsupervised learning, an AI makes its own decisions about what it learns from a given set of data. When I created the AI to detect malicious domain controller logs, it learned the domain controller logs it was trained on and was able to determine when a new log deviated from what it had learned. With supervised learning, the creator of the AI tells it what to learn based on certain features of the data.

Training an AI to predict malicious network connections will require us to give it the properties of known malicious network data. Once we tell it what malicious network data looks like, it can then use what it learned to determine whether new network connections are malicious with some degree of statistical certainty. We’re going to see an example of this within Splunk.

Splunk has an app called the Machine Learning Toolkit (MLTK) that enables machine learning capabilities on a Splunk instance. In fact, the Data Science and Deep Learning (DSDL) app depends on the MLTK app to work properly. The MLTK app comes with basic machine learning functionality and data sets on which you can experiment. One of the data sets is called firewall_traffic.csv:

You’ll notice a field within the firewall_traffic.csv data named “used_by_malware”, which assigns a “yes” or “no” value indicating whether the network connection is considered malicious. You’ll also notice other typical network connection fields within the data, such as bytes received, destination port, destination IP, etc. What we want to do is create an AI that takes in certain properties of this firewall data and learns those properties in order to decide whether new network connections are “used by malware” or not. I realize these are idealized firewall traffic logs, neatly labeled with fields you don’t see in real-world firewall traffic (“used_by_malware”, “has_known_vulnerability”, etc.), but they’re great for testing the supervised learning concept involving predictions.
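If you’d like to poke at the sample data outside of Splunk, here’s a minimal sketch using pandas. It assumes you’ve copied firewall_traffic.csv out of the MLTK app’s lookups directory into your local working directory:

```python
# Minimal sketch: inspect the MLTK sample data with pandas (assumes the CSV
# has been copied out of the MLTK app's lookups directory).
import pandas as pd

df = pd.read_csv("firewall_traffic.csv")

print(df.columns.tolist())                    # all available fields
print(df["used_by_malware"].value_counts())   # how the yes/no label is distributed
```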

First, we’ll split firewall_traffic.csv into two sets of data: training data and testing data. The training data is what we’ll use to teach the AI whether a connection is malicious or not. The testing data is what we’ll use to test what the AI has learned. The training data will contain the majority of the original firewall_traffic.csv data but will leave out the last 1000 entries; the testing data will contain those last 1000 entries. Of course, the AI will never see the testing data during training.
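Here’s a minimal sketch of that split using pandas, again working from a local copy of the CSV. The training filename is illustrative; the testing filename matches the firewall_traffic_testing_data.csv name used later in this post:

```python
import pandas as pd

df = pd.read_csv("firewall_traffic.csv")

# Hold out the last 1000 entries for testing; train on everything before them.
train_df = df.iloc[:-1000]
test_df = df.iloc[-1000:]

# Filenames are illustrative.
train_df.to_csv("firewall_traffic_training_data.csv", index=False)
test_df.to_csv("firewall_traffic_testing_data.csv", index=False)
```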

We will train the AI based on the following six fields:

bytes_received – if known malicious connections tend to involve a particular number of bytes on the receiving end of the connection (on either the source or destination side), we want to know about it

bytes_sent – if known malicious connections tend to involve a particular number of bytes on the sending end of the connection (on either the source or destination side), we want to know about it

dest_port – if malicious connections are known to hit certain ports on destination systems, we want to know about it

has_known_vulnerability – if the IP address in question has a known vulnerability (e.g. a buffer overflow vulnerability that can be targeted externally), we want to know about it

packets_received – if known malicious connections tend to involve a particular number of packets on the receiving end of the connection (on either the source or destination side), we want to know about it

packets_sent – if known malicious connections tend to involve a particular number of packets on the sending end of the connection (on either the source or destination side), we want to know about it

We are excluding a number of fields from training, including the source and destination IPs, source ports, session IDs, and timestamps, because these values change constantly and we don’t want them to influence how the AI learns what constitutes a malicious connection.
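As a rough sketch of how those six fields might be pulled out and encoded for training (assuming the yes/no fields are stored as the literal strings “yes” and “no” in the CSV):

```python
import pandas as pd

# The six fields we want the AI to learn from.
FEATURES = ["bytes_received", "bytes_sent", "dest_port",
            "has_known_vulnerability", "packets_received", "packets_sent"]

train_df = pd.read_csv("firewall_traffic_training_data.csv")

# Map yes/no fields to 1/0 so every input the network sees is numeric.
X_train = train_df[FEATURES].replace({"yes": 1, "no": 0}).astype(float)
y_train = train_df["used_by_malware"].replace({"yes": 1, "no": 0}).astype(float)
```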

Now that we have the fields on which we want to train the AI, we can create the AI to learn the properties of these fields. Fortunately, this AI will be trained on completely numerical values and will train fairly quickly. I will be using the same approach that I used in my previous post for creating an AI, only this time I’m using a simpler dense neural network for learning. After training the AI offline, implementing the apply() function within DSDL, and importing the AI into DSDL, the AI is ready to be tested.
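As a rough sketch of what such a dense network might look like in Keras, continuing from the encoding sketch above (the layer sizes, optimizer, and number of epochs here are illustrative choices, not the exact configuration I used):

```python
import tensorflow as tf

# Scale the inputs, since byte and packet counts vary over a huge range.
norm = tf.keras.layers.Normalization()
norm.adapt(X_train.values)

# A small dense network ending in a sigmoid, so each prediction lands in 0.0-1.0.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(len(FEATURES),)),
    norm,
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train.values, y_train.values, epochs=10, batch_size=64)
```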

Testing the AI will also be straightforward: we will run each entry in the firewall_traffic_testing_data.csv dataset through the AI and have it determine whether the entry is used by malware or not. The AI will assign each entry a score ranging from 0.0 to 1.0. If the score is over 0.50, we will say the AI determined the entry to be malicious; at 0.50 or lower, we will say the AI determined the entry to not be malicious. Because the AI is deciding between two classes (malicious or not), this is known as binary classification.
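Here’s a rough offline sketch of that test loop in Python. In the actual Splunk workflow this logic lives in the model’s apply() function inside DSDL, but the thresholding is the same:

```python
import pandas as pd

test_df = pd.read_csv("firewall_traffic_testing_data.csv")
X_test = test_df[FEATURES].replace({"yes": 1, "no": 0}).astype(float)
y_test = test_df["used_by_malware"].replace({"yes": 1, "no": 0}).astype(int)

# Each score falls between 0.0 and 1.0; anything above 0.50 is called malicious.
scores = model.predict(X_test.values).ravel()
predicted = (scores > 0.5).astype(int)

accuracy = (predicted == y_test.values).mean()
print(f"Accuracy on the held-out 1000 connections: {accuracy:.2%}")
```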

This is excellent: the AI is able to identify malicious connections based solely on the six fields we told it to learn.

I want to note that this AI can be improved. To start, I treated the dest_port field as a continuous value instead of a discrete one. This is incorrect, as you can’t have a fractional port number like 443.7 or 80.4. I can also experiment with including some of the fields I excluded earlier from the training set to see how much of a difference they make in how the AI trains. Keep in mind that this is idealized firewall data, and the majority of real-world firewall data is not as neatly labeled. Regardless, I hope this demonstrates the usefulness of supervised learning and predictions.
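For example, one way to treat dest_port as a discrete value is to one-hot encode the most common ports. This is only a sketch; the cutoff of 20 ports is an arbitrary choice:

```python
import pandas as pd

train_df = pd.read_csv("firewall_traffic_training_data.csv")

# Keep the 20 most common destination ports as their own columns; lump the rest
# into a single catch-all bucket so the feature count stays manageable.
top_ports = train_df["dest_port"].value_counts().nlargest(20).index
port_feature = train_df["dest_port"].where(train_df["dest_port"].isin(top_ports), other=-1)
port_dummies = pd.get_dummies(port_feature, prefix="dest_port")

# port_dummies can then replace the raw dest_port column in the training features.
```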

If you want to implement AI or machine learning within your Splunk environment, contact SmithSec to set up a consultation!
