Use Big Data Techniques to Track Threats to Your Corporate Data

Author: Dale Kim, Sr. Director of Industry Solutions, MapR Technologies

We know that protecting corporate data is not easy, but many organizations aren’t doing themselves any favors by being lax with cybersecurity. What’s worse is that they often don’t realize they are being lax. So it’s no wonder we continually hear about new breaches – recent reports pertain to exploits against Apache Hadoop, MongoDB, and Elasticsearch installations – and we will continue to hear about more if basic steps aren’t universally taken. In the aforementioned breaches, the victims did the online equivalent of leaving their front door open with nobody home. Simple protective measures could possibly have minimized damage from these attacks. Interestingly, in the case of the Hadoop attacks, the hackers were seemingly using a tough love approach to tell the world to protect its data.

The path to a more secure environment starts with awareness and a good amount of diligence. It is too easy to think that network breaches only happen to others, or that the risk is so small that only a minimal security effort is warranted. The problem with that stance is that the potential damage from a breach is so great that it more than justifies the diligence required to secure your environment properly.

So now that the recent news has gotten our attention (again), let’s remind ourselves to take basic steps like enabling the security mechanisms in our software, changing default passwords, and using a firewall to block ports that don’t need external internet access. These are significant steps forward, but they protect you from only a certain set of attacks, and other attack patterns keep emerging. For example, even when seemingly sufficient controls are in place, software vulnerabilities can leave you exposed. Modern cyber threats are constantly evolving to get around the best practices that address well-known attacks.

What else, then, should you do? Monitoring internal activities is important. While it isn’t an absolute preventive measure, seeing a potential attack in progress can help you to respond and defeat the attack. This surveillance is especially important in cases where legitimate access paths to data are compromised via social engineering or by unwitting employees. So if an attacker obtains legitimate credentials to access your data, you need analytics to discover that breach.

There are known behaviors that should raise red flags, and many systems today are well suited to alerting the security team to such activities. For example, an extraordinarily large number of data accesses by a given user in a short period of time might represent data theft in progress. Unfortunately, sophisticated attackers often know what security professionals are looking for, so they change their methods to obscure their attack patterns. More advanced analytical capabilities can help in this situation. Collecting and analyzing behavior data can help to identify non-obvious and previously unknown cyber threats.
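As a rough illustration of the "large set of data accesses in a short period" red flag, here is a minimal Python sketch of a sliding-window check. The event format (timestamp in seconds, user name), the function name flag_access_bursts, and the default thresholds are all assumptions made for this example; a real deployment would read from your audit logs and tune the limits to your environment.

```python
from collections import defaultdict, deque

def flag_access_bursts(events, window_seconds=60, max_accesses=100):
    """Return {user: timestamp} for the first moment each user exceeded
    max_accesses within any window of window_seconds.

    events: iterable of (timestamp_in_seconds, user) tuples (hypothetical format).
    """
    recent = defaultdict(deque)   # per-user timestamps still inside the window
    flagged = {}
    for ts, user in sorted(events):
        window = recent[user]
        window.append(ts)
        # Drop accesses that have aged out of the window.
        while window and ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_accesses and user not in flagged:
            flagged[user] = ts
    return flagged

# Hypothetical audit events: alice reads five records in five seconds,
# bob reads four records spread over a minute.
events = [(t, "alice") for t in range(5)] + [(t, "bob") for t in (0, 20, 40, 60)]
print(flag_access_bursts(events, window_seconds=10, max_accesses=3))
```

A fixed threshold like this is exactly the kind of rule a sophisticated attacker can stay under, which is why it is only a starting point.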

One of the challenges of analyzing data access patterns within your firewall is the amount of auditing, server logging, networking, and clickstream data that gets generated. Typically, you want to track all activities in your data platform, including administrative activities, which leads to a large and continually growing volume of data. Another challenge is filtering that data so that suspicious behavior can be separated from normal behavior. You don’t want to chase every mildly suspicious activity that turns out to be a false positive, or you will waste a lot of effort for no gain. On the other hand, you don’t want false negatives to let real threats slip by, or you might face a cyber threat disaster. Ongoing analysis helps you establish confidence levels that distinguish what is very likely a threat from what is not.
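The trade-off between false positives and false negatives can be made concrete with a small Python sketch. It assumes each activity already has a suspicion score between 0 and 1 and that a sample of activities has been labeled after investigation; the scores, labels, and function names below are invented for illustration.

```python
def confusion_counts(scores, labels, threshold):
    """Count outcomes when every score >= threshold raises an alert.

    labels[i] is True when activity i was later confirmed as a real threat.
    Returns (true_pos, false_pos, false_neg, true_neg).
    """
    tp = fp = fn = tn = 0
    for score, is_threat in zip(scores, labels):
        alerted = score >= threshold
        if alerted and is_threat:
            tp += 1
        elif alerted and not is_threat:
            fp += 1          # wasted investigation effort
        elif is_threat:
            fn += 1          # missed threat
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(scores, labels, threshold):
    """Precision: share of alerts that were real. Recall: share of threats caught.
    Either value is reported as 0.0 when its denominator is zero."""
    tp, fp, fn, _ = confusion_counts(scores, labels, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical scored activities and their investigated outcomes.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [True, False, True, False, False]
for threshold in (0.85, 0.5, 0.35):
    print(threshold, precision_recall(scores, labels, threshold))
```

Raising the threshold cuts false positives at the cost of missed threats, and lowering it does the reverse, so the threshold is best revisited as more labeled outcomes accumulate.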

Machine learning technologies are growing in popularity as a way to identify anomalous behavior that requires further investigation. Open source tools, such as those in the Apache Hadoop and Apache Spark ecosystems, as well as commercial packages, are helpful for deploying machine learning environments. Other interesting trends in big data can help facilitate the architecture of large-scale cybersecurity systems. For example, the use of microservices complements machine learning environments very well. Modular, componentized data pipelines built on microservices can allow numerous machine learning models to be run in parallel to identify the most effective results. Microservices also enable agility and easier scaling to handle the challenges of analyzing a rapidly growing big data deployment.
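The idea of running several models side by side and keeping the most effective one can be sketched in a few lines of Python. To stay self-contained, this example substitutes two simple statistical outlier detectors (a z-score rule and a median-absolute-deviation rule) for real machine learning models, and a thread pool stands in for independent microservices. The data, cutoffs, and function names are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
import statistics

def zscore_detector(values, cutoff=2.0):
    """Flag values more than `cutoff` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [False] * len(values)
    return [abs(v - mean) / stdev > cutoff for v in values]

def mad_detector(values, cutoff=3.5):
    """Flag values using the median absolute deviation (robust to outliers)."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - median) / mad > cutoff for v in values]

def f1_score(predicted, truth):
    """Harmonic mean of precision and recall; 0.0 when nothing is flagged or true."""
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(t and not p for p, t in zip(predicted, truth))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def evaluate_in_parallel(detectors, values, truth):
    """Run every detector concurrently and score each against labeled outcomes."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, values) for name, fn in detectors.items()}
        return {name: f1_score(f.result(), truth) for name, f in futures.items()}

# Hypothetical per-user access counts; the two large values are known incidents.
counts = [10, 12, 11, 9, 10, 11, 500, 10, 12, 300]
truth = [c >= 300 for c in counts]
results = evaluate_in_parallel(
    {"zscore": zscore_detector, "mad": mad_detector}, counts, truth
)
best = max(results, key=results.get)
print(results, "best:", best)
```

On this sample the z-score rule misses the smaller incident because the larger one inflates the standard deviation, while the median-based rule catches both. In a production pipeline, each detector would more plausibly be its own service consuming the same event stream, so models can be added, compared, and retired independently.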

Cybersecurity requires a comprehensive approach: take the easy steps first, and then add larger-scale analysis of data access patterns to help detect and stop other types of attacks. It’s often too easy to dismiss security practices because the risk of cyber intrusions appears low, but as long as we continue to hear about new attacks, we should all treat cybersecurity as a priority for our corporate data.

Dale Kim is the Sr. Director of Industry Solutions at MapR. His background includes a variety of technical and management roles at information technology companies. While his experience includes work with relational databases, much of his career pertains to non-relational data in the areas of search, content management, and NoSQL, and includes senior roles in technical marketing, sales engineering, and support engineering. Dale holds an MBA from Santa Clara University, and a BA in Computer Science from the University of California, Berkeley.