
Cybersecurity in Cloud HPC: How to Prepare for the “When” and Not the “If”

Post by
Ernest de Leon

A lot has been made of the SolarWinds hack in the last several months. While we don’t yet know the entire depth and scope of the attack, we can look at what we know so far and start exploring ways to strengthen our cloud high-performance computing (HPC) security.


As mentioned in one of our most recent episodes of the Big Compute podcast, most of the areas we look at for traditional on-premises security also apply to the cloud. Let’s look at some of the key areas to focus your security attention on, given the recent increase in cyberattacks against nations and corporations.


Physical Security


I believe securing your physical systems is the most obvious step, yet it is often overlooked when it comes to cybersecurity. A lot of assumptions are made in this area, and we should always question assumptions.


With respect to cloud HPC, you should make sure that the underlying cloud service provider (CSP) has proper controls in place to manage physical access to data centers. Only employees should be allowed access, and there should be a valid business justification for that access. Furthermore, steps should be taken to ensure that these CSP employees are only able to access the specific areas within the data center they need to access, and not have access to the entire data center. 


There should also be controls in place to handle third-party vendors (think internet service providers, HVAC providers, etc.), and those controls should mandate an employee chaperone at all times. If you have additional concerns, such as working with ITAR data, you should ensure that access to those specific data centers (often called GovCloud) is only allowed for US persons as defined by ITAR.


Network Security


In the cloud, most network security revolves around virtual private clouds (VPCs) and the security rules we put in place around them. In AWS, these rules are called security groups; they are essentially firewalls that dictate which traffic is allowed to flow across which networks, defined by ports and protocols. The key is to ensure that you have properly segmented your application stack by VPC and/or security group, and that your perimeter firewalls (security groups) are configured properly. Lastly, you should use services like CloudWatch and CloudFormation drift detection to ensure that your security group rules are not modified without your knowledge.
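
As a rough illustration of auditing perimeter rules, the sketch below scans security group data for ingress rules that are open to the whole internet on ports you did not intend to expose. The sample data is hypothetical and hard-coded; it only mimics the shape of the "SecurityGroups" list returned by the EC2 DescribeSecurityGroups API. The group IDs and the allowlist of public ports are assumptions made for the example.

```python
# Minimal sketch: flag security group ingress rules that are open to the world.
# SAMPLE_GROUPS is hypothetical data shaped like an EC2 DescribeSecurityGroups
# response; in practice you would fetch the real rules from your CSP.
SAMPLE_GROUPS = [
    {
        "GroupId": "sg-web",  # hypothetical ID
        "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        ],
    },
    {
        "GroupId": "sg-compute",  # hypothetical ID
        "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
        ],
    },
]

WORLD_CIDRS = {"0.0.0.0/0", "::/0"}


def find_open_ingress(groups, allowed_ports=frozenset({443})):
    """Return (group_id, from_port, to_port) for world-open rules that
    expose any port outside allowed_ports (an assumed policy)."""
    findings = []
    for group in groups:
        for rule in group.get("IpPermissions", []):
            cidrs = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
            cidrs += [r.get("CidrIpv6") for r in rule.get("Ipv6Ranges", [])]
            if not any(c in WORLD_CIDRS for c in cidrs):
                continue  # rule is restricted to specific networks
            from_port, to_port = rule.get("FromPort"), rule.get("ToPort")
            if from_port is None or to_port is None:
                # "All traffic" rules carry no port range; always flag them.
                findings.append((group["GroupId"], None, None))
            elif any(p not in allowed_ports
                     for p in range(from_port, to_port + 1)):
                findings.append((group["GroupId"], from_port, to_port))
    return findings


for finding in find_open_ingress(SAMPLE_GROUPS):
    print("World-open ingress:", finding)
```

Running a check like this on a schedule, and comparing the result against a known-good baseline, is one simple way to notice rule changes you did not make.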


OS Security


In the cloud, OS security takes on some additional considerations. Much like on-premises systems, which are typically deployed by an automation system from an operating system “golden image,” cloud instances can be deployed from a wide variety of base images provided by your CSP and others.


The “and others” part of that last statement is of particular concern. You should always validate any base image that you use to spin up a cloud instance, ensuring both that you trust the source of that image and that your security team has scanned it for known vulnerabilities.


While it may not be practical to create your own base images for everything, that is the only way to ensure the base images have not been manipulated maliciously before you spin them up. The CSP-provided base images are generally safe, but always verify that with your security team. 


Application Security


Application security can be a bit confusing as the label is quite broad. 


From the perspective of the typical consumer of cloud HPC resources, this means that your cloud HPC platform provider is properly maintaining the images of the applications you run your workloads with. That includes making the most recent versions of the applications available and keeping any current “in production” images patched against security vulnerabilities, covering both the underlying OS and vendor-provided patches for the application software itself.


The last thing to ensure is that the applications themselves are not allowed to communicate in ways that are not authorized. For example, an application should not be allowed to connect to random remote IP addresses. Only known IP addresses, such as those of an NTP server or license server, should be allowed.
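
The allowlist idea can be sketched in a few lines. The example below is only an illustration of the decision logic, using Python's standard ipaddress module; the addresses for the NTP server and license servers are made up, and in a real environment the enforcement would live in your firewall or security group egress rules rather than in application code.

```python
import ipaddress

# Hypothetical allowlist: these addresses are assumptions for illustration.
ALLOWED_DESTINATIONS = [
    ipaddress.ip_network("10.0.0.123/32"),  # NTP server (assumed address)
    ipaddress.ip_network("10.0.5.0/28"),    # license servers (assumed range)
]


def is_egress_allowed(destination, allowed=ALLOWED_DESTINATIONS):
    """Return True only if the destination IP falls inside an allowed network."""
    addr = ipaddress.ip_address(destination)
    return any(addr in network for network in allowed)


for dest in ("10.0.0.123", "10.0.5.3", "203.0.113.7"):
    verdict = "allow" if is_egress_allowed(dest) else "deny"
    print(dest, "->", verdict)
```

The important design choice is the default: anything not explicitly listed is denied.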


User Security


In the cloud HPC space, all access is remote, so user security must be bulletproof. 


This is the first and last line of defense against malicious actors looking to commandeer control of an environment. There are two key areas to focus on here.


The first is that all user accounts are validated against an enterprise directory services provider (such as Microsoft Active Directory), or that you as the consumer of the resources are manually validating every user account in your workspace (not recommended). 


The second is that you have mandatory multi-factor authentication (MFA) enabled. Using hardware-based tokens is recommended, but software-based tokens will suffice. Better a software-based token than an SMS message or no MFA at all. 
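
To make the idea of a software-based token concrete, here is a minimal sketch of how time-based one-time passwords (TOTP, RFC 6238) are commonly computed, using only the Python standard library. It is meant to show the mechanism, not to replace a vetted MFA product; the function names and the one-step tolerance window are choices made for this example, and the shared secret shown is the published RFC test value, not something to use in production.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using SHA-1."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, unix_time: float, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password: HOTP over the current time step."""
    return hotp(secret, int(unix_time // step), digits)


def verify_totp(secret: bytes, submitted: str, unix_time: float,
                step: int = 30, window: int = 1) -> bool:
    """Accept codes from the current step plus/minus `window` steps
    (an assumed tolerance for clock drift)."""
    current = int(unix_time // step)
    return any(
        hmac.compare_digest(hotp(secret, max(0, current + delta)), submitted)
        for delta in range(-window, window + 1)
    )


# Shared secret from the RFC test vectors, used here purely for illustration.
SECRET = b"12345678901234567890"
print(totp(SECRET, 59))
```

Because the code depends on a secret held on the user's device and on the current time, a stolen password alone is not enough to log in.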

"VOLKSWAGEN: QUEL MILIONE BUTTATO IN PASSWORD DIMENTICATE" by automobileitalia is licensed under CC BY 2.0


File Security and Integrity


For file security, there are two areas to pay attention to. 


One is file security on cloud instances. If you are properly securing your OS as we discussed above, you are 90% of the way there. The remainder is to make sure that any sensitive data that resides on an instance for any amount of time is always encrypted (both in transit and at rest) and that only verified users have access to that data.


The second area of file security has to do with object stores such as AWS S3 and Azure Blob Storage. With object stores, you need to make sure that your object containers or buckets are not publicly accessible (unless that is intended), and that access control lists are properly configured. You should clearly specify which users are allowed access to any container or bucket, as well as which permissions they have on the objects in that container.
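
As one hedged example of what checking an access control list can look like, the sketch below inspects an ACL shaped like the response of the S3 GetBucketAcl API and reports any grants made to the predefined public groups. The sample ACL and its owner ID are hypothetical. Note that this only covers ACL grants; a bucket can also be exposed through its bucket policy, which this sketch does not examine.

```python
# Predefined S3 groups that make a grant effectively public.
ALL_USERS_URI = "http://acs.amazonaws.com/groups/global/AllUsers"
AUTHENTICATED_USERS_URI = (
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"
)
PUBLIC_GROUPS = {ALL_USERS_URI, AUTHENTICATED_USERS_URI}

# Hypothetical ACL, shaped like an S3 GetBucketAcl response.
SAMPLE_ACL = {
    "Grants": [
        {"Grantee": {"Type": "CanonicalUser", "ID": "owner-id-placeholder"},
         "Permission": "FULL_CONTROL"},
        {"Grantee": {"Type": "Group", "URI": ALL_USERS_URI},
         "Permission": "READ"},
    ]
}


def public_grants(acl):
    """Return (group_uri, permission) for every grant to a public group."""
    return [
        (grant["Grantee"]["URI"], grant["Permission"])
        for grant in acl.get("Grants", [])
        if grant.get("Grantee", {}).get("Type") == "Group"
        and grant["Grantee"].get("URI") in PUBLIC_GROUPS
    ]


for uri, permission in public_grants(SAMPLE_ACL):
    print("Public grant:", permission, "to", uri)
```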


For file integrity, the concern is whether the file is fully intact and has not been modified without your knowledge. There are several ways to verify this, such as hash validation, and cloud systems also provide logging around object stores that can notify you of such a breach in real time.
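
Hash validation can be sketched briefly. The example below assumes you recorded a SHA-256 digest of the file at a point when you trusted it (where you store that digest is up to you); later, you recompute the digest and compare. The function names are chosen for this illustration.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks
    so that large result files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Return True if the file's current digest matches the recorded one."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

A typical use is to call sha256_of_file when a file is written or uploaded, store the result somewhere the attacker cannot easily reach, and call verify_file before the file is used again.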


Logging and Monitoring


Logging and monitoring begin to tie cloud HPC security together. If you take each of the key areas we’ve discussed thus far in this article, you might ask yourself, “How do I know if any of the steps I’ve taken to secure my cloud HPC environment have been violated?”


This is where logging and monitoring come in. 


Logging takes the many events that happen across a complex system and stores them in a central place for you to analyze. These events are of various types, often cataloging the successes and failures of various system functions. 


Logging alone is not a robust strategy. You need to do something with those logs.

You can configure alerts based on them, such as receiving an alert when an unauthorized user has attempted to access a file in an object store. This is the monitoring side of the equation. AppDynamics (a part of Cisco) has a great metaphor for monitoring and logging. Monitoring is like the security alarm that alerts you to a possible intrusion. Log files are like the security camera footage that provides the evidence as to what happened and how. Both are necessary for a robust strategy in this key area. 
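
To illustrate the difference between having logs and acting on them, here is a small sketch that scans access-log events and raises an alert when a user accumulates repeated denied requests. The event format, user names, bucket name and threshold are all hypothetical; real CSP logs have their own schemas, and managed alerting services would normally do this job.

```python
from collections import Counter

# Hypothetical, simplified object-store access events.
SAMPLE_EVENTS = [
    {"user": "alice", "action": "GetObject", "bucket": "results", "status": 200},
    {"user": "mallory", "action": "GetObject", "bucket": "results", "status": 403},
    {"user": "bob", "action": "GetObject", "bucket": "results", "status": 403},
    {"user": "mallory", "action": "ListBucket", "bucket": "results", "status": 403},
    {"user": "mallory", "action": "GetObject", "bucket": "results", "status": 403},
]


def denied_access_alerts(events, threshold=3):
    """Return the users with at least `threshold` denied (HTTP 403) requests."""
    denied = Counter(e["user"] for e in events if e["status"] == 403)
    return sorted(user for user, count in denied.items() if count >= threshold)


for user in denied_access_alerts(SAMPLE_EVENTS):
    print("ALERT: repeated denied access attempts by", user)
```

The logs remain the evidence; the alert is what tells you to go and look at them.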


Redundancy and Backups


We can take every precaution known to the human species to prevent and mitigate an attack, but as evidenced by the SolarWinds fiasco, sometimes that is not enough. 

"SolarWinds sign" by sfoskett is licensed under CC BY-NC-SA 2.0


So when the worst happens, what do you do? 


This is the point in my typical security conversations where I pivot to business continuity (BC). You can see the uncomfortable faces around me when I begin this part of the discussion because the premise here is that the worst has happened. 


In the cybersecurity world, we do not stick our heads in the sand and hope nothing bad happens. We know better. It is not a matter of if, but when you will be hacked.


Redundancy ensures that no single component failure takes your entire system down.


Understand that complex systems are built of many components, and there are often subcomponents of larger components. So something as small as a single compute instance can be a component, but something as huge as an entire CSP region can be a component of a larger global server load balanced (GSLB) system. It is vital that every component of your system has a redundant copy that can be put into service rapidly to prevent a degradation in service to your users. 


Backups are for when the worst of the worst happens. 


Consider all of the recent ransomware attacks you’ve read about happening to schools, hospitals and even large multinational corporations. These kinds of attacks render data, arguably the most important part of any complex system, unusable. A hacker could gain access to your system and proceed to delete all of your critical data for fun or profit. You cannot assume that your data will be safe if there is only a single copy. Make sure you have at least one copy of your critical data that you can recover from rapidly should your system be taken offline. 


You should also consider a second backup copy (a third copy overall) that is stored with a different cloud provider in the event that the CSP itself is hacked. Costs can become significant when you’re talking about two or three copies of a lot of data, so you need to do a risk assessment and cost analysis to determine what you are willing to do in the event the worst happens.


Keep Your Business Going


All of these areas are vitally important. Some, like file integrity, redundancy and backups, carry a slightly higher priority as they allow you to recover from a breach and maintain business continuity.


One thing that the SolarWinds hack taught us is that no matter how many precautions you take from a cybersecurity perspective, breaches still can and will happen. The best thing you can do is be prepared by covering the bases we discussed in this article. At the very least, your business can remain operational, despite a breach. 

Ernest de Leon is a futurist and technologist who loves to be at the intersection of technology and the human condition. A longtime cybersecurity leader, Ernest also has deep interests in artificial intelligence and theoretical physics. He spends his free time in remote places only accessible by a Jeep. He currently works as Director of Security and Compliance at Rescale, and is a host on the Big Compute Podcast.


* Header image attribution: "Symposium Cisco Ecole Polytechnique 9-10 April 2018 Artificial Intelligence & Cybersecurity" by Ecole polytechnique / Paris / France is licensed under CC BY-SA 2.0