Cyber attackers are targeting manufacturing and critical infrastructure entities more heavily, and manufacturing is now the most targeted industry. Vulnerabilities in OT systems continue to expand: CISA’s ICS Advisories added another 1,200 CVEs in 2022 alone. Insurance companies are requiring separate OT cyber applications, which in turn require active OT systems management such as vulnerability and patch management, backups, etc. And as industrial organizations increase their investments in cyber security, the available resources become scarcer and more expensive.
More and more organizations are looking to external resources to support their OT cyber security journey.
In 2023, Verve celebrates its 30th anniversary. Founded as an automation services company, providing vendor-agnostic solutions and services to companies across various industrial sectors, Verve continues its heritage today supporting industrial organizations in the security and reliability of their OT systems.
What is OT Managed Security Services?
Managed Security Services Provider (MSSP) is a term initially developed for IT security. Gartner defines an MSSP as an organization that “provides outsourced monitoring and management of security devices and systems. Common services include managed firewall, intrusion detection, virtual private network, vulnerability scanning and anti-viral services…. (their goal is to) reduce the number of operational security personnel an enterprise needs to hire, train and retain to maintain an acceptable security posture.”
MSSP evolved from the concept of an IT “MSP” or Managed Service Provider that focuses on broader outsourced IT Systems Management. These firms grew over the past 20 years to scale IT resources for device management: configurations, applications, patching, backup & restore, etc. There was a natural extension of this outsourcing to the heavy burden of security. So in IT, the term MSSP has often been reduced to real-time Security Operations Center (SOC) services monitoring Endpoint Detection & Response (EDR) and other threat detection solutions because the rest of the IT Systems Management functions have been absorbed within the traditional “MSP” outsourced agreements.
OT, however, is a different world. OT systems have not had a thirty-year history of active “systems management”. We define systems management as the task of regularly maintaining computer technology systems. In OT, these devices certainly include the more “IT-like” systems such as servers, workstations, HMIs, switches, routers, etc. But OT also includes the huge range of OT-specific devices such as PLCs, controllers, relays, sensors, drives, robots, CNC machines, printers, labelers, building controls, etc. Other than perhaps OEM/contractor support for basic reliability maintenance of these assets, the historical mindset in operations is “if it ain’t broke, don’t fix it”. As a result, these legacy devices may remain unmanaged for years. This is true not only for the OT-specific devices; in most cases this perspective extends to the “IT-like” equipment as well, because that equipment is core to the operation of the process.
Effective cyber security builds on these foundations of systems management. Monitoring for threats in an unmanaged environment is a fool’s errand. Insecure user accounts and configurations, unpatched systems, misconfigured firewalls, etc. can all make monitoring ineffective at best or misleading at worst, creating thousands of false alerts and noise.
Therefore, OT Managed Security Services needs to encompass not only the SOC threat detection services, but also all of those traditional IT Systems Management functions that have been in place in IT for the past thirty years. This includes activities such as:
- Maintenance of accurate asset inventory
- Vulnerability and patch management
- Maintenance of EOL devices
- Network device uptime and configuration maintenance
- Backup and restoration management
- Application whitelisting maintenance
- Deployment of new devices with appropriate configuration
- Management of appropriate user and account access control
This is in addition to the SOC services around monitoring for alerts and responding.
As mentioned, Verve has provided these types of services for 30 years. To determine how to approach the OT Managed Security Services strategy – including whether outsourcing all or parts of this function is appropriate – an organization should consider 3 key questions based on this experience.
- What are the security risks in the OT environment?
- Therefore, what are the key priorities for security remediation and maintenance?
- What is the best mix of internal and external resources to efficiently address these risks?
Security risks in OT
Over the past 15 years, Verve conducted hundreds of technology-enabled vulnerability assessments using the Verve Security Center to gather accurate and detailed risk perspectives from OT systems. The short summary of these assessments is that these environments are not managed for security. Perhaps this should not be surprising given the historical focus on operational reliability, uptime, legacy systems running until they stop, etc. Further, I&C technicians and others involved in plant operations are not normally trained in IT security.
Finally, in almost all cases, these devices are not managed by IT systems management functions due to their different operational nature. As a result, the current state of these systems is that they contain risks at multiple levels: critical vulnerabilities and unpatched systems, EOL devices, insecure configurations, lack of updated anti-malware, lack of effective network segmentation, unmanaged remote access connections, etc.
Obviously, the above is not specific to any individual organization’s environment. But it is a very representative picture based on more than a decade of conducting these assessments.
Most organizations have multiple plants, sites, mills, etc., and each of those will have slightly different variations of the above risks. Perhaps one site has a controls technician who is more focused on secure environments and maintains the AV signatures. Others may have contracts with OEMs that actively patch the OEM software on their HMIs and servers. Still others may have regulatory requirements that encourage more, or less, active systems management.
The key is that the organization understands the risks at a site-by-site level and also develops an overall enterprise picture of the risks. These risks form the basis for the roadmap of remediation requirements.
The next critical step is to prioritize these risks. When a plant has thousands of critical vulnerabilities, no backups, weak segmentation, etc., determining the priorities can be challenging. Verve works with clients to create a risk score to narrow down the huge number of risks to those with the greatest threat to the process. This includes understanding the specific device criticality and blending that with the risk scores of the various vulnerabilities and risks on the asset/network itself.
The below diagram is a simplified graphic of how the Verve Security Center platform aggregates data across different risk categories and asset criticality metrics to define the greatest risks to the environment. Regardless of whether an organization uses Verve, it needs some manner to consolidate data to be able to prioritize risks effectively to begin remediation.
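The blending of device criticality with vulnerability risk described above can be illustrated with a minimal sketch. This is not how the Verve Security Center computes its scores; the field names, the 1-5 criticality scale, and the penalty weights are all assumptions chosen only to show the idea of combining asset criticality with technical exposure.

```python
from dataclasses import dataclass


@dataclass
class AssetRisk:
    """Hypothetical per-asset record; all fields are illustrative assumptions."""
    name: str
    criticality: int   # assumed scale: 1 (low process impact) to 5 (safety/production critical)
    max_cvss: float    # highest CVSS base score among open vulnerabilities, 0.0 to 10.0
    has_backup: bool   # a known-good backup exists
    segmented: bool    # asset sits behind appropriate network segmentation


def risk_score(asset: AssetRisk) -> float:
    """Blend technical exposure with asset criticality into a 0-100 score."""
    exposure = asset.max_cvss / 10.0               # normalize CVSS to 0..1
    if not asset.has_backup:
        exposure = min(1.0, exposure + 0.15)       # illustrative penalty, not a standard
    if not asset.segmented:
        exposure = min(1.0, exposure + 0.15)       # illustrative penalty, not a standard
    return round(100 * exposure * (asset.criticality / 5), 1)


def prioritize(assets):
    """Return assets ordered from highest to lowest blended risk."""
    return sorted(assets, key=risk_score, reverse=True)
```

The point of the sketch is that the same critical vulnerability produces a very different priority on a safety controller than on a label printer, which is what allows thousands of findings to be narrowed to a workable list.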
Developing the remediation and maintenance roadmap
Once the organization defines the risks and priorities, it should develop a roadmap of remediation and maintenance activities. The good news is that there is a range of resources an organization can use to build that roadmap.
The United States Cybersecurity and Infrastructure Security Agency (CISA) provides a recommended set of security measures for critical infrastructure. The below graphic is a summary of those recommendations.
As one example, the joint advisory that DOE, CISA, NSA, and the FBI issued in April 2022 on APT cyber tools targeting ICS/SCADA devices focused on just such foundational elements:
“DOE, CISA, NSA, and the FBI recommend all organizations with ICS/SCADA devices implement the following proactive mitigations:
- Isolate ICS/SCADA systems and networks from corporate and internet networks using strong perimeter controls, and limit any communications entering or leaving ICS/SCADA perimeters.
- Enforce multifactor authentication for all remote access to ICS networks and devices whenever possible.
- Have a cyber incident response plan, and exercise it regularly with stakeholders in IT, cybersecurity, and operations.
- Change all passwords to ICS/SCADA devices and systems on a consistent schedule, especially all default passwords, to device-unique strong passwords to mitigate password brute force attacks and to give defender monitoring systems opportunities to detect common attacks.
- Ensure OPC UA security is correctly configured with application authentication enabled and explicit trust lists.
- Ensure the OPC UA certificate private keys and user passwords are stored securely.
- Maintain known-good offline backups for faster recovery upon a disruptive attack, and conduct hashing and integrity checks on firmware and controller configuration files to ensure validity of those backups.
- Limit ICS/SCADA systems’ network connections to only specifically allowed management and engineering workstations.
- Robustly protect management systems by configuring Device Guard, Credential Guard, and Hypervisor Code Integrity (HVCI). Install Endpoint Detection and Response (EDR) solutions on these subnets and ensure strong anti-virus file reputation settings are configured.
- Implement robust log collection and retention from ICS/SCADA systems and management subnets.
- Leverage a continuous OT monitoring solution to alert on malicious indicators and behaviors, watching internal systems and communications for known hostile actions and lateral movement. For enhanced network visibility to potentially identify abnormal traffic, consider using CISA’s open-source Industrial Control Systems Network Protocol Parsers (ICSNPP).
- Ensure all applications are only installed when necessary for operation.
- Enforce principle of least privilege. Only use admin accounts when required for tasks, such as installing software updates.
- Investigate symptoms of a denial of service or connection severing, which exhibit as delays in communications processing, loss of function requiring a reboot, and delayed actions to operator comments as signs of potential malicious activity.
- Monitor systems for loading of unusual drivers, especially for ASRock driver if no ASRock driver is normally used on the system.”
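One of the mitigations above, conducting hashing and integrity checks on backups of firmware and controller configuration files, can be sketched in a few lines. This is an illustrative approach using only the Python standard library, not a procedure taken from the advisory; the function names and the manifest format (relative file path mapped to SHA-256 digest) are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large firmware images do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir: Path) -> dict[str, str]:
    """Record a digest for every file under the backup directory."""
    return {
        str(p.relative_to(backup_dir)): sha256_of(p)
        for p in sorted(backup_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(backup_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Compare the backup directory against a previously recorded manifest."""
    problems = []
    current = build_manifest(backup_dir)
    for name, expected in manifest.items():
        actual = current.get(name)
        if actual is None:
            problems.append(f"missing: {name}")
        elif actual != expected:
            problems.append(f"changed: {name}")
    for name in sorted(current.keys() - manifest.keys()):
        problems.append(f"unexpected: {name}")
    return problems
```

In practice the manifest would be created when a known-good backup is taken and stored separately from the backup itself, so that an attacker who alters the backup cannot also alter the recorded digests. An empty list from `verify_manifest` means the backup still matches what was recorded.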
As you review these recommendations and the risks Verve found in the assessments, the clear focus is on OT Systems Management (OTSM). OTSM is similar to what IT has been doing for years on office systems – patching, vulnerability management, configuration management, access control, network segmentation and device management, backup and restore, etc. It certainly also includes monitoring for potential active threats via logs and network monitoring. But as the CISO of a leading specialty chemicals provider said to us recently, “We have to focus on securing the fundamentals of vulnerability, patch, user account management, etc. and not just rely on monitoring for threats after they get through. Monitoring is too uncertain for us to rely on.”
Often we hear from analysts or other vendors that “OT security = network monitoring”. But the recommendations of CISA, and of our IT colleagues, clearly call this into question. Instead there needs to be a focus on core identification, protection, response and recovery elements in OT. This focus on fundamental protections should not surprise us. The NIST CSF, the most followed OT security standard, calls out five core functions of security: Identify, Protect, Detect, Respond, and Recover.
When you consider the NIST CSF, most of the activities focus on these “systems management” functions: maintaining accurate inventory even in complex networks, gathering users and accounts and locking them down, updating patching and hardening configurations, ensuring anti-malware solutions are up to date, etc.
This strongly aligns with our assessment findings from the past fifteen years. If you look at the risk analysis chart shown above, these are things some refer to as “cyber hygiene”. A roadmap could look like the below:
Managed Security Service Design
The roadmap establishes the types of remediation activities and the sequencing of those tasks. The next question is how to develop the right organization to manage these OT security tasks. Based on our experience, this comes down to 3 things.
- What types of skills/capabilities does the organization need to effect the remediation and management?
- What is the best organization structure to drive efficiency, but also to ensure safety and reliability of the industrial process?
- How should the organization manage internal and potential external resources to deliver on the remediation and maintenance requirements?
Types of skills and capabilities
When one reviews CISA recommendations, NIST CSF requirements, or Verve’s assessment findings and the importance of “systems management” functions, it is not surprising that more than ¾ of all jobs currently posted for cyber security personnel are focused on these “systems management” tasks, rather than the “advanced analytics” that get most of the press. Cyber security is about fundamentals and operations.
The above data from the NICE Cyberseek database is mostly from IT security roles. So as we move into the OT environment where systems management is much further behind, the focus on OT systems management functions and skills is even more important.
As organizations consider the types of people needed for effective remediation, focusing on finding personnel with the above skill sets is key. The good news is that these skills often exist within IT teams. The bad news is that these teams are usually already overwhelmed with IT requirements, and these OT systems management requirements become additional work in no one’s budget or resource plans. Additionally, the management of OT systems requires a sensitivity to the processes this hardware and software controls. We have all heard of IT personnel approaching patching or vulnerability management as if these devices were an enterprise laptop and tripping plants or worse. Therefore, the organization needs to balance IT-type skills with OT knowledge.
Right organizational model
This balance of IT & OT knowledge leads to an organization model that integrates these capabilities to deliver the required efficiency with the necessary sensitivity to OT. Over the past fifteen years, Verve has developed a concept we call “Think Global:Act Local”. This model allows the organization to consolidate planning, prioritization, change management and other tasks into a central team that can scale across the whole enterprise, while ensuring that the actions that can impact OT operations are distributed to those best able to judge the timing and approach used to update, harden, and generally manage these systems.
The “Think Global” nature of this is similar to the concept of Managed Security Services in a SOC (Security Operations Center). But in the case of OT, the tasks conducted by this central team are broader than monitoring for real-time threats and alerts in a SIEM or through EDR. Because of the unmanaged state of the systems, this central team also prioritizes the remediation and maintenance of the “cyber hygiene” actions, from patching to configuration hardening to software management. In most industrial environments the site-level personnel do not have the skills, or more importantly the time, to analyze all of the potential risks on an ongoing basis. They often lack the experience to prioritize which of these risks is most critical to remediate first, second or third. Further, even if they had it, relying on local resources would likely result in highly varied approaches and maturity.
A central platform that aggregates all the site-level risk data from distributed operations allows a small central team to be trained on how to prioritize risks within your enterprise. The local knowledge is certainly important, and a platform can aggregate information from each site such as the asset criticality, process impacted, etc. But then the “Think Global” team can analyze these and create “playbooks” for remediation. This could include a broad network segmentation effort all the way down to applying specific patches on specific systems. This drives efficiency in scale and effectiveness in skills and consistent prioritization. It also allows for monitoring that the remediation actions actually occur.
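As a rough illustration of how a central team might turn aggregated site-level findings into enterprise “playbooks”, the sketch below groups findings by issue and ranks the resulting playbooks. The record layout, the sample data, and the ranking rule (highest asset criticality first, then number of affected assets) are assumptions for illustration, not a description of any particular platform.

```python
from collections import defaultdict

# Hypothetical findings as a central platform might export them.
findings = [
    {"site": "Plant A", "asset": "HMI-01", "issue": "unsupported OS", "criticality": 4},
    {"site": "Plant A", "asset": "ENG-WS-02", "issue": "default password", "criticality": 5},
    {"site": "Plant B", "asset": "HMI-07", "issue": "unsupported OS", "criticality": 3},
    {"site": "Plant C", "asset": "PLC-GW-01", "issue": "default password", "criticality": 5},
    {"site": "Plant C", "asset": "HIST-01", "issue": "no recent backup", "criticality": 4},
    {"site": "Plant B", "asset": "ENG-WS-09", "issue": "default password", "criticality": 2},
]


def build_playbooks(findings):
    """Group findings by issue so one remediation plan can cover many sites."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding["issue"]].append(finding)

    playbooks = []
    for issue, items in grouped.items():
        playbooks.append({
            "issue": issue,
            "sites": sorted({f["site"] for f in items}),
            "assets": len(items),
            "max_criticality": max(f["criticality"] for f in items),
        })

    # Rank by the most critical affected asset, then by how widespread the issue is.
    playbooks.sort(key=lambda p: (p["max_criticality"], p["assets"]), reverse=True)
    return playbooks


for playbook in build_playbooks(findings):
    print(playbook)
```

Grouping by issue rather than by site is the design choice that reflects the “Think Global” idea: the plan is written once centrally, while the list of sites tells the team who must execute it locally.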
However, the execution of those actions in this model relies on the “local” expertise of process controls knowledge. “Local” does not necessarily mean “site-level”. It really refers to the “local knowledge” of the process. This could be managed by a regional SME as shown above or through a control center. But the key is that when actions to manage devices occur, a person who understands the potential impact of those actions on the process is involved before any action is taken.
This “Act Local” element ensures that patches are tested and deployed only at appropriate times, and that incident response actions that could disrupt operations are not taken unless the incident is absolutely critical. It ensures that when users and accounts are managed, critical service accounts that only run every six months, or only when a redundant controller has to come online, are not mistakenly disabled or removed. It brings IT & OT together into a partnership to ensure scale and safety/reliability.
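The service account concern can be made concrete with a small sketch. A typical IT-style cleanup disables any account unused for some number of days; in OT, an account that runs only when a redundant controller comes online would look stale under that rule. The example below assumes a hypothetical `protected` flag set by someone with local process knowledge, and routes such accounts to review rather than disabling them. The account names, the 90-day threshold, and the record format are illustrative assumptions.

```python
from datetime import date, timedelta


def review_accounts(accounts, today, stale_after_days=90):
    """Split stale accounts into those safe to disable and those needing local review."""
    cutoff = today - timedelta(days=stale_after_days)
    disable, review = [], []
    for acct in accounts:
        if acct["last_used"] >= cutoff:
            continue                      # recently used, nothing to do
        if acct.get("protected"):
            review.append(acct["name"])   # flagged by local OT staff: never auto-disable
        else:
            disable.append(acct["name"])
    return disable, review


# Hypothetical sample data.
accounts = [
    {"name": "operator1", "last_used": date(2023, 5, 20), "protected": False},
    {"name": "contractor_tmp", "last_used": date(2022, 11, 1), "protected": False},
    {"name": "svc_redundant_ctrl", "last_used": date(2022, 12, 15), "protected": True},
]

to_disable, to_review = review_accounts(accounts, today=date(2023, 6, 1))
print("disable:", to_disable)
print("review with local SME:", to_review)
```

The central team can run a check like this at scale, but the decision on anything in the review list stays with the person who understands the process.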
Right balance of internal and external resources
The last question is how to address the required resources to deliver on the remediation and maintenance of an organization’s OT security. The NICE Cyberseek database referenced above highlights the huge gap in available cyber security resources. As of the start of 2023, there were 750,000+ open cyber security positions in the United States alone vs. current employment estimates of approximately 1.1 million. This is a huge gap: the 750,000 openings represent roughly 40% of the approximately 1.85 million total positions (filled plus open), so the nation is missing about 40% of the employees needed to secure the environment based on actual job openings. And this data is largely focused on IT cybersecurity, where an understanding of OT is not required.
In OT security, even with the Think Global:Act Local model identified above, the team needs to have an understanding of OT systems at both the global and local level. Yes, TG:AL allows scaling of those resources at an enterprise level, but OT skills are still required.
Knowledgeable resources are still the number one challenge for OT security leaders. Both the 2022 KPMG-(CS)²AI survey of OT security shown below and our own experience with clients point to resourcing as the biggest challenge.
So, how does an organization think about internal and external resources in OT cyber security?
Based on our experience, the structure of internal and external resources for OT security comes down to a set of factors that each organization needs to address.
- Ability to recruit, train and retain
- Mix of “one time” and ongoing maintenance
- The availability of OT resources (both site and central) that could be trained in security, or of IT security/systems management resources that could be trained in OT (this may drive the need to bridge security and operations systems management support)
- The degree of distributed operational environments
- The necessity of coordination with OEM vendors
- Technology strategy to scale and integrate across internal and external teams
Recruit, train and retain
One of the biggest challenges in OT cyber is the ability to continually recruit, train and retain team members. In many cases, as an organization tries to scale its OT security team, the pace of turnover in its ranks creates significant challenges to maintaining results. Given the above data about the resource challenges, recruiting new personnel is difficult, often requiring months to find new skilled talent. We have seen case after case of an organization hiring or assigning a group of OT cyber staff, then beginning to train them on their OT systems and remediation strategies, and within a couple of years half of the team has taken new roles inside the organization, been recruited away, not met the performance requirements of the role, etc. In some cases, within 24 months only 20-30% of the team is still there and performing well. The reality is that the organization is continually in a recruiting, training and retaining process…and in many cases losing the battle.
One-time vs. maintenance
In almost every OT cybersecurity journey, there is a significant one-time remediation “catch up” or hardening that must occur. This can include network segmentation, patching catch up, application software clean up, hardening configurations, etc. Although these tasks and systems will need to be maintained over time, there is a significant spike in the resources required to achieve these things. In many cases, it does not make sense to scale up resources to do all of these tasks only to reduce them over time. Further, in most cases it is almost impossible to scale up quickly enough to address the risks to the satisfaction of boards and management teams once they see the risk profile. Therefore, we see most organizations leverage third party resources to conduct these “catch up” or “hardening” activities even if they plan to manage the maintenance internally.
Availability of current resources & need to bridge security and operations support
In some organizations, there are personnel who have the knowledge to step out of their OT operational functions or IT security functions to support the OT systems management required. Although this is more the exception than the rule given headcount challenges, where feasible, this is a great way to get started quickly.
What Verve typically finds is that in most organizations, the OT staff does not have the systems management knowledge required to upgrade, monitor and maintain security over time. For instance, in many cases network segmentation is necessary to protect the environment. This usually means putting in new firewalls, switches and in many cases new fiber infrastructure to connect systems to these new process controls networks. New software is installed to manage application whitelisting, backups, etc. Once these systems are hardened, questions emerge almost immediately: operations personnel want to add applications to the whitelist, or operational issues arise and the site personnel can’t determine whether the cause is the new network architecture and systems or something related to the OT hardware. Or, as new alerts emerge from security monitoring systems, questions emerge as to whether an alert is truly a security risk or an operational anomaly. As a result, the systems security management tasks merge with operational questions.
In our experience, it is key that the people responding to the security topics have deep knowledge of the OT environment in which these controls are deployed. Essentially, the cyber security resources end up acting as operational system support resources, because determining whether something is a security issue, or an issue with security software or hardware, requires operational troubleshooting.
Degree of geographic or process distribution
Most industrial environments include geographically distributed environments, whether they be pipelines, multiple manufacturing locations, electrical distribution, etc. This distributed nature increases the resource challenges relative to IT environments, which are often managed centrally. In OT, due to the process implications of any security change, the systems management must have a “local” component to it, whether that be regarding timing of change, testing of updates, or local process troubleshooting.
We have found that two elements are key to success in these environments: a) local knowledge of the systems deployed when triaging risks and threats, and b) the ability to deploy locally to solve an issue. Earlier in this paper, we highlighted the idea of a regional SME role working across sites. This type of role is often best served through external resources who can deploy on site quickly as necessary to address security or systems management issues that cannot be resolved remotely.
Coordination with OEM vendors
In many cases, software or firmware updates need to be coordinated directly with OEM vendors. This is most prevalent in Distributed Control Systems where each component needs to be sequenced and coordinated with the others. Software application patches may require consistent updates to controller software, etc. In many cases, these are specific to the process at that individual site.
Some organizations rely heavily on these vendors, but the challenge becomes how to coordinate across sites and OEMs as well as how to manage those devices that are not part of the broader DCS, e.g., the environmental control system or the water treatment system. We have found success by establishing a coordinating external party to manage across these OEMs as well as the “balance of plant” systems to create consistency across plants and OEM systems. This “master contractor” ensures that security is applied consistently across systems.
This could certainly be a task performed by internal personnel as well, so long as that team has the requisite understanding of the different control systems.
Use of consolidated enterprise technology platforms
The coordination highlighted in the previous section implies the need for centralized reporting, analysis, and monitoring platforms. We see so many organizations struggling with various OEM security solutions and OEM support contracts across sites. They find it impossible to manage and track consistent application of systems management. Others, who have deferred security to plant leadership (a great way to ensure accountability), struggle with tracking that accountability and adherence to the intended controls. A centralized platform allows the organization to track and report on progress and consistency.
This centralized platform should extend to tracking the status of security management actions. In most cases, we find leveraging the organization’s IT systems management toolkits is the best way to accomplish this. For this reason, Verve formed a deep partnership with ServiceNow, which is the leading provider of these types of tools. This allows us to provide a much more integrated ticket management capability than separating OT systems management tasks into their own workflow tool.
A centralized platform is absolutely critical to efficiently managing the security as well as the operational reliability of the OT systems. Aggregating data from all security controls across all sites with key reliability data such as device uptime, performance, process alarms, log data, etc. enables the “Think Global:Act Local” approach to deliver the efficiency necessary.
Our experience is that the most effective approach is to balance internal and external resources. It is key that each organization maintain the OT security leadership and overall management and responsibility. This oversight and coordination function is key not only to ensure delivery of expectations, but also to act as the integration point between the external party and key internal stakeholders. For instance, in many cases external resources will need access to certain IT functions or personnel. They may require coordination with operations leadership. They may need to align policies between IT and OT. The internal OT security leadership can act as that bridge.
For many organizations it may make sense to build all of these functions internally if they have the scale and commitment to recruiting, training and retention. For many others a blend is likely more relevant.
Verve OT Systems Security Managed Services
As mentioned above, Verve brings 30 years of experience to OT Systems Management. Founded as an automation engineering firm, Verve retains its heritage of deep vendor-agnostic knowledge of these control systems. This began with reliability support in areas such as alarm management. It extended to reliability of data historians through monitoring and response to issues to ensure 98%+ uptime SLAs. Over the past fifteen years, Verve has provided security support services in the areas of ongoing vulnerability assessment, patch discovery and management, backup and restore maintenance, network segmentation and maintenance, configuration hardening, and application whitelisting and AV support.
We have found the following elements to be those most valued by our clients:
- Commitment to customer satisfaction and over-delivery on expectations. As one of our clients has said “Your team continually exceeds our expectations. Their knowledge and commitment to our success is truly appreciated.”
- Depth of OT knowledge. Our clients tend to refer to our team as the “OT Whisperers”. This presents itself when our team shows up with the IT security teams to provide comfort to the OT teams that the personnel touching their systems understand them as much as they do. It presents when a potential security issue arises and our team links its security and OT process knowledge to understand the true issue. It presents when remediating a risk, in the comprehensive steps taken to ensure no disruption to operations. It is very difficult to replicate 30 years of OT experience.
- Understanding of the 360-degree risk management necessary in OT. OT security cannot be done in silos. IT security, unfortunately, has grown up in silos, with patch management separate from configuration hardening, separate from user and account management, network management, etc. In OT, however, resources have to understand the complete picture of the OT system and its security. In many cases, a critical patch may not be feasible due to uptime requirements, or a central AD may not be available to manage users, etc. The external resources need to understand network segmentation and configurations, how application whitelisting can reduce risks, what patches can and should be deployed, whether a configuration change could mitigate the vulnerability, etc.
- Verve’s resources understand the breadth of security elements and provide a single interface to the client organization, both to drive efficiency and, most importantly, to ensure that the most effective security control is deployed.
- Efficiently managing tasks through centralized technology. The Verve team has deep knowledge of the Verve Security Center as well as a range of other security elements such as AV, whitelisting, network monitoring, etc. Key to efficient OT Security Systems Management is the ability to effectively manage centralized technology. The Verve team is steeped in these technologies providing rapid reporting and analysis as well as new analytics based on integrating the various security tools implemented.
Our client experience demonstrates significant efficiencies in partnering with Verve. Our clients find that they can reduce the cost and time of assessments by 50% and the ongoing remediation of risks by up to 70% due to the combination of Verve’s OT security knowledge and our Verve Security Center platform.
We look forward to an opportunity to share our experiences.
OT Systems Security Managed Services are a different breed from IT Managed Security Services. They require an understanding of the unique risks of OT environments, and of the fact that OT lacks the tradition of systems management normally found in IT. We have found that the most effective way of staffing these efforts is a mix of internal and external resources to create the ongoing scale and consistency of OT support necessary. Verve’s 30-year heritage of providing distinctive client service can help any industrial organization accelerate and maintain its OT cyber posture. Please reach out with questions.