Public Safety Canada (PSC) and several industry partners/agencies recently put forth open guidance on developing an OT and IT Cyber Incident Response Plan (CIRP), and it is a great step forward as a foundation for future work.
Generally, Canadian federal agencies have a “laissez-faire” approach to managing industry, particularly for cyber security. This release in some ways both breaks that mindset and continues it because of the way it is written: it is friendly guidance, full of Canadian voice.
For some, that is a good thing as it provides flexibility and speed, and for others, it is not prescriptive enough. So let’s dive in!
A cyber incident response plan is one of the capabilities you need to have, but it isn’t the first
First – the document provides clarity about what an organization might need before embarking on a mission to create a CIRP:
While those assumptions are accurate, they also imply the organization has a reasonable level of maturity and understanding of their environment on the IT side. From experience, many organizations might have those two assumptions covered, but they are often up to their ears in contradictions and putting out fires, especially in the ongoing vulnerability management and asset management categories. They might even struggle to enforce policies consistently across endpoints, laptops, or servers via software, off the land via Microsoft Group Policy Objects (GPOs), and to recover consistently from large-scale ransomware…
Obviously, there are some learnings that all parties can benefit from, so I’d argue that the underlying assumptions about overall DETAILED asset management and how to recover are missing. Any response without those two things will suffer. This is not clear in the document, but knowing your IT assets (which are frequently used in OT, by the way) can often support the CIRP process.
Detailed asset management is a pre-requisite to an OT cyber security response plan
Next, the document suggests that the reader should understand what assets they have and define what OT is within their organization. It has a typical listing of OT assets, which is good for those who need to define an organizational OT nomenclature to describe the assets within their inventory, but it is missing some categories of devices that fall through the cracks.
As you may notice, Public Safety Canada tried to skirt a few naming challenges. Supporting Computer-based Technology Infrastructure is some very creative wordsmithing to avoid the semantic debates (again, very Canadian), but I would like to augment that with a few more categories because this guidance is tainted by energy language:
- Sensors and relays (e.g., while often under protection systems, they can be standalone for things such as conveyor belts or managing load)
- Meters and valves (e.g., flow correction, automated shutoffs)
- Spectrum analyzers (e.g., devices used to determine if water or food are safe)
- Time synchronization (e.g., NTP servers, GPS, atomic clocks)
- Battery/backup systems (e.g., UPS or generators)
- Environmental & Human Safety monitoring (e.g., chlorine or H2S)
- Protocol gateways (e.g., serial to TCP gateways)
- Robotics (e.g., palletizers, welders, pill counters)
- Boiler/burner management systems (e.g., BMS)
Some of those will fit under the PSC’s listings, but it is worth discussing those elements. They are often forgotten and deserve respect because they are among the older components that get missed in acquisitions and divestment, or because they provide functionality that OT depends on (for whatever reason).
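To make the point about categories falling through the cracks a bit more concrete, here is a minimal sketch of how an inventory could be checked against the extra categories listed above. The category names, the `OTAsset` record, and the sample devices are all hypothetical illustrations, not PSC terminology or anyone's real inventory.

```python
from dataclasses import dataclass

# Hypothetical category labels derived from the list above (an assumption,
# not an official taxonomy from the PSC guidance).
OFTEN_MISSED_CATEGORIES = {
    "sensor_relay", "meter_valve", "spectrum_analyzer", "time_sync",
    "backup_power", "safety_monitoring", "protocol_gateway", "robotics",
    "burner_management",
}


@dataclass(frozen=True)
class OTAsset:
    """One inventory record; the fields are illustrative only."""
    name: str
    category: str
    site: str


def missing_categories(inventory):
    """Return the often-missed categories that have no asset recorded."""
    seen = {asset.category for asset in inventory}
    return sorted(OFTEN_MISSED_CATEGORIES - seen)


# Hypothetical sample inventory.
inventory = [
    OTAsset("ntp-01", "time_sync", "plant-a"),
    OTAsset("ups-03", "backup_power", "plant-a"),
    OTAsset("gw-serial-02", "protocol_gateway", "plant-b"),
]

for category in missing_categories(inventory):
    print(f"No assets recorded for category: {category}")
```

An empty category does not prove a gap (a site may genuinely have no robotics), but it is a cheap prompt to go and ask the question before an incident forces it.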
Having the right team for a cyber security incident response plan is acknowledged but understated
Assuming we have a working idea of what an organization is working with, PSC and friends then jump to understanding organizational structure for a Cyber Security Incident Response Team (CSIRT). The guidance defines the members and the approaches to handling a response, particularly with respect to the organization’s composition. However, this approach has a fatal flaw: it understates that OT/ICS personnel are critical to ensuring safety during an event and to leading a safe and timely recovery.
This section’s guidance is all well and good, but it does not address that fatal flaw of who may lead or play a very active part in the CSIRT. I think this needs to be examined a bit more:
- The suggested C-suite managers often know very little to nothing about how their operations are technically run; they rely on others. They understand risk, and they understand several things I probably do not know, but they likely neither know nor come from the process side of things. They do not know how a refinery recovers from a shutdown; they know the repercussions to the business. Incident Response in OT is more like recovering from a safety incident where someone lost a limb: it’s investigative, technical, and needs the right people involved to get to the bottom of it.
- Responders are not operators, and they are also usually not the “recoverees”. It’s great to identify and neutralize malware, but if a plant has ongoing operations (even if isolated and limited by an impact), operations and site owners need control, especially when there are large volumes of toxic chemicals or potentially disastrous consequences around the corner if things are not managed in a certain way. For example, if HMIs were infected at a nuclear power plant, you would do everything in your power to make sure cooling and monitoring of nuclear fuel would function. Recovery is very situation- and site-specific.
Even if we managed to seat the right people at the table and have chosen a centralized or decentralized approach to manage the incident (no feedback here), there is still a very large piece missing in how you approach an OT event. Generally, it is driven by process and by Health, Safety, & Environment (HSE) procedure. In fact, there are Statements of Procedure (SOPs) for many activities taking place in these facilities, and they should be abided by or acknowledged to prevent the situation from becoming worse.
Knowing the OT SOPs and supporting processes, and having the right individuals present to lead the event handling and recovery, leads to a more successful, less impactful outcome. Recognizing this viewpoint is critical to the business’ chances of a best-case recovery, and it improves communication while reducing unnecessary delays, confusion, and frustration.
The CIA Triad is used to describe the impact on resiliency – it should be SRP instead
While I (and many others) can argue that a better alternative might exist for OT (e.g., Safety-Reliability-Productivity), IT’s Confidentiality-Integrity-Availability (CIA) triad is often the starting point for typical enterprise security audits. To be fair, Public Safety Canada attempted to explain the issues through the following quote, but they did not emphasize the message strongly enough:
“However, in OT environments, there is less of a focus on “confidentiality” as there is a need for lower latency and 100% uptime (i.e., “availability”). The C-I-A Triad also differs for OT environments in that there is interdependence amongst organizations, which could have a cascading effect on other systems, stakeholders and even nations.” – PSC guidance
Public Safety Canada noted the concept of interdependence for the purposes of things such as utilities, but it’s missing the boat; IF you are a crown corporation and you provide water, sewer, and electricity, this statement might make more sense. But the real guts of it are:
- If you are in the business of pumping oil out of the ground that eventually will make it into various products, keep doing that
- If you are in the business of creating widgets that are used in a larger system of widgets for just in time manufacturing (e.g., an engine in a car), keep doing that
- If you are in the business of making cheese, you need to be getting your dairy, making products, and getting those to stores, keep doing that
- If you contribute to the economy and even micro-economies, keep doing that
The above points come down to exactly that: remaining profitable, reducing losses, and keeping people safely employed. It SHOULD be clear to the reader, but I hope that by calling out that messaging, people who are interested in solving problems in their organizations will recognize the importance that OT and IACS have within their organizations and to those depending on them.
Furthermore, I believe there is truth to “know your environment and build your response kits accordingly.” Again, another understatement being downplayed: holistic knowledge – know your assets, know your business, and have adequate protection and response tools and techniques dialed in. If you cannot survive a downtime of X at a burn rate of Y, be prepared, or be prepared to exit the market.
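The "downtime of X at a burn rate of Y" test is simple arithmetic, and it is worth actually running the numbers. Here is a minimal sketch, assuming a flat daily burn rate and a single cash reserve figure; the function names and dollar amounts are made up for illustration and real exposure will be messier (contract penalties, restart costs, lost customers).

```python
def survivable_downtime_days(cash_reserve, daily_burn):
    """Days of full downtime the reserve covers at a flat daily burn rate."""
    if daily_burn <= 0:
        raise ValueError("daily_burn must be positive")
    return cash_reserve / daily_burn


def can_survive(downtime_days, cash_reserve, daily_burn):
    """True if the reserve outlasts the given outage length."""
    return downtime_days <= survivable_downtime_days(cash_reserve, daily_burn)


# Hypothetical figures: a 1.2M reserve burning 150k per day of lost production.
reserve = 1_200_000
burn = 150_000

print(survivable_downtime_days(reserve, burn))  # 8.0
print(can_survive(5, reserve, burn))            # True
print(can_survive(10, reserve, burn))           # False
```

If your realistic recovery time from a ransomware event is longer than the number this produces, that gap is the business case for protection and response tooling.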
There are many places that describe the differences between IT and OT, but as PSC says, the point is not to create division but understanding. You don’t call grandma old and leave her in the corner to fend for herself, and then turn around to ask grandma to mind the kids and house while you are trying to make ends meet. When families are strong, they succeed.
Breaking down the OT cyber security incident response development process
Public Safety Canada provides seven steps in their guidance, providing good points on:
- Leveraging the skills, technologies, and capabilities you already have
- Creating leads and examining your existing processes
- Defining an incident and how to know the difference between an event and an incident
- Knowing how to classify an incident, and having severity matrices
- Understanding how to engage your team and under what conditions
- Knowing how to communicate when everything goes sideways
- Determining necessary response actions (which feels like it was adapted from the RCMP’s harmonized threat assessment methodology, to be honest)
- Acknowledging how the CIRP will fit within a crisis management plan
The last point is very distinctive and different from several other guidance strategies out there. It acknowledges, again, something most businesses already have for IT, and also for physical disasters. Use what you have and adapt what you can. People like familiar things, and reframing cyber into existing disaster/crisis management plans (DRP/CMPs) is very helpful inside of slow-to-change organizations.
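Going back to the classification and severity matrix point in the list above, here is a minimal sketch of what a safety-first matrix could look like in code. The impact and scope levels, the scoring, and the thresholds are my own assumptions for illustration (loosely following the Safety-Reliability-Productivity ordering); they are not taken from the PSC guidance.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Hypothetical impact levels, ordered by how much they hurt."""
    PRODUCTIVITY = 1
    RELIABILITY = 2
    SAFETY = 3


class Scope(IntEnum):
    """Hypothetical spread of the incident."""
    SINGLE_DEVICE = 1
    SINGLE_SITE = 2
    MULTI_SITE = 3


def classify(impact, scope):
    """Map an (impact, scope) pair to a severity label.

    Any safety impact is treated as critical regardless of scope;
    everything else is scored as impact x scope (assumed thresholds).
    """
    score = int(impact) * int(scope)
    if impact == Impact.SAFETY or score >= 6:
        return "critical"
    if score >= 3:
        return "high"
    if score == 2:
        return "medium"
    return "low"


print(classify(Impact.PRODUCTIVITY, Scope.SINGLE_DEVICE))  # low
print(classify(Impact.RELIABILITY, Scope.SINGLE_SITE))     # high
print(classify(Impact.SAFETY, Scope.SINGLE_DEVICE))        # critical
```

The design choice worth debating with your operations and HSE people is the override: a single infected device that touches a safety function outranks a multi-site nuisance, which is rarely how an IT-centric matrix would score it.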
Maintaining the CIRP
No process or governance is complete without optimization and evolution, and CIRPs are no different. However, before the maintenance part, the plan needs to be:
- Tested with end-to-end human trials (not just tabletops)
- Adjusted with relevant changes (and technology investments perhaps)
- Tested again REGULARLY as if it were a “fire drill”
Protect and Recover language is massively missing in this document, and I recognize that as a gap in many frameworks, even though both are critical pieces of frameworks such as the NIST CSF. Yes, Incident Response fits into the Respond function of the NIST CSF, as it does in most frameworks, but this document does not clearly outline the relationship to the other necessary activities to its left or right (e.g., Protect, Detect, or Recover). It also needs help describing how to respond to certain types of incidents, but I suspect this is also on the PSC’s mind.
Regardless, this is a great piece of guidance to link a bunch of concepts together. Is it complete? No. Is it tainted by the energy utilities? Yes, BUT it is helpful, and an important piece of a well-rounded cyber security program. I cannot wait to see more as this document evolves.
In addition to this document, feel free to check out these resources for further information:
- NIST SP800-61r2 (Computer Security Incident Handling Guide)
- Four Keys to Effective ICS Incident Response
- CISA National Cyber Incident Scoring System
- Best Practices for Victim Response and Reporting of Cyber Incidents