This document is a process guide and reference for KPMG GSOC analysts performing triage as part of the security incident handling process. The purpose of triage is to enable GSOC analysts to rapidly identify the highest priority security incidents, maximize the number of potential security incidents that are reviewed, and minimize false positives.
This process document does not go into elaborate detail, provide low-level technical procedures, or address all potential outcomes or failure cases. Analysts are expected to maintain and rely on technical work notes and controls to guide their execution of this process, and are expected to use their best judgment when minor adaptations are needed to execute triage.
This document is ultimately owned by the GSOC Director. He or she is responsible for ensuring that this document is updated and maintained in response to feedback from GSOC Analysts.
This document is intended for all GSOC members (see Section 2.9 – Responsibilities).
This document should be reviewed at least every 6 months, or at any time that the Constraints or Assumptions (Section 2.8) are believed to have changed.
Exceptions to this process can be temporarily authorized by L2 analysts. Documentation of any process exceptions must be provided to GSOC management for potential process modifications.
Failure to adhere to this process must be reported to the lead shift L2 analyst, with a courtesy copy to the L3 and GSOC Operations Manager.
The following roles have overall responsibility for elements of this process. Please note that this is not a comprehensive listing of the responsibilities of each role; it covers only each role’s specific responsibilities in support of the triage process.
The GSOC Director is ultimately responsible for the proper functioning of the triage process, and for ensuring that supporting processes are also healthy. He or she must ensure that this process is reviewed and adjusted at least every 6 months, and must authorize process adjustments more frequently if needed.
The GSOC L1 Analyst is the primary user of the triage process. He or she will triage all security incidents detected via automated or non-automated means, escalate when the triage process is not functioning properly, and identify and communicate issues with the triage process.
The GSOC L2 Analyst will accept triaged security incidents when they are escalated to him or her, support the L1 Analyst when the triage process is not functioning properly, and identify and communicate issues with the triage process.
The GSOC Tooling Engineer will ensure that security incident detection content maximizes the value of automation to accurately prioritize security incidents prior to the triage process. This will include incorporating threat intelligence indicators and internal device information (e.g., Active Directory identities, VIP lists) into detection and prioritization content. He or she will also ensure that information about false positives identified by Analysts during and after the triage process is incorporated into updated content rules.
The GSOC Threat/Intel Analyst will ensure that continuously updated threat intelligence (e.g., indicator lists) is provided to support GSOC Tooling Engineer-designed automation. The GSOC Threat/Intel Analyst will also provide internal GSOC intelligence notes as needed to support the triage process.
- Content Management Process
- Detection Optimisation Process
- Escalation Process
- Intelligence Management Process
- Escalation Process
- Detection Optimisation Process
The triage process takes as input automated and manually detected security events, which are then rapidly analysed by the GSOC L1 Analyst in order to reduce the false positive rate and ensure accurate prioritization of security incidents.
The goal of triage is to ensure that the following criteria have been met:
- Triage of the security incident occurs within 15 minutes.
- False positives are identified and the false positive security incidents are updated and closed.
- The priority of the security incident is confirmed, or updated.
- Security incidents which require additional investigation are identified and placed in the appropriate queue.
A key predecessor to triage is the automation of initial detection and initial prioritization of the security incident. The specific design of the automation rules for this falls outside the triage process, but the impact on the triage process of the output of this automation is significant, so the supporting processes are discussed below.
The identity of the target (human user, compromised system, or data) is an essential part of prioritization. Tooling engineers must ensure that the information listed below is incorporated into automated security incident prioritization. Analysts should understand the automation rule logic well enough to trace the origin of an automated prioritization decision and to confirm it. In the event of false positives, it is important to identify the specific gaps in automated prioritization mechanisms.
- Target user identity/role. The significance of the role of a targeted user (e.g., VIP, privileged user) can normally be determined from Active Directory group membership.
- Target system purpose/role. The significance of the role of a targeted system (e.g., Internet facing, production, data repository) can sometimes be identified from member firm change management records, network maps, and Active Directory.
- High-value data location. The locations of high-value data (e.g., specific file shares or databases containing customer data) can sometimes be identified from risk registers and change management records.
Attacker identity is an important part of the prioritization process. Normally, these identities are based on indicators (threat logic), and their relationship to an attributed (identified) attacker has limited confidence. However, when these attacker identities can be tied to a known campaign or malware infrastructure, this can be used to assist in prioritization. Analysts should understand the threat intelligence databases, watch lists and/or information repositories sufficiently to analyse prioritization decisions based on this automated logic.
The initial goal for the triage process is to have completed it within 15 minutes. There will be occasional security incidents where triage within this timeframe is challenging due to lack of contextual data available to make an assessment, or a delay in getting access to a data source to resolve open questions. In this case, the analyst must nonetheless do their best to estimate a priority based on available information.
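The 15-minute target above can be checked mechanically. The following is a minimal sketch, assuming each security incident records when it was received and when triage finished. The function name and the sample timestamps are illustrative and are not taken from any GSOC tool.

```python
from datetime import datetime, timedelta

# Triage target taken from this process: 15 minutes from receipt.
TRIAGE_TARGET = timedelta(minutes=15)


def triage_within_target(received_at: datetime, triaged_at: datetime) -> bool:
    """Return True if triage completed within the 15-minute target."""
    if triaged_at < received_at:
        raise ValueError("triaged_at is earlier than received_at")
    return triaged_at - received_at <= TRIAGE_TARGET


# Hypothetical sample timestamps.
received = datetime(2024, 1, 1, 9, 0)
print(triage_within_target(received, datetime(2024, 1, 1, 9, 12)))  # True
print(triage_within_target(received, datetime(2024, 1, 1, 9, 20)))  # False
```

A check like this only measures elapsed time; it does not replace the requirement to estimate a priority from the available information when the target is hard to meet.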
For triage situations where it is challenging to make a clear decision about the priority of a ticket, and there is no clear path forward to resolve an open question, the default triage decision should be to assume the worst case scenario.
For example, in the case of a security incident with the following evidence:
- C2 traffic to an IP associated with a known nation-state campaign against KPMG (would normally indicate at least a P2)
- Traffic appears to involve so many nodes on such a frequent basis that it is unlikely to be a nation-state campaign (indicates a P3 or false positive).
- There is no way for the analyst to rapidly get additional information about the traffic.
This should then be assumed to be a P2 Priority Ticket.
This is not intended to imply that any missing information should be assumed to be the worst-case evidence that would result in a higher priority. The intent is that if there’s conflicting evidence with some pointing to a higher priority, the higher priority should be assumed.
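The rule above can be expressed as a small function. This is a sketch only, under the assumption that each piece of evidence is reduced to the priority number it points to (2 for P2, 3 for P3) and that missing evidence is represented as `None`. The function name is hypothetical, and false positive determinations are deliberately left out of the model.

```python
from typing import Iterable, Optional


def resolve_priority(suggested: Iterable[Optional[int]]) -> Optional[int]:
    """Pick the highest priority (lowest P-number) among conflicting evidence.

    Each item is the priority number a piece of evidence points to
    (2 for P2, 3 for P3), or None when the evidence is simply missing.
    Missing evidence is ignored rather than treated as worst case.
    """
    known = [p for p in suggested if p is not None]
    return min(known) if known else None


# Example from the text: the C2 indicator points to P2, the traffic volume
# points to P3, and no further information is available.
print(resolve_priority([2, 3, None]))  # 2
```

Ignoring `None` values reflects the caveat that missing information is not itself treated as worst-case evidence; only conflicting evidence that is actually present drives the higher priority.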
The final result of the triage process of a security incident shall always be one of the following states:
- Prioritized as P2 or higher, and placed in the L2 queue for additional investigation
- Prioritized as P3, and placed in the L2 queue for additional investigation
- Classified as a False Positive, and the security incident closed.
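The three permitted final states can be modelled as an enumeration so that a triage result can never end in an undefined state. This is an illustrative sketch: the names are hypothetical, and it assumes P1 is the only level above P2, since the text names only P2, P3, and P4.

```python
from enum import Enum


class TriageOutcome(Enum):
    P2_OR_HIGHER = "P2 or higher: L2 queue for additional investigation"
    P3 = "P3: L2 queue for additional investigation"
    FALSE_POSITIVE = "False positive: security incident closed"


def triage_outcome(priority: int, false_positive: bool) -> TriageOutcome:
    """Map a triage decision to one of the three permitted final states."""
    if false_positive:
        return TriageOutcome.FALSE_POSITIVE
    if priority in (1, 2):  # assumption: P1 is the only level above P2
        return TriageOutcome.P2_OR_HIGHER
    if priority == 3:
        return TriageOutcome.P3
    # P4 and anything else fall outside the states listed above.
    raise ValueError(f"P{priority} is not a valid final triage state")


print(triage_outcome(2, False).value)
```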
This section describes the key steps required during the triage process to prioritize a security incident.
Upon receipt of a new security incident, an initial review should be conducted to confirm why the security incident received the initial prioritization. For example, the content rule which generated the security incident should be examined to enable the analyst to understand the logic behind the initial classification. If there is some doubt about the value of the classification (perhaps the use of an IP watchlist that has previously generated some false positives), then this should be examined during detailed prioritization.
Detailed prioritization contains the detailed steps to examine elements of the security incident that will help more accurately prioritize the security incident. Depending on which content rule fired to generate the security incident, it is not always necessary to examine all elements during this phase. For example, a security incident detection based on a watchlist IP hit may not require additional analysis of the attacker. In this case, the attack should be examined (such as the URL within the log) to help provide evidence to refute or confirm the existence of an attack, and the target should be analysed to see if the target belonged to a high-interest category.
Attack analysis is an attempt to determine the type of attack that was involved in a security incident. This attack information will drive remediation behaviours. Typical ways to determine the type of attack include:
- Examination of the initial security alert (content will include metadata such as description fields about associated attacks sent to SECOPS – written by tooling engineers at the time of content creation)
- Examination of logs associated with the alert (e.g., examination of HTTP URL information associated with an IP watchlist hit)
- Examination of internal and external threat intelligence information
Attack information will be associated with the security incident in SECOPS tool using the VERIS categorization scheme.
Attacker analysis is an attempt to determine the type of attacker which was involved in a security incident. Attacker analysis will drive prioritization decisions and remediation behaviours. Typical ways to determine the type of attacker include:
- Examination of the initial security alert. In particular, indicator-based detections will often be associated with attackers (some content will include notes about associated attackers)
- Examination of internal and external threat intelligence information
Target analysis is an attempt to determine the importance and/or value of the target which was involved in a security incident. Target analysis will drive prioritization decisions and remediation behaviours. Typical ways to determine the type of target include:
- Examination of the initial security alert (some content will include notes about target identity).
- Examination of internal data repositories (Active Directory, Change Management Systems).
- Examination of previous security incidents.
- Calls to member firm IT departments (dependent on member firm)
Impact and Intent Analysis is the analysis of various details associated with a security incident which allow a conclusion to be drawn about attacker intent, or the ultimate impact of the security incident. This form of analysis may combine multiple elements of detail to assist with prioritization of security incidents. Some typical ways to determine either of these criteria include:
- Identification of the search behaviours of an attacker (e.g., searches for specific types of data), which can help establish that an attempted data breach is underway.
- Identification of the content of data or specific file stores which has been exfiltrated, copied, or attempted to be accessed by the attacker.
- Access or attempted access of binaries/files, registry settings, DLLs, or other system artefacts which assist with identification of a specific malware intent (e.g., attempted access to the Windows SAM, an attempt to start or stop a service).
- Attempt to conduct a denial of service or affect availability on a system or service that will affect a significant number of KPMG systems, or affect KPMG systems that are customer-facing.
Some security incidents are part of a pattern related to other security incidents. Conducting an analysis for related activity, or for other existing security incidents which may be associated with a current security incident, can help identify P2 or higher priority security events that are part of a larger problem, or identify recurrence of an existing problem. It is important to carefully choose the criteria by which security incidents are associated. In general, many garden-variety malware-related security incidents (e.g., Zeus, Conficker) may appear related because they share C2 indicators or anti-virus/IDS definitions, but the match may be only coincidental. The purpose of this analysis is to identify those security incidents which are clearly related. Some examples of methods for identifying related security incidents include:
- Looking for indicators or identifying characteristics (IP addresses, system names, user names) associated with the current security incident in historical log data.
- Looking for indicators or identifying characteristics (IP addresses, system names, user names) associated with the current security incident in the current or historical security incidents.
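The second method above, matching indicators against current or historical security incidents, can be sketched as a set intersection. This is a minimal illustration and not a description of any GSOC tool: the incident identifiers, host names, and the `min_shared` threshold are hypothetical. Requiring more than one shared indicator is one possible way to avoid associating incidents that only coincidentally share a common C2 indicator.

```python
from typing import Dict, Set


def related_incidents(current: Set[str],
                      history: Dict[str, Set[str]],
                      min_shared: int = 2) -> Dict[str, Set[str]]:
    """Return past incidents sharing at least min_shared indicators."""
    matches = {}
    for incident_id, indicators in history.items():
        shared = current & indicators
        if len(shared) >= min_shared:
            matches[incident_id] = shared
    return matches


# Hypothetical data: indicators are IP addresses, system names, user names.
history = {
    "INC-1001": {"203.0.113.7", "ws-0142"},
    "INC-1002": {"203.0.113.7"},
}
current = {"203.0.113.7", "ws-0142", "jdoe"}
print(sorted(related_incidents(current, history)))  # ['INC-1001']
```

Lowering `min_shared` to 1 would also return INC-1002, which illustrates how the choice of association criteria changes the result.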
The concept of security incident ownership is particularly important for security incident handling at KPMG. Due to the limited authority or ability of the KPMG GSOC to investigate or remediate security incidents, it is essential that the affected member firms be identified to enable either additional investigation or remediation of a security incident. This requirement is particularly important given the potential for data privacy implications of any security incident.
The primary step will be the identification of the affected member firms. Normally, a security event source can be tied to a specific VLC, enabling rapid identification of the associated member firm (meta-information based on VLC identity). In some cases, particularly in the case of smaller organizations using shared infrastructure, the identification of the affected member firm may depend on Active Directory information.
If there is doubt about the member firm associated with a security incident, the GSOC must confirm security incident ownership prior to communication of any items with potential data privacy implications to a member firm.
During triage, or during follow-on investigation, it may be necessary to identify which support organization within a member firm is capable of providing additional information in order to allow a security incident to be properly prioritized or identified as a false positive. In addition, some firms may not have a dedicated security organization capable of conducting remediation. In this case, it may be necessary for the GSOC to work directly with member firm support organizations.
Appropriate support organization(s) associated with member firms will be identified or highlighted during the on-boarding process, and this data will be stored in the GSOC Internal Sharepoint Site. The limitations and flexibility regarding which support POCs can be contacted by the GSOC will be determined during the Member Firm Onboarding process. While the GSOC may contact more senior support personnel within a member firm as a result of following a support escalation tree, the GSOC should not under any circumstances interact with a 3rd party firm outside of KPMG. Contact with 3rd parties (such as service providers for a member firm) should always be conducted through the support POC of the member firm.
Due to the timelines associated with the triage process, the GSOC will normally only be able to leverage member firm support organizations for information if it has established working relationships with them (i.e., previously identified MF teams or personnel who can be contacted for immediate feedback via tools such as Lync or a phone call).
Large member firms will normally have a dedicated security organization, with some level of responsibility for security incident handling. The identification of these points-of-contact will normally occur during the on-boarding phase. However, as with the identification of the support organization, it is expected that the GSOC will occasionally have to conduct internal research, or follow escalation chains, in order to reach the correct personnel to effectively investigate or hand-off a security incident for remediation.
The restriction on the ability for the GSOC to interact with 3rd party firms (external service providers) is identical to section 5.3. Under no circumstances will the GSOC interact with a 3rd party firm without the involvement of the associated member firm.
Due to the timelines associated with the triage process, the GSOC will normally only be able to leverage member firm security organizations for information if it has established working relationships with them (i.e., previously identified MF teams or personnel who can be contacted for immediate feedback via tools such as Lync or a phone call).
The final step of the L1 triage will always result in either escalation to the L2, a false positive determination, or a decision to conduct additional investigation. Each of these final results is discussed in further detail below.
Escalation from the GSOC L1 analyst will occur for one of two reasons:
- A security incident has been determined to be a P2 or higher
- Challenges with internal tools, data, process, or workload prevent the GSOC L1 analyst from completing triage.
Following escalation, the GSOC L2 analyst will either continue the triage process, or identify a solution for the challenge facing the L1 and de-escalate the security incident to him or her with instructions for how to resolve it further. In the case of a P2 or higher security incident, the L2 will always execute the remaining steps of triage to either categorize it as a false positive, determine it is actually a P3 and return it to the L1 analyst, or put it in the queue for further investigation.
In the case that the GSOC L1 analyst identifies a security incident that is a false positive, the GSOC L1 Analyst will annotate the security incident with the method by which the false positive was determined, identify any intelligence source that led to the incorrect detection, and close the security incident as a false positive. Where a recurring or repeated false positive affects an analyst’s workload, the L1 analyst should directly contact an L2 or tooling engineer via the content management application to request a change or adaptation to security content rules to reduce the particular false positive case.
Following triage, there will normally be a requirement for additional investigation of the security incident. The triage process will only endeavour to rapidly confirm priority and determine ownership of a security incident.
The following mechanisms will be some of the primary tools used to support the analysts during the triage process. It is expected that detailed instructions for the use of each of the following mechanisms will be provided on the GSOC Internal Sharepoint Site.
A primary resource for triage and investigation of security incidents will be internal directory stores. The vast majority of systems and users that will be targeted will be located within Active Directory. Searches of Active Directory can be conducted via GUI, command-line, or scripted searches. Some of the data that will be derived from Active Directory will include:
- Determining system names/location/business units of a specific internal IP address
- Determining systems associated with users and vice-versa
- Identifying logged-in users.
- Determining system type (server, workstation, version, etc.)
- Determining system purpose (based on GPOs and directory extensions for specific systems)
- Identifying user role/identity (based on group, business unit OU, etc.)
- Identifying privileged users.
Domain-name tools will be used for similar purposes to Active Directory, though primarily for the purpose of identifying system information. Searches may be performed via basic reverse DNS lookups, or through a domain name system management GUI or command line. Some examples of data that can be derived from Domain Name Tools include:
- Determining system names/location/business units of a specific internal IP address
- Determining system type (server, workstation, etc.) based on naming convention
- Determining system purpose based on naming convention
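A basic reverse DNS search of the kind mentioned above can be sketched with Python's standard library. This is an illustrative example only: `reverse_lookup` is a hypothetical helper, and the `resolver` parameter is an assumption added so the function can be exercised without network access. By default it uses `socket.gethostbyaddr`, which queries the system resolver.

```python
import socket
from typing import Callable, Optional


def reverse_lookup(ip: str,
                   resolver: Callable = socket.gethostbyaddr) -> Optional[str]:
    """Return the primary hostname for an IP, or None if there is no record."""
    try:
        # gethostbyaddr returns (hostname, alias list, address list).
        hostname, _aliases, _addresses = resolver(ip)
    except (socket.herror, socket.gaierror):
        return None
    return hostname
```

The returned hostname can then be read against the member firm's naming convention to infer system type or purpose. The convention itself varies by member firm and is not modelled here.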
Network management and monitoring tools can be used to identify additional information about targeted systems. In particular, because network monitoring tools are normally employed primarily on high-value systems, and often contain context about system purpose, they can be very valuable for determining the potential impact of a security incident.
A further capability, if the right permissions are offered by network management teams, is the potential to do live queries of network devices to gather information about conditions on endpoints. Examples of this include SNMP queries of switches for CAM tables, querying active sessions on firewalls, and finding current bandwidth utilization.
Change management tools can be used to identify user identity/role and system identity and purpose. Access to Remedy or equivalent tools can also enable an analyst to confirm whether a potential security incident is actually the result of a planned system change.
Examining historical log behaviour is a valuable tool for confirming whether a currently investigated security incident is normal or abnormal activity. Some example mechanisms for using this type of data during security incident triage include:
- Confirming company-wide traffic to suspected C2 nodes, both recently and historically
- Searching past traffic patterns of the same system in previous days
- Looking for similar behaviour by similar accounts or systems
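The first two checks above can be sketched as a simple grouping over historical connection records. This is a minimal illustration, assuming log data has already been reduced to (date, source host, destination IP) tuples. The record layout, host names, and addresses are hypothetical and do not reflect the actual logging and monitoring schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Each record: (date as "YYYY-MM-DD", source host, destination IP).
LogRecord = Tuple[str, str, str]


def hosts_per_day(records: Iterable[LogRecord],
                  suspect_ip: str) -> Dict[str, Set[str]]:
    """Group the internal hosts that contacted suspect_ip by day."""
    by_day: Dict[str, Set[str]] = defaultdict(set)
    for day, src_host, dst_ip in records:
        if dst_ip == suspect_ip:
            by_day[day].add(src_host)
    return dict(by_day)


records = [
    ("2024-01-01", "ws-0142", "203.0.113.7"),
    ("2024-01-01", "ws-0199", "203.0.113.7"),
    ("2024-01-02", "ws-0142", "203.0.113.7"),
    ("2024-01-02", "ws-0142", "198.51.100.9"),
]
for day, hosts in sorted(hosts_per_day(records, "203.0.113.7").items()):
    print(day, len(hosts))
# 2024-01-01 2
# 2024-01-02 1
```

A suspected C2 node contacted by many hosts over a long period may point to normal activity, while a single host with a recent first contact may warrant closer examination.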
Using tools like Robtex and Google can provide valuable insight into the current understanding of malware behaviours, C2 IPs, and other context associated with a security incident. It is important that analysts be careful about the information they put into Internet search engines, in order to minimize potential compromise of internal KPMG information (e.g., don’t paste the content of an HTTP C2 string associated with a suspected nation-state threat into any search engine).
It is expected that the GSOC will develop its own mechanisms for tracking and identifying potential security incidents. Using these internal notes, intelligence repositories, and the like is an essential step in reviewing potential security incidents.
- Constraints and Assumptions
The purpose of this appendix is to describe significant constraints and assumptions that are the key drivers for the design and content of this process. The purpose of identifying these key constraints and assumptions is to ensure that, when constraints change or assumptions are disproven, the processes are examined to ensure that they still apply and are optimized for the goals of the GSOC.
The current design of the KPMG GSOC is primarily focused on network-artefact collection of data rather than collection from endpoints. Limited endpoint information may be available when endpoint anti-virus logs are included in GSOC logging and monitoring systems, and some endpoint information may be available from directory services, domain name services, or member firm change management records. On a case-by-case basis, it may be possible for GSOC analysts to gather some additional information using direct outreach to member firm IT teams, but this may be time-consuming and customized based on member firm requirements.
The current expectation for false positive rates is that no more than 50% of all automated security incident detections that enter the triage process will be false positives, and that no more than 5% of all security incidents passed to member firms will be false positives. These false positive rate requirements will significantly influence analyst decisions during triage, and during any follow-on analysis. These represent initial proposed limits, and will change over time.
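The two limits above reduce to simple ratios. The following is a sketch under the assumption that counts of false positives and totals are available for both stages. The function and constant names are illustrative, and the sample counts are invented for the example.

```python
# Limits taken from the text: 50% at triage input, 5% at hand-off.
TRIAGE_INPUT_LIMIT = 0.50
MEMBER_FIRM_LIMIT = 0.05


def false_positive_rate(false_positives: int, total: int) -> float:
    """Fraction of incidents in a sample that were false positives."""
    if total <= 0:
        raise ValueError("total must be positive")
    return false_positives / total


def within_limits(triage_fp: int, triage_total: int,
                  passed_fp: int, passed_total: int) -> bool:
    """Check both false positive limits at once."""
    return (false_positive_rate(triage_fp, triage_total) <= TRIAGE_INPUT_LIMIT
            and false_positive_rate(passed_fp, passed_total) <= MEMBER_FIRM_LIMIT)


# Invented sample: 40 of 100 detections were false positives (40%),
# and 2 of 60 incidents passed to member firms were false positives (about 3.3%).
print(within_limits(40, 100, 2, 60))  # True
```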
The KPMG GSOC is not expected to have responsibility or authority for remediation or response to security incidents, but must conduct investigations following initial detection so that sufficient information can be passed to a member firm for it to know what action to take.
The current definition for P4 security incidents will likely result in a very large number of P4 security incidents. Accordingly, it is not expected that P4 priority security incidents will be included in the triage process. It may well be that, during triage, a P4 security incident is associated with the P3 or higher security incident that is being examined for triage, but it is not intended to be the subject of the triage process.
A significant number of systems and/or automatic filtering systems will be developed by the GSOC to simplify the detection and triage processes. However, the specific capabilities of this automation have not yet been confirmed. Some of the key initial capabilities that will be phased in over time are:
- A threat intelligence tool in place to simplify tracking of indicators
- An initial list of high-value personnel or systems integrated with automated detection systems
The current set of events from the logging and monitoring solution will probably be heavily driven by data derived from IP addresses or URLs. This type of detection makes triage challenging, as the existence of a URL and/or IP connection may rely on additional data to confirm a security incident. The triage process will be optimized to address this consideration.