Incident Management

Harness AI SRE's incident management system is a platform for tracking, coordinating, and resolving service disruptions. Teams can manage the entire incident lifecycle, from creation to resolution, with automated workflows, real-time collaboration, and intelligent response procedures.

Overview

Incidents in Harness AI SRE help you:

  • Create and track service disruptions with standardized incident types
  • Coordinate response efforts across teams and stakeholders
  • Document incident timelines with automated event tracking
  • Execute automated remediation steps through integrated runbooks
  • Generate comprehensive post-mortems and action items
  • Integrate with monitoring tools for automatic incident creation
  • Manage escalation policies and on-call notifications

Key Features

Intelligent Incident Creation

  • AI-powered problem description analysis and field auto-population
  • Multiple creation methods: manual, alert-based, and monitoring integration
  • Standardized incident types with pre-configured fields and workflows
  • Quick Start functionality for rapid incident creation

Comprehensive Incident Management

  • Real-time incident details page with editable fields
  • Timeline tracking with automatic event logging
  • Manual key event addition for important milestones
  • Status updates and ownership management
  • Integration with on-call schedules and escalation policies

Automated Response Procedures

  • Runbook execution directly from the incident interface
  • Action item creation and assignment with due dates
  • Automated workflow triggers based on incident type
  • Integration with monitoring tools and alert systems

Collaboration and Communication

  • Timeline-based messaging and updates
  • Team notifications and stakeholder communication
  • Action item tracking and assignment
  • Post-incident analysis and documentation

Creating an Incident

Follow this interactive guide to create and manage incidents with AI-powered assistance and automated workflows.

Best Practices

Incident Creation and Classification

  • Choose Appropriate Incident Types: Select the most specific incident type to ensure proper field configuration and runbook association
  • Provide Detailed Descriptions: Use the Quick Start feature with comprehensive problem descriptions to enable accurate AI field population
  • Verify Auto-Generated Fields: Always review and adjust AI-suggested field values to ensure accuracy
  • Set Correct Severity Levels: Align severity with actual business impact and response time requirements
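
One way to make "align severity with business impact and response time requirements" concrete is to write the mapping down as data. The sketch below is illustrative only: the severity names, thresholds, response targets, and the `classify_severity` helper are hypothetical and are not part of the Harness AI SRE product or its API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    min_affected_users_pct: float  # lower bound of user impact for this level
    response_target: timedelta     # how quickly someone should acknowledge


# Hypothetical policy table, ordered from most to least severe.
POLICIES = [
    SeverityPolicy("SEV1", 50.0, timedelta(minutes=5)),
    SeverityPolicy("SEV2", 10.0, timedelta(minutes=15)),
    SeverityPolicy("SEV3", 0.0, timedelta(hours=1)),
]


def classify_severity(affected_users_pct: float, revenue_impacting: bool) -> SeverityPolicy:
    """Pick the first policy whose impact threshold is met."""
    if not 0.0 <= affected_users_pct <= 100.0:
        raise ValueError("affected_users_pct must be between 0 and 100")
    # Assumption for this sketch: revenue impact always escalates to the top level.
    if revenue_impacting:
        return POLICIES[0]
    for policy in POLICIES:
        if affected_users_pct >= policy.min_affected_users_pct:
            return policy
    return POLICIES[-1]
```

Keeping the thresholds in one table makes it easier to review an AI-suggested severity against an agreed rule rather than against individual judgment.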

Incident Response and Management

  • Acknowledge Quickly: Respond to incidents promptly to minimize impact and meet SLA requirements
  • Assess Impact Thoroughly: Evaluate affected services, user impact, and business consequences
  • Execute Relevant Runbooks: Use associated runbooks for standardized response procedures
  • Document All Actions: Record every action taken in the timeline for audit trails and learning
  • Update Status Regularly: Keep incident status current to inform stakeholders and trigger appropriate workflows

Timeline and Event Management

  • Add Key Events: Document critical milestones, decisions, and turning points in the incident lifecycle
  • Use Timeline Messaging: Communicate updates and coordination through the incident timeline
  • Maintain Chronological Order: Ensure all events are properly timestamped and sequenced
  • Include Context: Provide sufficient detail in timeline entries for future reference and analysis
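
The timeline practices above can be illustrated with a small in-memory model. This is a minimal sketch under stated assumptions: `TimelineEvent` and `Timeline` are hypothetical names invented for illustration and do not reflect how Harness AI SRE stores or exposes timeline data. It shows one way to keep entries chronological even when a key event is recorded after the fact.

```python
import bisect
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(order=True)
class TimelineEvent:
    timestamp: datetime                       # only field used for ordering
    message: str = field(compare=False)
    key_event: bool = field(default=False, compare=False)


class Timeline:
    def __init__(self):
        self._events = []

    def add(self, timestamp: datetime, message: str, key_event: bool = False) -> TimelineEvent:
        # Require timezone-aware timestamps so entries from different
        # responders can be compared without ambiguity.
        if timestamp.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")
        event = TimelineEvent(timestamp.astimezone(timezone.utc), message, key_event)
        # insort keeps the list sorted; entries with equal timestamps
        # stay in the order they were added.
        bisect.insort(self._events, event)
        return event

    def events(self):
        return list(self._events)

    def key_events(self):
        return [e for e in self._events if e.key_event]
```

A late entry such as "Root cause identified" added with an earlier timestamp lands in its correct position, so the exported timeline reads in the order things happened rather than the order they were typed.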

Action Item Management

  • Create Specific Action Items: Define clear, actionable tasks with specific outcomes
  • Assign Ownership: Ensure every action item has a designated owner and due date
  • Track Progress: Regularly update action item status and completion
  • Follow Up: Monitor action items through completion to prevent issues from recurring
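
The ownership and follow-up rules above lend themselves to a simple periodic check. The sketch below assumes a hypothetical `ActionItem` record and a `needs_follow_up` helper; neither is a Harness AI SRE type or API, and the field names are chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False


def needs_follow_up(items: List[ActionItem], today: date) -> List[Tuple[str, str]]:
    """Return (title, reason) pairs for open items that break the rules above."""
    flagged = []
    for item in items:
        if item.done:
            continue  # completed items need no follow-up
        if item.owner is None:
            flagged.append((item.title, "no owner"))
        elif item.due is None:
            flagged.append((item.title, "no due date"))
        elif item.due < today:
            flagged.append((item.title, "overdue"))
    return flagged
```

Running a check like this during post-incident reviews surfaces unowned or overdue items before they are forgotten.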

Communication and Collaboration

  • Use Structured Communication: Follow incident communication templates and standards
  • Update Stakeholders Regularly: Provide timely updates to affected teams and leadership
  • Leverage Integration Channels: Use Slack, Teams, or other integrated communication tools
  • Maintain Professional Tone: Keep all incident communication clear, factual, and professional

Post-Incident Activities

  • Complete Action Items: Ensure all follow-up tasks are completed within specified timeframes
  • Conduct Reviews: Analyze incident response effectiveness and identify improvement opportunities
  • Update Documentation: Refine runbooks, procedures, and incident types based on lessons learned
  • Share Knowledge: Communicate insights and improvements with the broader team

Benefits

  • Streamlined Response: AI-powered incident creation shortens response time and improves accuracy
  • Standardized Processes: Incident types ensure consistent handling across all teams and services
  • Automated Workflows: Integrated runbooks and action items automate response procedures
  • Complete Visibility: Timeline tracking and event logging provide full incident lifecycle visibility
  • Enhanced Collaboration: Built-in communication tools facilitate team coordination and stakeholder updates
  • Continuous Improvement: Action item tracking and post-incident analysis drive process optimization
  • Integration Ready: Seamless connection with monitoring tools, alert systems, and communication platforms
