This topic compares two approaches to EC2 chaos injection: the Kubernetes agent with SSM, and the native Linux agent.

| Area | Kubernetes agent-driven EC2 chaos | Native Linux agent-driven EC2 chaos |
|---|---|---|
| Install Prerequisites/Agent Setup | - Installing the agent requires the user to be a cluster-admin or to be mapped to a cluster role with these permissions.<br>- The SSM Agent is installed (it runs with sudo by default) on the target EC2 instance(s).<br>- The default SSM IAM role is attached to the target EC2 instance(s).<br>- Either create a secret with the account user credentials, or map an appropriate IAM role reference/ARN to the chaos ServiceAccount to carry out the chaos injection. | Console access to the machine as root/sudo, or the ability to inject processes remotely over SSH as root/sudo. |
| Installed Components | The K8s agent comprises the following stateless deployments in a dedicated namespace: subscriber, wf-controller, chaos-operator, and exporter, along with some Secrets and ConfigMaps. | The native Linux chaos agent comprises a systemd-based service (configured with a post hook). The agent config, logs, and cron configuration are stored in dedicated, predefined paths. |
| Dependencies (a combination of upstream Linux and Harness utilities required for chaos injection) | - Can be installed just-in-time by the experiment, or placed on the machine beforehand (for disconnected setups).<br>- tc, stress-ng, jq, iproute2, tproxy, dns-interceptor | - Installed as part of the agent installation process.<br>- tc, stress-ng, jq, iproute2, tproxy, byteman |
| Network Connectivity | From the chaos agent:<br>- Outbound over port 443 to Harness from the Kubernetes cluster.<br>- Outbound over port 443 to cloud account resource endpoints from the Kubernetes cluster.<br>- Outbound to application health endpoints (the ones used for resilience validation) from the Kubernetes cluster.<br>From the EC2 instance:<br>- Outbound over port 443 to package repo/Harness S3 endpoints to pull dependencies (in connected mode). | - Outbound over port 443 to Harness from the VM.<br>- Outbound over port 443 to package repo/Harness S3 endpoints to pull dependencies (in connected mode).<br>- Outbound to application health endpoints (the ones used for resilience validation) from the VM. |
| Lifecycle Management | - Availability: Tracked via heartbeat. Can be scaled down to 0 replicas when idle.<br>- Upgrade: Automatic and manual upgrades are supported.<br>- Note: Automated upgrades are available only via Kubernetes manifests. Helm bundle upgrades are manual/offline.<br>- Uninstall/Deletion: The "Disconnect" operation from the control plane removes the subscriber and the configs/secrets involved in auth. | - Availability: Tracked via heartbeat. The service can be stopped when idle.<br>- Upgrade: Only manual upgrades are supported.<br>- Uninstall/Deletion: Performed via an offline uninstaller utility. |
| Permissions/Access for Chaos Injection | Depends on the nature of the fault. See the master policy for EC2 faults, which covers all supported faults on EC2. | Run experiments as the root user. |
| Chaos Experiment Execution | - Max Execution Time: Chaos Duration + Probe Validation Timeout + [~60-120s] (relatively higher)<br>- Note: Involves generating K8s events and creating transient pods to carry out the fault business logic, which can add to the overall execution time.<br>- Parallel Fault Support Within Experiment: Yes<br>- Multi-Infra Support Within Experiment: No<br>- Support for HTTP Probes: Yes<br>- Support for Command Probes in Source Mode (custom validation via user-defined container images): Yes | - Max Execution Time: Chaos Duration + Probe Validation Timeout (relatively lower)<br>- Parallel Fault Support Within Experiment: Yes<br>- Multi-Infra Support Within Experiment: Yes<br>- Support for HTTP Probes: Yes<br>- Support for Command Probes in Source Mode (custom validation via user-defined container images): No |
| Execution Control | - Abort Support: Yes. Internally invokes cancellation of the SSM command (which in turn runs a bash script). However, there is some risk of operations continuing after cancellation, as highlighted by AWS.<br>- SSM Agent Crash: Dependent on AWS-native recovery. | - Abort Support: Yes. An abort-watcher ensures graceful cancellation of the chaos process.<br>- Chaos Agent Crash: The agent service is configured with hooks (ExecStart/Stop) that remove all residual chaos on the system as a safety measure. |
| Logs | Logs are based on the success of the SSM commands; stdout/stderr must be fetched explicitly. | Custom logs that track each stage of the fault injection are available. |
| OS-Specific Fault Coverage | Not available | Available |
| Custom Chaos Support (SSH, Load) | Available | Not available |
| APM Integrations for Probes | Supports Prometheus, Dynatrace, Datadog, and New Relic out-of-the-box. | Supports Dynatrace and Datadog out-of-the-box. Others can be implemented using custom/command probes. |
| Harness Chaos Management Feature Support (Cron, ChaosGuard, Gamedays, CD Integration) | Available | Gameday support available |
| Agent Reuse for Managed Service Chaos | Supported | Not available |
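
The "Max Execution Time" entries in the table can be turned into a rough planning estimate. The following is a minimal sketch, not part of any Harness tooling: the function name `max_execution_window`, the agent labels `"kubernetes"` and `"linux"`, and the choice to treat the approximate 60-120s overhead as a fixed lower and upper bound are all assumptions made for illustration.

```python
# Hypothetical helper: estimates the worst-case execution window (in seconds)
# from the formulas in the comparison table. Names and labels are illustrative.

# Assumption: the "[~60-120s]" overhead for the Kubernetes agent is treated
# as a (min, max) range; the table gives it only as an approximation.
K8S_OVERHEAD_RANGE_S = (60, 120)


def max_execution_window(agent: str, chaos_duration_s: int, probe_timeout_s: int) -> tuple[int, int]:
    """Return (lower, upper) bounds on the maximum execution time in seconds."""
    if chaos_duration_s < 0 or probe_timeout_s < 0:
        raise ValueError("durations must be non-negative")

    base = chaos_duration_s + probe_timeout_s
    if agent == "kubernetes":
        # Transient pods and K8s events add overhead on top of the base time.
        low, high = K8S_OVERHEAD_RANGE_S
        return (base + low, base + high)
    if agent == "linux":
        # The native Linux agent has no additional overhead term in the table.
        return (base, base)
    raise ValueError(f"unknown agent type: {agent!r}")


if __name__ == "__main__":
    for agent in ("kubernetes", "linux"):
        print(agent, max_execution_window(agent, chaos_duration_s=120, probe_timeout_s=30))
```

For a 120-second fault with a 30-second probe validation timeout, this sketch yields a window of 210 to 270 seconds for the Kubernetes agent and 150 seconds for the native Linux agent, which can help when sizing pipeline timeouts around an experiment.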