Troubleshooting Guide

This comprehensive guide helps you quickly resolve common issues in Harness Chaos Engineering.

Quick Reference

Common Issues by Category

Infrastructure Connection - Can't connect to chaos infrastructure
Experiment Failures - Experiments stuck or failing
Windows Issues - Windows-specific problems
Discovery Problems - Service discovery not working
Known Limitations - Current platform limitations

Essential Debug Commands

# Check infrastructure status
kubectl get pods -n <namespace>

# View infrastructure logs
kubectl logs -f <pod-name> -n <namespace>

# Describe problematic pods
kubectl describe pod <pod-name> -n <namespace>

# Check experiment status
kubectl get chaosengines -n <namespace>

Emergency Procedures

Experiment Stuck in Production:

Stop the experiment immediately from UI
Check target application health
Review blast radius settings
Contact support if critical

Infrastructure Unresponsive:

Check cluster resources: kubectl top nodes
Restart chaos infrastructure pods
Verify network connectivity
Check firewall/security groups

Kubernetes Infrastructure Troubleshooting

Unable to Connect to Kubernetes Infrastructure Server

Severity: High | Time to Fix: 10-30 minutes | Difficulty: Beginner

Most connection errors stem from problems with the chaos infrastructure setup.

Quick Diagnosis:

Infrastructure pods not running
Network connectivity issues
Authentication problems

Solution Steps:

If you are unable to connect to the Kubernetes infrastructure server, try the following:

Use ping on the subscriber or any other pod to test if the response times for app.harness.io or another URL are reasonable and consistent.
Use traceroute on app.harness.io to check the network route.
Use nslookup to confirm that the DNS resolution is working for app.harness.io.
Connect using the IP address for app.harness.io (you can get the IP address using nslookup). For example, http://<IP-address>/#/login.
Check for local network issues, such as proxy errors or NAT license limits.
On some cloud platforms, such as AWS EC2, ensure that the security groups allow outbound HTTPS traffic on port 443.
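
The checks above can be combined into a small script. The following is a minimal sketch, assuming `nslookup` and `curl` are installed where you run it; the function names `interpret_http_code` and `check_harness_connectivity` are illustrative and not part of any Harness tooling.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of Harness tooling.

# Translate the HTTP status code reported by curl into a short diagnosis.
# curl reports 000 when it received no HTTP response at all.
interpret_http_code() {
  case "$1" in
    000) echo "no-response" ;;
    2??|3??) echo "reachable" ;;
    4??|5??) echo "reachable-with-error" ;;
    *) echo "unknown" ;;
  esac
}

# Resolve the host first, then attempt an HTTPS request on port 443.
# The function body runs in a subshell so its variables stay local.
check_harness_connectivity() (
  host="${1:-app.harness.io}"
  if nslookup "$host" >/dev/null 2>&1; then
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://$host/")
    interpret_http_code "$code"
  else
    echo "dns-failure"
  fi
)

# Usage (requires network access):
#   check_harness_connectivity app.harness.io
```

The function prints `dns-failure` when name resolution fails and `no-response` when DNS works but nothing answers on port 443, which points toward a firewall, proxy, or security group issue.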

Connection Fails After Namespace Setup

Severity: Medium | Time to Fix: 5-15 minutes | Difficulty: Beginner

After you set up the namespace and pods, the Kubernetes infrastructure fails to connect.

Symptoms:

Infrastructure shows as "Disconnected" in UI
Pods are running but not communicating
Timeout errors in logs

Troubleshooting Checklist:

Check the status of your chaos infrastructure on your cluster:
```
kubectl get pods -n <namespace_name>
```

Check the chaos infrastructure logs:

kubectl logs -f <pod-name> -n <namespace_name>

If the chaos infrastructure is not in a healthy state:

kubectl describe pods <pod-name> -n <namespace_name>

Check the logs of all pods in the namespace.
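
To spot unhealthy pods at a glance, you can filter the output of the first command in the checklist. This is a sketch, assuming the default column layout of `kubectl get pods` (NAME, READY, STATUS, ...); `filter_unhealthy_pods` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: assumes the default `kubectl get pods` column order.

# Read `kubectl get pods --no-headers` output on stdin and print
# "<pod-name> <status> <ready>" for every pod that is neither Completed
# nor Running with all of its containers ready.
filter_unhealthy_pods() {
  awk '
    $3 == "Completed" { next }
    {
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
    }'
}

# Usage (requires cluster access):
#   kubectl get pods -n <namespace_name> --no-headers | filter_unhealthy_pods
```

A pod that is Running but not fully ready (for example, 0/1) is also reported, because such a pod may explain a "Disconnected" status even though it appears to be up.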

Cluster in GCP Has Un-schedulable Pods

GCP might report that a cluster has pods that can't be scheduled. This can occur if your Kubernetes cluster doesn't have sufficient capacity.

Workaround:

If your Kubernetes cluster isn't big enough and doesn't have autoscaling enabled, it can't run the delegate (the remote component that helps access your K8s cluster and inject faults).

To fix this issue:

Add more capacity to the cluster or turn on autoscaling
Wait for the cluster to restart
Reconnect to the cluster

Re-run the following command:

kubectl apply -f harness-chaos-enable.yml
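
If you suspect a capacity problem, the pending pods and their scheduling events can help confirm it. A minimal sketch follows; the helper names and the `KUBECTL` override variable are assumptions made for illustration, while the `--field-selector` flags are standard kubectl options.

```shell
#!/bin/sh
# Sketch only: helper names and the KUBECTL override are illustrative.
KUBECTL="${KUBECTL:-kubectl}"

# List pods that the scheduler has not yet placed on a node.
list_pending_pods() {
  $KUBECTL get pods -n "$1" --field-selector=status.phase=Pending
}

# List events that the scheduler emitted when it could not place a pod.
list_scheduling_failures() {
  $KUBECTL get events -n "$1" --field-selector reason=FailedScheduling
}

# Usage (requires cluster access):
#   list_pending_pods <namespace>
#   list_scheduling_failures <namespace>
```

Event messages that mention insufficient CPU or memory suggest that the cluster needs more capacity or autoscaling.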

Discovery Agent Troubleshooting

If the Discovery Agent is unable to discover services:

Check Pod Status: Fetch the pods in the dedicated namespace in your target cluster. For example, if you have created a namespace harness-chaos in your target cluster:
```
kubectl get pods -n harness-chaos
```
Get Pod Metadata: If you see a particular pod failing or in some erroneous state:
```
kubectl describe pod <Pod-Name> -n harness-chaos
```
View Pod Logs: View the logs of that particular pod:
```
kubectl logs -f <Pod-Name> -n harness-chaos
```
Check Delegate Logs: If the logs suggest that no resources were found in the dedicated namespace, check the installed delegates. List the delegate pods, then view the logs of each pod with kubectl logs:
```
kubectl get pods -n harness-delegate-ng
```
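
When several pods are involved, it can be convenient to collect recent logs from every pod in a namespace in one pass. The sketch below assumes the `harness-chaos` namespace from the example above; `dump_namespace_logs` and the `KUBECTL` override variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper name and the KUBECTL override are illustrative.
KUBECTL="${KUBECTL:-kubectl}"

# Print the last N log lines (default 50) of every pod in a namespace.
# $1: namespace, $2: number of lines per pod
dump_namespace_logs() (
  namespace="$1"
  tail_lines="${2:-50}"
  $KUBECTL get pods -n "$namespace" -o name | while read -r pod; do
    echo "=== $pod ==="
    # Redirect stdin so the log command cannot consume the pod list.
    $KUBECTL logs -n "$namespace" "$pod" --all-containers --tail="$tail_lines" < /dev/null
  done
)

# Usage (requires cluster access):
#   dump_namespace_logs harness-chaos 100
```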

Probe Related Troubleshooting

Environment Variable and Secret Usage in Command Probe Source Mode

You can use secrets and environment variables in the source mode of the command probe using the manifest in the following manner:

source:
  env:
  - name: name
    value: test
  volumes:
  - name: volume-secret
    secrets:
    - name: vm-credentials
  volumeMount:
  - name: volume-secret
    mountPath: /etc/volume-secret
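
Inside the probe command, the environment variable can be read like any other shell variable. The sketch below additionally assumes that the secret is projected like a standard Kubernetes secret volume, with one file per key under the mountPath; the helper `read_mounted_secret` and the key name `username` are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes one file per secret key under the mountPath.

# Print the value of a secret key that is mounted as a file.
# $1: key name, $2: mount directory (defaults to the mountPath in the manifest)
read_mounted_secret() {
  cat "${2:-/etc/volume-secret}/$1"
}

# Example probe command (the key "username" is hypothetical):
#   echo "env value: $name, user: $(read_mounted_secret username)"
```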

Fault and Experiment Related Troubleshooting

Memory Stress Fault StressNG Flag Usage

You can use the stressNGFlags attribute to provide additional flags to the stress-ng command used in the memory stress fault. For example:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-memory-stress
    spec:
      components:
        env:
        - name: MEMORY_CONSUMPTION
          value: '500'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
      definition:
        chaos:
          experiment: linux-memory-stress
          stressChaos/inputs:
            duration: 30s
            workers: 1
            memory: 5m
            stressNGFlags: "--vm-populate"

Tip: The --vm-populate flag in the manifest above populates the memory, thereby stressing it. It is only an example of how you can use the stressNGFlags attribute.

Executing an Experiment Moves it to QUEUED State

If your experiment moves to a QUEUED state:

Check if there are sufficient resources available in the cluster.
Check the logs of the control plane components, such as Chaos Manager and Kubernetes IFS.
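
One way to check for resource pressure is to flag nodes whose utilization is high. This sketch assumes that the metrics server is installed so that `kubectl top nodes` works; the helper name `flag_busy_nodes` and the default threshold of 80 percent are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch only: helper name and default threshold are illustrative.

# Read `kubectl top nodes --no-headers` output on stdin and print
# "<node> <cpu%> <memory%>" for nodes at or above the threshold.
# $1: threshold in percent (default 80)
flag_busy_nodes() {
  awk -v limit="${1:-80}" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)
    if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0) print $1, $3, $5
  }'
}

# Usage (requires cluster access and a metrics server):
#   kubectl top nodes --no-headers | flag_busy_nodes 80
```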

Experiment Directly Moves to ERROR State with No Execution Data

If you execute a chaos experiment but it directly moves to the ERROR state without providing any execution data, it means that the experiment was successfully sent to the subscriber, but the subscriber failed to start the experiment.

To verify this:

Go to Chaos Experiments in the UI and navigate to the experiment you created.
Select the ⋮ icon and select View runs. Navigate to the specific run, select the ⋮ icon, and then select View Run Report.

Debug Steps:

As the first step, check the workflow controller logs.
- If the logs suggest that the experiment run name exceeds the limit, shorten the experiment name.
If the experiment doesn't have a label as an instance ID (aka infrastructure ID), check if you deployed the experiment manually or generated it from the UI (frontend).
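
A quick length check can rule out the name-length cause before you dig through logs. This guide does not state the exact limit, so the sketch below takes it as a parameter and defaults to 63 characters, which is the Kubernetes limit for label values; treat that default as an assumption. `name_within_limit` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: the 63-character default is an assumption, not a Harness limit.

# Succeed when the name is no longer than the limit.
# $1: name to check, $2: maximum length (default 63)
name_within_limit() {
  [ "${#1}" -le "${2:-63}" ]
}

# Usage:
#   name_within_limit "my-experiment-name" || echo "name too long"
```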

Experiment Step Node is in PENDING State

If one of the experiment step nodes is in a PENDING state, check:

Resource availability in the cluster
Node scheduling constraints
Pod security policies
Network policies that might be blocking communication

Live Logs of an Experiment Result in an Error

Debug Steps: If you receive an error, or nothing shows up, when you access the live logs of your experiment run:

Check the status of the sidecar container in the experiment pod.
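
To see which container in the experiment pod is failing, you can list the readiness of each container. The sketch below uses the standard `-o jsonpath` output of kubectl; the helper names and the `KUBECTL` override variable are hypothetical, and the name of the sidecar container is not specified in this guide.

```shell
#!/bin/sh
# Sketch only: helper names and the KUBECTL override are illustrative.
KUBECTL="${KUBECTL:-kubectl}"

# Print one "<container-name><TAB><ready>" line per container in a pod.
# $1: pod name, $2: namespace
container_statuses() {
  $KUBECTL get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'
}

# Read container_statuses output on stdin and print containers that are not ready.
not_ready_containers() {
  awk -F '\t' '$2 != "true" { print $1 }'
}

# Usage (requires cluster access):
#   container_statuses <pod-name> <namespace> | not_ready_containers
```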

Windows Chaos Infrastructure Troubleshooting

Windows Troubleshooting Quick Start

Before you begin:

Ensure you're running as Administrator
Check Windows version compatibility (64-bit required)
Verify user account permissions

Service Created but in Stopped State

Error Message:

The service did not start due to a logon failure.

Solution (Method 1) - Grant Logon as a Service Permission:

Open Local Security Policy.
Navigate to User Rights Assignment.
Find "Log on as a service" and add the user to this policy.
Apply and save the changes.
Start the Service: Restart the WindowsChaosInfrastructure service from the Services tab in Task Manager.
Check Logs: If the issue persists, refer to the log file at C:\HCE\Logs for more details.

Solution (Method 2) - Manual Service Configuration:

Open Task Manager.
- Press Ctrl + Shift + Esc, or search for Task Manager.
- Switch to the "Services" tab.
Locate the service.
- Find the "WindowsChaosInfrastructure" service.
- Right-click on the service and select "Open Services".
Modify Service Properties:
- In the Services window, locate "WindowsChaosInfrastructure" again.
- Right-click on it and choose "Properties".
- Go to the "Log On" tab.
Provide User Credentials.
- Enter the credentials of the user account that should run the service.
- Select Apply. You should receive a confirmation that the account has been granted the "Log on as a service" right.
Start the Service: Apply the changes and start the service. The service should now enter a running state.

Installation Failed with "The Specified Service Already Exists"

Error Message:

The specified service already exists.

Solution:

Run the uninstallation script to remove the previous installation.
Manually remove the previous installation:
- Delete the service: sc delete WindowsChaosInfrastructure
- Remove the chaos directory: C:\HCE
Reinstall: After cleanup, re-run the installation script.

Installation Failed with "Account Name is Invalid"

Error Message:

The account name is invalid or does not exist, or the password is invalid for the account name specified.

Solution:

Verify account name: Ensure that the account name provided in the -AdminUser flag is correct and exists on the system.
Correct Syntax: Use the correct syntax, for example, .\Administrator for the local administrator account.

Service Fails to Create with Exit Code 216

Solution:

Check Windows version: The error indicates incompatibility with the Windows version. Currently, only 64-bit versions are supported. Support for 32-bit versions is planned for future releases.

Default Command Fails with "Could not create SSL/TLS secure channel"

Solution:

Force TLS 1.2 by adding the following line to the beginning of your command:

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12

Example:

powershell -Command "& { [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12; Invoke-WebRequest -Uri 'https://app.harness.io/public/shared/tools/chaos/windows/1.32.0/install.ps1' -OutFile 'install.ps1' -UseBasicParsing; .\install.ps1 -AdminUser '<admin-user>' -AdminPass '<admin-password>' -InfraId '<infra-id>' -AccessKey '<access-key>' -ServerUrl '<server-url>' }"

Known Issues

Incorrect Upgrade Prompt

The Upgrade now button appears even when the Kubernetes infrastructure is on the latest version, because the API does not return the correct update status.

Copy to Clipboard Issue

If you try to access the Harness Self-Managed Enterprise Edition (SMP) portal over an HTTP-based connection, the Copy to clipboard facility will not work. This facility works only when you access SMP over an HTTPS-based connection.

Status Timeout Check Issue

For faults such as node network loss and kubelet service kill, the default status check timeout is 180 seconds. If the chaos experiment duration is less than 180 seconds, the experiment can fetch the status of the helper pod once the duration is complete, because this happens within the timeout threshold. If the duration is more than 180 seconds, the status check times out before the fault completes, and the fault errors out.

Solution: Increase the status check timeout so that it exceeds the experiment duration. The experiment can then complete and fetch the status of the helper pods.
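
The rule above reduces to simple arithmetic: the timeout must be larger than the chaos duration. A minimal sketch, assuming a 60-second safety buffer (an arbitrary value chosen for illustration); `recommended_status_timeout` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: the 60-second buffer is an assumption, not a Harness default.

# Print a status check timeout (in seconds) that covers the chaos duration.
# Never goes below the 180-second default described above.
# $1: chaos duration in seconds, $2: safety buffer in seconds (default 60)
recommended_status_timeout() (
  duration="$1"
  buffer="${2:-60}"
  default_timeout=180
  wanted=$(( duration + buffer ))
  if [ "$wanted" -gt "$default_timeout" ]; then
    echo "$wanted"
  else
    echo "$default_timeout"
  fi
)

# Usage:
#   recommended_status_timeout 300
```

For a 300-second fault this prints 360; for a 60-second fault it keeps the 180-second default.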

Enable/Disable Linux Resilience Probe

Similar to the Kubernetes probe, you can enable or disable a Linux probe from the probe table. When you do so, two fields with empty values, type and attempt, are also added to the Linux probe.

When you manually edit the type and attempt parameters in a Linux resilience probe manifest, the edited values are not reflected in the updated manifest. Instead, the manifest reflects the values from the database.

Example:

For a Linux experiment, if you have a probe named abc:

probe:
  - name: abc
    mode: SOT

When you enable or disable the probe, the manifest changes as follows:

probe:
  - name: abc
    type: ""
    runProperties:
      attempt: 0
    mode: SOT

Windows Chaos Infrastructure Limitations

Integration With Pipeline

The Harness CE platform provides native integration for chaos experiments with the CD module. However, integrating Windows chaos experiments with Harness CD pipelines is currently not supported.

GameDay Support

Harness CE currently offers GameDay support for orchestrating and running multiple Kubernetes experiments across various infrastructures in a coordinated manner. However, GameDay support for Windows chaos experiments is not yet available.

Inclusion in ChaosGuard Policies

ChaosGuard is a Harness CE feature that enhances the security and control of chaos experiments. Currently, ChaosGuard policies don't support Windows chaos experiments.

Scheduled or Cron Experiments

Currently, Windows chaos experiments do not support scheduled or cron-based executions; you can't set experiments to run automatically at specified intervals.

Auto Upgrade Support

The Windows chaos infrastructure currently doesn't support auto-upgrades. To upgrade, manually uninstall the current infrastructure using the provided uninstallation script, and then reinstall it with the desired infrastructure version.

Known Limitations of Resilience Probes

Command probes in the source mode are available for Kubernetes in both SMP and Harness CE SaaS.
Command probes in the source mode are not available for Linux in Harness CE SaaS.
For Linux, command probes in the source mode are available only in SMP (Self-Managed Enterprise Edition).

Additional Resources

For further assistance, refer to the documentation or contact Harness Support.
