Cross-Project Cloud SQL Failover Using BYOC Injector
Overview
This topic describes how to implement a cross-project failover for Google Cloud SQL instances using a Bring Your Own Chaos (BYOC) injector. The BYOC injector facilitates the simulation of failover scenarios to test the resilience and high availability of Cloud SQL instances across different projects.
Prerequisites
Before proceeding, ensure the following prerequisites are met:
- Kubernetes > 1.16
- Service account should have editor access (or owner access) to the GCP project.
- The `litmus-admin` Kubernetes secret should have the appropriate permissions to perform a Cloud SQL failover:
  - cloudsql.instances.failover
  - cloudsql.instances.list
- Harness CE provides two ways of providing these permissions to the `litmus-admin` Kubernetes secret: a GCP service account key mounted as a secret, or Workload Identity (described in the cross-project IAM setup later in this topic).
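Whichever method you choose, the underlying GCP service account needs the permissions listed above. The snippet below is a minimal sketch of creating a custom role containing only those permissions; the project ID (`my-sql-project`) and role ID (`cloudSqlFailover`) are hypothetical placeholders.

```bash
# Hypothetical project and role IDs; replace with your own.
gcloud iam roles create cloudSqlFailover \
  --project=my-sql-project \
  --title="Cloud SQL Failover" \
  --permissions=cloudsql.instances.failover,cloudsql.instances.list
```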
Mandatory tunables
| Tunable | Description | Notes |
|---|---|---|
| `LIB_IMAGE` | Image of the helper pod that contains the business logic for the custom fault. | For more information, go to lib image. |
| `COMMAND` | Command to execute in the helper pod. | For more information, go to command. |
| `ARGS` | Arguments to execute in the helper pod. | For more information, go to args. |
Implementation
You can use one of the following ways to initiate failover on Cloud SQL instances:
Using GCP REST APIs
Google Cloud Platform provides REST APIs for all its services, allowing users and developers to interact with them programmatically. This flexibility is crucial for specifying the `LIB_IMAGE` required for chaos injection, particularly when using BYOC. To implement this method, provide scripts in your preferred programming language and build a custom image. For more details, refer to the documentation.
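As an illustration, such a script would call the Cloud SQL Admin API failover method. The `curl` sketch below assumes the caller already holds suitable credentials and that `CLOUD_SQL_PROJECT` and `CLOUD_SQL_INSTANCE_NAME` are set in the environment:

```bash
# Sketch only: triggers a failover via the Cloud SQL Admin API.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://sqladmin.googleapis.com/v1/projects/${CLOUD_SQL_PROJECT}/instances/${CLOUD_SQL_INSTANCE_NAME}/failover"
```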
Using gcloud Binary
The `gcloud` binary offers a range of commands designed for interacting with the Cloud SQL service, including operations like listing, updating, deleting, and initiating failover. Refer to the documentation for more information.
This method is advantageous because it caters to individuals with varying levels of technical expertise, even those without extensive knowledge of coding languages or APIs. By leveraging a combination of `LIB_IMAGE` (a Docker image containing `gcloud`) and `ARGS` to concatenate the necessary commands into a single directive, you can seamlessly implement this approach.
The choice of method depends on your preference. This documentation describes the second approach, that is, the `gcloud` binary.
You will formulate a unified command for chaos injection, which can be specified in the `ARGS` tunable, alongside any image equipped with `gcloud`.
Construct the Command for Cloud SQL Failover
GCP provides a specific command for SQL failover. It requires two inputs (environment variables) and can be executed as follows:
```bash
gcloud sql instances failover "${CLOUD_SQL_INSTANCE_NAME}" --project="${CLOUD_SQL_PROJECT}" -q
```
where:
- `CLOUD_SQL_INSTANCE_NAME`: the name of the designated SQL instance.
- `CLOUD_SQL_PROJECT`: the name of the GCP project in which the SQL instance is located.
To confirm that chaos injection occurred, verify the zone of the SQL instance before and after the injection. This verification is crucial because a successful failover changes the zone. To achieve this, use the following command:
```bash
gcloud sql instances describe "${CLOUD_SQL_INSTANCE_NAME}" --project "${CLOUD_SQL_PROJECT}" --format="get(gceZone)"
```
The command above describes the specified SQL instance and prints its zone as a single value, which keeps the output log easy to analyze.
The next step is to combine the commands described earlier so that you can do the following:
- Retrieve the zone of the target SQL instance before the chaos injection,
- Display this information in the logs,
- Initiate a SQL failover,
- Retrieve the zone of the SQL instance again, and
- Display this updated information in the logs.
By following this approach, you can eliminate the need for manual verification through the GCP Console to observe the zone switch before and after the chaos injection. The combined command is:
```bash
before_zone=$(gcloud sql instances describe "${CLOUD_SQL_INSTANCE_NAME}" \
  --project "${CLOUD_SQL_PROJECT}" --format="get(gceZone)") && \
echo -e "Zone for the primary replica before failover ${before_zone}\n" && \
gcloud sql instances failover "${CLOUD_SQL_INSTANCE_NAME}" \
  --project="${CLOUD_SQL_PROJECT}" -q && \
after_zone=$(gcloud sql instances describe "${CLOUD_SQL_INSTANCE_NAME}" \
  --project "${CLOUD_SQL_PROJECT}" --format="get(gceZone)") && \
echo -e "\nZone for the primary replica after failover ${after_zone}"
```
To execute this command within chaos injection for real SQL failover scenarios, incorporate it within the `ARGS` tunable, as:
```yaml
- name: ARGS
  value: before_zone=$(gcloud sql instances describe "${CLOUD_SQL_INSTANCE_NAME}"
    --project "${CLOUD_SQL_PROJECT}" --format="get(gceZone)") &&
    echo -e "Zone for the primary replica before failover
    ${before_zone}\n" && gcloud sql instances failover
    "${CLOUD_SQL_INSTANCE_NAME}" --project="${CLOUD_SQL_PROJECT}" -q
    && after_zone=$(gcloud sql instances describe
    "${CLOUD_SQL_INSTANCE_NAME}" --project "${CLOUD_SQL_PROJECT}"
    --format="get(gceZone)") && echo -e "\nZone for the primary
    replica after failover ${after_zone}"
```
Helper Pod Image - LIB_IMAGE
To execute this command successfully, specify an image for the `LIB_IMAGE` tunable. Because `harness/chaos-go-runner:main-latest` (a generic image used by Harness CE) already includes the `gcloud` binary, you can use the same image here.
```yaml
- name: LIB_IMAGE
  value: docker.io/harness/chaos-go-runner:main-latest
```
Command to Execute in Helper Pod
The input to the `COMMAND` tunable depends on the image used in the `LIB_IMAGE` tunable.
- For a shell script compatible image, the input would be `/bin/sh, -c`.
- For a bash script compatible image, the input would be `/bin/bash, -c`.
- For the `harness/chaos-go-runner:main-latest` image, which supports bash scripts, the input would be `/bin/bash, -c`, as shown in the snippet after this list.
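In this example, which uses `harness/chaos-go-runner:main-latest`, the `COMMAND` tunable is therefore specified as:

```yaml
- name: COMMAND
  value: /bin/bash, -c
```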
During experiment execution, the helper pod logs help you understand the following:
- Zone of primary replica before SQL failover
- Chaos injection in progress
- Zone of primary replica after SQL failover
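For example, assuming the primary replica moves from `us-central1-a` to `us-central1-b` (hypothetical zones), the helper pod log produced by the `echo` statements in the combined command would resemble:

```
Zone for the primary replica before failover us-central1-a

Zone for the primary replica after failover us-central1-b
```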
During experiment execution, the experiment pod logs help you understand the following:
- Status checks that occur before and after chaos
- Chaos injection in progress
Check Status - Resilience Probes
Create resilience probes that help conduct status checks to verify the health of the infrastructure or application and its readiness to endure chaos injection. You can configure these status checks at different stages of the chaos experiment, such as Start of Test (SOT), End of Test (EOT), OnChaos, Continuous, and Edge (Start & End) of Chaos Injection.
For this example, you can create a command probe in the "Edge" mode as it allows you to verify the status of the SQL instance before and after chaos injection, ensuring that the SQL instance remains operational and healthy.
This validation could also be performed manually through the Google Cloud console, but relying solely on manual checks via the console may not be the most efficient method.
You can fetch the status of the designated SQL instance by using the following `gcloud` command:
```bash
gcloud sql instances describe "${CLOUD_SQL_INSTANCE_NAME}" --project "${CLOUD_SQL_PROJECT}" --format="get(state)"
```
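A command probe can compare this output against the expected healthy state. The sketch below assumes a healthy Cloud SQL instance reports the `RUNNABLE` state:

```bash
# Exits non-zero (probe failure) when the instance is not RUNNABLE.
state=$(gcloud sql instances describe "${CLOUD_SQL_INSTANCE_NAME}" \
  --project "${CLOUD_SQL_PROJECT}" --format="get(state)")
echo "Instance state: ${state}"
[ "${state}" = "RUNNABLE" ]
```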
Cross-Project IAM Setup - Workload Identity
You can use Workload Identity to configure a Google Cloud Platform (GCP) service account so that `litmus-admin` can use it for conducting chaos experiments in other GCP projects while still upholding control over permissions and access to resources.
Suppose the chaos infrastructure is located in `Project A`, and the target SQL instance is located in `Project B`. Let us call the service account to employ `SA`.
You must establish the service account `SA` in `Project A`, link it to the Kubernetes service account `litmus-admin` as specified in the Harness CE documentation, and execute the Workload Identity mapping.
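The snippet below is a minimal sketch of that mapping. The names are hypothetical: the GCP service account `sa@project-a.iam.gserviceaccount.com`, the Workload Identity pool `project-a.svc.id.goog`, and a `litmus` namespace hosting `litmus-admin`.

```bash
# Allow the litmus-admin Kubernetes service account to impersonate SA.
gcloud iam service-accounts add-iam-policy-binding sa@project-a.iam.gserviceaccount.com \
  --project=project-a \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:project-a.svc.id.goog[litmus/litmus-admin]"

# Annotate the Kubernetes service account with the GCP service account.
kubectl annotate serviceaccount litmus-admin -n litmus \
  iam.gke.io/gcp-service-account=sa@project-a.iam.gserviceaccount.com
```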
Once the service account mapping is in place, the next step is to grant the relevant permissions to the service account in other projects. You can accomplish this by designating the same service account as a `PRINCIPAL` in `Project B` and assigning a role with the necessary permissions, as mentioned in the prerequisites.
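As a sketch, the binding can be performed with `gcloud`, reusing the custom role from the prerequisites. The project IDs and service account email are hypothetical placeholders:

```bash
# Grant SA the failover permissions in Project B (hypothetical IDs).
gcloud projects add-iam-policy-binding project-b \
  --member="serviceAccount:sa@project-a.iam.gserviceaccount.com" \
  --role="projects/project-b/roles/cloudSqlFailover"
```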
Based on these changes, the manifest looks like this:
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  annotations:
    probeRef: '[{"mode":"Edge","probeID":"cloud-sql-healthcheck"}]'
  creationTimestamp: null
  generateName: byoc-injector-mv0
  labels:
    app.kubernetes.io/component: experiment-job
    app.kubernetes.io/part-of: litmus
    app.kubernetes.io/version: ci
    name: byoc-injector
    workflow_name: cloud-sql-instance-failover
    workflow_run_id: "{{ workflow.uid }}"
  namespace: "{{workflow.parameters.adminModeNamespace}}"
spec:
  appinfo: {}
  chaosServiceAccount: litmus-admin
  components:
    runner:
      nodeSelector:
        iam.gke.io/gke-metadata-server-enabled: "true"
      resources: {}
  engineState: active
  experiments:
    - args:
        - -c
        - ./experiments -name byoc-injector
      command:
        - /bin/bash
      image: docker.io/harness/chaos-go-runner:main-latest
      imagePullPolicy: Always
      name: byoc-injector
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: LIB_IMAGE
              value: docker.io/harness/chaos-go-runner:main-latest
            - name: COMMAND
              value: /bin/bash, -c
            - name: CLOUD_SQL_INSTANCE_NAME
              value: ""
            - name: CLOUD_SQL_PROJECT
              value: ""
            - name: ARGS
              value: before_zone=$(gcloud sql instances describe "${CLOUD_SQL_INSTANCE_NAME}"
                --project "${CLOUD_SQL_PROJECT}" --format="get(gceZone)") &&
                echo -e "Zone for the primary replica before failover
                ${before_zone}\n" && gcloud sql instances failover
                "${CLOUD_SQL_INSTANCE_NAME}" --project="${CLOUD_SQL_PROJECT}" -q
                && after_zone=$(gcloud sql instances describe
                "${CLOUD_SQL_INSTANCE_NAME}" --project "${CLOUD_SQL_PROJECT}"
                --format="get(gceZone)") && echo -e "\nZone for the primary
                replica after failover ${after_zone}"
          nodeSelector:
            iam.gke.io/gke-metadata-server-enabled: "true"
          resources: {}
          securityContext:
            containerSecurityContext: {}
            podSecurityContext:
              runAsGroup: 0
              runAsUser: 2000
          statusCheckTimeouts: {}
        rank: 0
  jobCleanUpPolicy: delete
  terminationGracePeriodSeconds: 30
status:
  engineStatus: ""
  experiments: null
```