
Application - k10

Intro

Kasten k10 is a proprietary backup solution for k8s from Veeam. Whilst it is paid software, a free license is available for small clusters.

It has benefits over other solutions such as:

  • While it is configured in the UI, the config is stored in the k8s cluster as CRDs. These can be exported to yaml and stored in your git repo, allowing for declarative gitops of settings
  • The ability to do both snapshots and exports (e.g. to NFS or S3). This allows, for example, an hourly snapshot of a game server plus a nightly export of a snapshot to NFS (see the policy sketch below)
  • Storing a set of backups - w hourly, x daily, y weekly, z yearly
  • A disaster recovery policy which (by default) runs hourly. This can restore k10 itself from a blank slate using an external source (NFS/S3), which then allows a full or selective cluster restore
  • Restoring only the data PVCs on restore (excellent for those of us with cluster yaml stored in git)
  • Restoring to a different namespace (excellent for cloning a PVC into a test namespace)
  • Comes with its own prometheus and grafana (which can be federated to a main prometheus, as I have done)
  • Clearly shows PVCs or apps that are missing from a backup
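
Scheduling, export destination and retention all end up in a single k10 Policy custom resource. The sketch below is only a rough illustration of the hourly-snapshot-plus-nightly-export idea - the selector, retention counts and profile name are placeholders, so export one of your own policies with kubectl to get the authoritative schema for your k10 version.

apiVersion: config.kio.kasten.io/v1alpha1
kind: Policy
metadata:
  name: games-hourly-policy           # illustrative name only
  namespace: kasten-io
spec:
  comment: Hourly snapshots, exported nightly
  frequency: '@hourly'
  retention:                          # restore points to keep at each tier
    hourly: 24
    daily: 7
    weekly: 4
    monthly: 6
    yearly: 1
  selector:
    matchExpressions:
      - key: k10.kasten.io/appNamespace
        operator: In
        values:
          - games                     # placeholder namespace to protect
  actions:
    - action: backup                  # local snapshot every run
    - action: export                  # copy a snapshot out to a location profile
      exportParameters:
        frequency: '@daily'
        profile:
          name: synology              # placeholder location profile name
          namespace: kasten-io
        exportData:
          enabled: true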

k10 Saves

This app has already saved my cluster once (at time of writing!). Don't skimp on backups!

Folder Layout

This deployment has two layers - the root kustomization.yaml installing the helm-release and disaster recovery secret, and the k10-config folder kustomization.yaml, which depends on the base kustomization.

├── helmrelease.yaml                        # (1)
├── k10-config                              # (2)
│   ├── blueprints                          # (3)
│   │   ├── hyperion.yaml
│   │   ├── k10-disaster-recovery.yaml
│   │   ├── kustomization.yaml
│   │   ├── postgresql-blueprint.yaml
│   │   └── secret.yaml
│   ├── kustomization.yaml                  # (4)
│   ├── monitoring                          # (5)
│   │   ├── kustomization.yaml
│   │   ├── prometheus-rule.yaml
│   │   └── service-monitor.yaml
│   ├── policies                            # (6)
│   │   ├── daily-backup-policy.yaml
│   │   ├── games-hourly-policy.yaml
│   │   ├── k10-dr-policy.yaml
│   │   └── kustomization.yaml
│   └── profiles                            # (7)
│       ├── backblaze-b2-secret.sops.yaml
│       ├── backblaze-b2.yaml
│       ├── home.yaml
│       ├── k10-backups-pvc.yaml
│       ├── k10-disaster-recovery.yaml
│       ├── kustomization.yaml
│       ├── media.yaml
│       └── synology.yaml
├── kustomization.yaml                      # (8)
└── secret.sops.yaml                        # (9)
  1. k10 helmrelease (installs the CRDs used by the config folder)
  2. Config folder - loads second, after the k10 base kustomization has installed k10 and settled, since the config objects are custom resources of those CRDs (see the Flux Kustomization sketch below)
  3. Blueprints (allow customization of backup behavior, or pre/post hooks)
  4. k10-config kustomization (calls on the 4 folders in this directory)
  5. Prometheus monitoring (alerts for failed backups etc.)
  6. Policies (snapshot & backup plans)
  7. Profiles (backup locations)
  8. Base kustomization
  9. SOPS secret for disaster recovery (including the cluster-id key for safekeeping)
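
The ordering in (2) comes from Flux rather than from k10 itself. Below is a minimal sketch of the two Flux Kustomizations, assuming a GitRepository named flux-system and the repo path shown later on this page (k8s/manifests/kasten-io/k10) - names and paths are illustrative, so adjust them to your own layout.

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: k10
  namespace: flux-system
spec:
  interval: 10m
  path: ./k8s/manifests/kasten-io/k10
  prune: true
  wait: true                          # wait for the helmrelease (and its CRDs) to be ready
  sourceRef:
    kind: GitRepository
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: k10-config
  namespace: flux-system
spec:
  interval: 10m
  path: ./k8s/manifests/kasten-io/k10/k10-config
  prune: true
  dependsOn:
    - name: k10                       # config CRs only apply once k10 has settled
  sourceRef:
    kind: GitRepository
    name: flux-system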

Install

I have taken the route of:

  • Installing the helm release
  • Setting up with the UI
  • Using the nifty UI helpers that let you copy a kubectl command to view most of what you can configure
  • Putting these into YAML in my github repo and letting flux reconcile over the top of it.

Gitops risk

Having flux reconcile the config does mean any changes you make in the UI need to be committed back to your gitops repo ASAP, or flux will revert them. This approach has pros and cons.

This is only an example, for explanation purposes

Check my cluster (or others' clusters) for up-to-date examples. More helm values info can be found at https://docs.kasten.io/latest/install/advanced.html

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: k10
  namespace: kasten-io
spec:
  releaseName: k10
  interval: 5m
  chart:
    spec:
      chart: k10
      version: 4.5.12
      sourceRef:
        kind: HelmRepository
        name: kasten-charts
        namespace: flux-system
      interval: 5m
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
  values:
    eula:
      accept: true                                          # (1)
      company: Truxnell                                     # (2)
      email: [email protected]              # (3)

    global:
      persistence:                                          # (4)
        storageClass: ceph-block

    auth:                                                   # (5)
      tokenAuth:
        enabled: true

    clusterName: hegira                                     # (6)

    ingress:
      create: true                                          # (9)
      host: k10.${EXTERNAL_DOMAIN}
      annotations:
        kubernetes.io/ingress.class: traefik
        traefik.ingress.kubernetes.io/router.entrypoints: websecure
        traefik.ingress.kubernetes.io/router.middlewares: network-system-rfc1918-ips@kubernetescrd
        hajimari.io/enable: "true"
        hajimari.io/icon: backup-restore
        hajimari.io/appName: "K10"
        hajimari.io/url: "https://k10.${EXTERNAL_DOMAIN}/k10/"
      urlPath: "k10"                                        # (7)
      hosts:
        - k10.${EXTERNAL_DOMAIN}
      tls:
        enabled: true

    grafana:                                                # (8)
      enabled: false
  1. EULA must be 'manually' accepted
  2. 'Company name' - required
  3. Email for license - a GitHub no-reply email is useful here - required
  4. Note we aren't using an existing PVC and are letting ceph do as it pleases here - as we have a disaster recovery backup, it's not important to control this
  5. Token auth creates a token to log in to k10 - helpful as it's an easy login method
  6. Name of cluster
  7. Note this url path - it is a subpath - so my k10 is located at https://k10.hegira.domain.tld/k10
  8. Grafana can be installed, but I want to use my central instance not k10's.
  9. Worth noting this key is required for ingress

Disaster recovery key

k10 Disaster Recovery can be enabled in Settings -> K10 Disaster Recovery in the UI. When you enable this, you will be asked for a passphrase and given a cluster ID.

Doing a full DR from a clean install requires the passphrase to be placed in a secret before install. I chose to put this in a SOPS secret which is loaded with the k10 helmrelease, so it is always on my cluster ready to go. I also put the clusterID in the secret - k10 does not need it there, but it's ready to go in an easy-to-find spot.

You can find this clusterId if you have access to the remote DR backup folder - the folder is named after the clusterID.

apiVersion: v1
kind: Secret
type: Opaque
metadata:
    name: k10-dr-secret
    namespace: kasten-io
stringData:
    key: <super secret password>        # (1)
    clusterId: <cluster id>             # (2)
  1. Password given at setup of DR - it is used as the encryption key for the DR.
  2. ClusterId given in the UI. Looks like 09832d646-209d-438c-95bc-0fgfa5ac6d93

Prometheus federation

k10's prometheus can be federated back to a main prometheus using a ServiceMonitor. Below is an example configuration that will scrape the k10 prometheus. k8s/manifests/kasten-io/k10/k10-config/ has both the ServiceMonitor and the alerting rule.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: k10
  namespace: kasten-io
spec:
  namespaceSelector:
    matchNames:
      - kasten-io
  selector:
    matchLabels:
      app: prometheus
  endpoints:
    - port: http
      scheme: http
      path: /k10/prometheus/federate
      honorLabels: true
      interval: 15s
      params:
        "match[]":
          - '{__name__=~"jobs.*"}'
          - '{__name__=~"catalog.*"}'
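
The alerting half (prometheus-rule.yaml in the folder layout above) is just a standard PrometheusRule evaluated against the federated series. The sketch below is illustrative only - catalog_actions_count and its status label are an assumption on my part, so check which catalog.* series your prometheus actually ingests, and make sure the rule is picked up by your Prometheus ruleSelector.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k10
  namespace: kasten-io
spec:
  groups:
    - name: k10.rules
      rules:
        - alert: KastenActionFailed
          # Assumption: one of the federated catalog.* series reports actions by status
          expr: 'catalog_actions_count{status="failed"} > 0'
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: k10 is reporting one or more failed actions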

Login

Read the docs

The k10 docs provide what is needed here https://docs.kasten.io/latest/access/authentication.html#token-authentication

I chose to be lazy and just use the token auth that is provided. I believe others use SSO solutions; feel free to choose your own approach (and level of effort).

# Assume K10 is installed in the 'kasten-io' namespace
# Extracting the token from SA 'my-kasten-sa'

# Get the SA secret name
sa_secret=$(kubectl get serviceaccount my-kasten-sa -o jsonpath="{.secrets[0].name}" --namespace kasten-io)

# Extract and decode the token
kubectl get secret $sa_secret --namespace kasten-io -o jsonpath="{.data.token}{'\n'}" | base64 --decode

The token output to the console can just be pasted into the UI as the password.
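
One caveat: on Kubernetes 1.24 and newer, service accounts no longer get a long-lived token secret created automatically, so the .secrets[0].name lookup above comes back empty. In that case you can create the token secret by hand - a sketch, assuming the my-kasten-sa service account from the script already exists:

apiVersion: v1
kind: Secret
metadata:
  name: my-kasten-sa-token
  namespace: kasten-io
  annotations:
    # Binds this secret to the existing service account; k8s fills in the token
    kubernetes.io/service-account.name: my-kasten-sa
type: kubernetes.io/service-account-token

Once it exists, read .data.token from this secret directly instead of going via the service account.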

Restoring a pvc

The general workflow for a gitops cluster can be described as:

  1. Scale down the relevant deployments to 0 (to release the PVCs)
  2. In the k10 UI, select the desired backup in the Applications section
  3. Select 'Data-only restore' (this restores only the PVC data and no configuration)
  4. Check progress of restore in dashboard
  5. Scale up deployment

Disaster recovery

Learn from my mistakes

I'm writing this because I force-deleted a kustomization, which led to my entire cluster being wiped. k10 saved my rear end and the DR worked fine. Be careful deleting kustomizations!

Read the docs

The k10 docs provide what is needed here https://docs.kasten.io/latest/operating/dr.html

As long as a DR policy is active and the passphrase and clusterID have been retained, recovery is easy and safe.

To recover, I bootstrap the cluster from scratch and let everything come up, including a blank k10. This also installs the required secret k10-dr-secret, as noted in 'Specifying a DR Passphrase' in the above link.

Then we can run the disaster recovery helm install, checking and replacing the namespace, clusterID and profile name below (likely k10-disaster-recovery-policy).

# Install the helm chart that creates the K10 restore job and wait for completion of the `k10-restore` job
# Assumes that K10 is installed in the 'kasten-io' namespace
helm install k10-restore kasten/k10restore --namespace=kasten-io \
    --set sourceClusterID=<source-clusterID> \
    --set profile.name=<location-profile-name>
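
For reference, the location profile named above is itself just another custom resource from the profiles folder, so gitops should have recreated it before you run the restore. A rough sketch of an S3-compatible (e.g. Backblaze B2) location profile follows - every value is a placeholder, and the schema is best taken from a kubectl export of one of your own profiles.

apiVersion: config.kio.kasten.io/v1alpha1
kind: Profile
metadata:
  name: k10-disaster-recovery         # placeholder profile name
  namespace: kasten-io
spec:
  type: Location
  locationSpec:
    type: ObjectStore
    objectStore:
      objectStoreType: S3
      name: my-k10-dr-bucket          # placeholder bucket name
      region: us-west-002             # placeholder region
      endpoint: https://s3.us-west-002.backblazeb2.com   # placeholder S3-compatible endpoint
    credential:
      secretType: AwsAccessKey
      secret:
        apiVersion: v1
        kind: Secret
        name: backblaze-b2-secret     # placeholder; holds the access key id / secret key
        namespace: kasten-io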

Ensure k10 is up to date

Ensure k10 is up to date - having an older version and then installing the latest DR helm chart can lead to funky and undesired results.

This will take a few minutes or more and will restart most of k10 at least once. The UI will display a 'recovery in progress' and finally a 'complete' message.

Cluster id changes

It appears my clusterID changed after a successful restore, so this may be worth checking after a DR. I had to update my stored clusterID to reflect the 'new' cluster.

