Application - k10¶
Intro¶
Kasten k10 is a proprietary backup solution for k8s from Veeam. Whilst it is paid software, a free license is available for small clusters.
It has benefits over other solutions, such as:
- While it is configured in the UI, the config is stored in the k8s cluster as CRDs. These can be exported to YAML and stored in your git, allowing for declarative gitops of settings
- The ability/concept to do a snapshot, and an export (e.g. to NFS or S3). This allows, for example, an hourly snapshot of a game server and a nightly export of a snapshot to NFS (see the policy sketch after this list)
- Storing a set of backups - v hourly, x daily, y weekly, z yearly
- A disaster recovery policy which (by default) runs hourly. This can restore k10 from a blank slate from an external source (NFS/S3), which then allows a full or selective cluster restore
- Restoring only the data PVCs (excellent for those of us with cluster YAML stored in git)
- Restore to a different namespace (excellent for cloning a PVC for testing in a test namespace)
- Comes with its own prometheus and grafana (which can be federated to a main prometheus, as I have done)
- Clearly shows PVCs or apps that are missed by a backup
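To give a feel for what those settings look like once exported, below is a rough sketch of a K10 Policy CRD combining an hourly snapshot, a daily export to a location profile and tiered retention. It is illustrative only - the namespace, profile name and retention counts are assumptions rather than a copy of a real policy - so check the Kasten policy API docs and your own exported YAML for the real thing.

```yaml
apiVersion: config.kio.kasten.io/v1alpha1
kind: Policy
metadata:
  name: games-hourly-policy   # assumed name, mirroring the folder layout below
  namespace: kasten-io
spec:
  comment: Hourly snapshots of the games namespace, exported daily off-cluster
  frequency: "@hourly"        # how often the snapshot action runs
  retention:                  # the hourly/daily/weekly/yearly tiers mentioned above
    hourly: 24
    daily: 7
    weekly: 4
    monthly: 6
    yearly: 1
  selector:                   # which application(s) the policy covers
    matchExpressions:
      - key: k10.kasten.io/appNamespace
        operator: In
        values:
          - games             # assumed namespace
  actions:
    - action: backup          # local snapshot
    - action: export          # copy a snapshot off-cluster
      exportParameters:
        frequency: "@daily"   # export once a day, not every hour
        profile:
          name: synology      # assumed location profile (see profiles/ below)
          namespace: kasten-io
        exportData:
          enabled: true
```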
k10 Saves
This app has already saved my cluster once (at time of writing!). Don't skimp on backups!
Folder Layout¶
This deployment has two layers - the root kustomization.yaml, which installs the helm-release and disaster recovery secret, and the k10-config folder kustomization.yaml, which has a dependency on the base kustomization.
├── helmrelease.yaml # (1)
├── k10-config # (2)
│   ├── blueprints # (3)
│   │   ├── hyperion.yaml
│   │   ├── k10-disaster-recovery.yaml
│   │   ├── kustomization.yaml
│   │   ├── postgresql-blueprint.yaml
│   │   └── secret.yaml
│   ├── kustomization.yaml # (4)
│   ├── monitoring # (5)
│   │   ├── kustomization.yaml
│   │   ├── prometheus-rule.yaml
│   │   └── service-monitor.yaml
│   ├── policies # (6)
│   │   ├── daily-backup-policy.yaml
│   │   ├── games-hourly-policy.yaml
│   │   ├── k10-dr-policy.yaml
│   │   └── kustomization.yaml
│   └── profiles # (7)
│       ├── backblaze-b2-secret.sops.yaml
│       ├── backblaze-b2.yaml
│       ├── home.yaml
│       ├── k10-backups-pvc.yaml
│       ├── k10-disaster-recovery.yaml
│       ├── kustomization.yaml
│       ├── media.yaml
│       └── synology.yaml
├── kustomization.yaml # (8)
└── secret.sops.yaml # (9)
- k10 helmrelease (installs the CRDs used by the config folder)
- Config folder - loads second, after the k10 base kustomization has loaded k10 and settled (due to the config being CRDs); see the flux Kustomization sketch after this list
- Blueprints (allows customization of backup behavior, or pre/post hooks)
- k10-config kustomization (calls on the 4 folders in this directory)
- Prometheus monitoring (alerts for failed backups etc.)
- Policies (Snapshot & Backup plans)
- Profiles (Backup locations)
- Base kustomization
- SOPS secret for disaster recovery (including the cluster-id key for safekeeping)
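To make that ordering concrete, here is a minimal sketch of a flux Kustomization that could drive the k10-config layer. The names, path and health check are illustrative assumptions based on this layout; the key bit is dependsOn, which holds the config back until the base k10 kustomization is healthy, so the CRDs exist before the config is applied.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: k10-config            # assumed name
  namespace: flux-system
spec:
  interval: 10m
  path: ./k8s/manifests/kasten-io/k10/k10-config
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: k10               # assumed name of the base kustomization installing the helmrelease
  healthChecks:               # optionally wait for k10's API gateway before applying the config CRs
    - apiVersion: apps/v1
      kind: Deployment
      name: gateway
      namespace: kasten-io
```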
Install¶
I have taken the route of:
- Installing the helm release
- Setting up with the UI
- Using the nifty UI helpers that let you copy a `kubectl` command to view most everything you can configure
- Putting these into YAML in my github and letting flux reconcile over the top of it
Gitops risk
Having flux reconcile the config does mean any changes you make in the UI need to be put back into your gitops repo ASAP to avoid flux reverting them. This approach has pros/cons.
This is only an example for explanation purposes
Check my cluster (or others') for up-to-date examples. More helm values info can be found at https://docs.kasten.io/latest/install/advanced.html
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: k10
  namespace: kasten-io
spec:
  releaseName: k10
  interval: 5m
  chart:
    spec:
      chart: k10
      version: 4.5.12
      sourceRef:
        kind: HelmRepository
        name: kasten-charts
        namespace: flux-system
      interval: 5m
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
  values:
    eula:
      accept: true # (1)
      company: Truxnell # (2)
      email: [email protected] # (3)
    global:
      persistence: # (4)
        storageClass: ceph-block
    auth: # (5)
      tokenAuth:
        enabled: true
    clusterName: hegira # (6)
    ingress:
      create: true # (9)
      host: k10.${EXTERNAL_DOMAIN}
      annotations:
        kubernetes.io/ingress.class: traefik
        traefik.ingress.kubernetes.io/router.entrypoints: websecure
        traefik.ingress.kubernetes.io/router.middlewares: network-system-rfc1918-ips@kubernetescrd
        hajimari.io/enable: "true"
        hajimari.io/icon: backup-restore
        hajimari.io/appName: "K10"
        hajimari.io/url: "https://k10.${EXTERNAL_DOMAIN}/k10/"
      urlPath: "k10" # (7)
      hosts:
        - k10.${EXTERNAL_DOMAIN}
      tls:
        enabled: true
    grafana: # (8)
      enabled: false
- EULA must be 'manually' accepted
- 'Company name' - required
- Email for license - a github no-reply email is useful here - required
- Note we aren't using an existing PVC and are letting ceph do as it pleases here - as we have a disaster recovery backup it's not important to control this
- Tokenauth creates a token to log in to k10 - helpful as it's an easy login method
- Name of the cluster
- Note this url path - it is a subpath - so my k10 is located at https://k10.hegira.domain.tld/k10
- Grafana can be installed, but I want to use my central instance, not k10's.
- Worth noting this key is required for ingress
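The HelmRelease above pulls the chart from a kasten-charts HelmRepository in flux-system, which isn't shown here. If you don't already have one, a minimal sketch would look something like the following (the interval is arbitrary; the URL is Kasten's public chart repo):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: kasten-charts
  namespace: flux-system
spec:
  interval: 30m
  url: https://charts.kasten.io
```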
Disaster recovery key¶
k10 Disaster recovery can be enabled in settings -> K10 Disaster Recovery in the UI. When you enable this, you will be asked for a passphrase and given a cluster id.
Doing a full DR from a clean install requires the passphrase to be placed in a secret before install. I chose to put this in a SOPS secret which is loaded with the k10 helmrelease, so it is always on my cluster ready to go. I also put the clusterID in the secret - k10 does not need it, but it's there ready to go in an easy-to-find spot.
You can find this clusterId if you have access to the remote DR backup folder - the folder is named after the clusterID.
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: k10-dr-secret
  namespace: kasten-io
stringData:
  key: <super secret password> # (1)
  clusterId: <cluster id> # (2)
- Password given at setup of DR - it is used as the encryption key for the DR.
- ClusterId given in the UI. Looks like 09832d646-209d-438c-95bc-0fgfa5ac6d93
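The DR export itself (like any policy export) lands in whichever location profile it points at - the NFS/S3 targets in the profiles/ folder above. For orientation, a location profile is just another CRD; below is a rough sketch of an S3-compatible one. Every name, bucket, endpoint and region here is a placeholder, and the exact schema can vary between K10 versions, so export your own from the UI rather than copying this.

```yaml
apiVersion: config.kio.kasten.io/v1alpha1
kind: Profile
metadata:
  name: backblaze-b2          # placeholder name, mirroring the folder layout above
  namespace: kasten-io
spec:
  type: Location
  locationSpec:
    type: ObjectStore
    objectStore:
      objectStoreType: S3
      name: my-k10-bucket                              # placeholder bucket
      endpoint: https://s3.us-west-000.backblazeb2.com # placeholder endpoint
      region: us-west-000                              # placeholder region
      skipSSLVerify: false
    credential:
      secretType: AwsAccessKey
      secret:
        apiVersion: v1
        kind: Secret
        name: backblaze-b2-secret                      # the SOPS secret in profiles/
        namespace: kasten-io
```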
Prometheus federation¶
k10's prometheus can be federated back to a main prometheus using a ServiceMonitor. Below is an example configuration that will scrape the k10 prometheus. k8s/manifests/kasten-io/k10/k10-config/ has both the ServiceMonitor and a PrometheusRule for alerting; a sketch of the latter follows the ServiceMonitor below.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: k10
  namespace: kasten-io
spec:
  namespaceSelector:
    matchNames:
      - kasten-io
  selector:
    matchLabels:
      app: prometheus
  endpoints:
    - port: http
      scheme: http
      path: /k10/prometheus/federate
      honorLabels: true
      interval: 15s
      params:
        "match[]":
          - '{__name__=~"jobs.*"}'
          - '{__name__=~"catalog.*"}'
Login¶
Read the docs
The k10 docs provide what is needed here https://docs.kasten.io/latest/access/authentication.html#token-authentication
I chose to be lazy and just use the token auth that is provided. I believe others use SSO solutions; feel free to choose your own approach (and effort).
# Assume K10 is installed in the 'kasten-io' namespace
# Extracting token from SA 'my-kasten-sa'
# Get the SA secret
sa_secret=$(kubectl get serviceaccount my-kasten-sa -o jsonpath="{.secrets[0].name}" --namespace kasten-io)
# Extract the token
kubectl get secret $sa_secret --namespace kasten-io -o jsonpath="{.data.token}{'\n'}" | base64 --decode
The token output to the console can just be pasted in as the password in the UI.
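One gotcha worth noting: on Kubernetes 1.24+ service accounts no longer get a long-lived token secret created automatically, so the jsonpath lookup above can come back empty. One workaround (my assumption, not something the k10 docs prescribe) is to create the token secret yourself and decode that instead, reusing the same SA name as above:

```yaml
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  name: my-kasten-sa-token
  namespace: kasten-io
  annotations:
    # Kubernetes populates this secret with a token for the named service account
    kubernetes.io/service-account.name: my-kasten-sa
```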
Restoring a pvc¶
The general workflow for a gitops cluster can be described as:
- Scale down the relevant deployments to 0 (to release the PVCs)
- In k10 UI, select the desired backup in the Applications section
- Select 'Data-only restore' (this only restores the pvc data and no configuration)
- Check progress of restore in dashboard
- Scale up deployment
Disaster recovery¶
Learn from my mistakes
I'm writing this because I force-deleted a kustomization, which led to my entire cluster being wiped. k10 saved my rear end and the DR worked fine. Be careful deleting kustomizations!
Read the docs
The k10 docs provide what is needed here https://docs.kasten.io/latest/operating/dr.html
As long as a DR policy is activated and the passphrase and clusterId have been retained, recovery is easy and safe.
To recover, I bootstrap the cluster fresh and let everything come up. This brings up everything, including a blank k10, and also installs the required secret `k10-dr-secret` as noted in 'Specifying a DR Passphrase' in the above link.
Then we can run the disaster recovery helm install, checking and replacing the namespace, clusterID and profile name below (likely `k10-disaster-recovery-policy`).
# Install the helm chart that creates the K10 restore job and wait for completion of the `k10-restore` job
# Assumes that K10 is installed in the 'kasten-io' namespace.
helm install k10-restore kasten/k10restore --namespace=kasten-io \
    --set sourceClusterID=<source-clusterID> \
    --set profile.name=<location-profile-name>
Ensure k10 is up to date
Ensure k10 is up to date - having an older version and then installing the latest DR helm chart can lead to funky and undesired results.
This will take a few minutes or more and will reboot most of k10 at least once. The UI will display a 'recovery in progress' and finally a 'complete' message.
Cluster id changes
It appears my clusterID changed after a successful restore, so this may be worth checking after a successful DR. I have had to update my clusterId to reflect the 'new' cluster.