
Kubernetes overview architecture

Created: 2020-10-01 16:22:56 -0700 Modified: 2022-01-30 10:23:46 -0800

Kubernetes is orchestration software for containers. It can scale many containers up or down across an infrastructure and handle communication between them, storage, services, and more. Formally, the features of Kubernetes are (reference):

  • Service discovery and load balancing
  • Self-healing
  • Secret and configuration management
  • Storage orchestration: you can use local or cloud-provided storage
  • Automated rollouts and rollbacks: this is done by describing the desired state
  • Automatic bin packing: Kubernetes will figure out how best to pack your containers based on resource needs

As for usage, the biggest learning I had while streaming this was that many people tried to adopt Kubernetes before they really needed it. Kubernetes can handle a lot of complexity, but that doesn’t mean that a small-scale application or service should necessarily use it.

Pronunciation note: no one says it the “official” way.

This has some overlap with containerization in general:

  • Physical machines: when people were running on physical machines, applications could not easily be isolated, and it was harder to effectively fully utilize the resources of the machine.
  • Virtual machines: this allowed for isolation and better utilization, but the operating system needed to be “duplicated” in each VM
  • Containerization: the container runtime helped each container share the OS in a lightweight way. Containers are essentially hardware- and OS-agnostic, so they can be moved between nodes easily.
    • The formal phrasing of this is that containers are decoupled from the underlying infrastructure.
  • Node: worker machines that run containerized applications
  • Cluster: a group of nodes (at least one node per cluster is required)
  • Pod: a set of running containers on a node modeling an application-specific “logical host”. A pod cannot span nodes. A pod has a definition for how to run its containers.
  • Service: an abstract way to expose a running application on a set of pods (which all provide the same functionality (reference)) as a network service. The operative word is “expose”: you’re making the application available on the network, even if only internally to your cluster (reference). There’s an easy-to-understand explanation of these concepts here. In short, end users connect to this service abstraction, not to pods.
    • [10:15] atomicnibble: services are resolved via DNS, so when something wants to talk to a service, it queries DNS. This is used so traffic can be routed to nodes that are running the service / new pods can be added without reconfiguring other applications.
    • Endpoints: these track the IP addresses of pods (and ports) with matching selectors. They’re typically managed by Services.
  • Namespace: a way of supporting multiple virtual clusters within your physical cluster.
  • Ephemeral container: a short-lived container initiated by the user in a pod, e.g. to troubleshoot something too difficult to see externally.
  • KEP: Kubernetes Enhancement Proposal
  • IPVS: short for “IP Virtual Server”, this is part of the Linux kernel responsible for doing transport-layer load balancing.
  • Custom Resource Definition (“CRD”): a way of extending the Kubernetes API Server if it doesn’t meet your needs, e.g. make your own type of object like “SpacePod” and manage it just like how regular Pods are managed.
  • DaemonSet: this ensures that a copy of a pod is running on a set of nodes as opposed to just on one node. This does not mean that the pod itself spans those nodes; it’s a copy of the pod (i.e. the same pod definition is used for each node). This is typically used for high availability (or for critical node services like kube-proxy).
  • Control Plane (reference, picture reference): the orchestration layer that defines APIs and interfaces to define, deploy, and manage the lifecycle of containers. This manages the worker nodes and pods in a cluster. In production, this usually runs across multiple computers.
    • [09:48] atomicnibble: @Adam13531 The control plane basically handles applying desired state, if the plane is down the cluster can still accept traffic but state changes go unnoticed. So it won’t notice if a node is offline and will still send traffic to it etc.
    • [09:55] atomicnibble: If you get a managed k8s service everything in the control plane is hidden / done for you.
    • The control plane is essentially everything that’s not your application. While it could technically run on the same machine with your application’s pods, that’s not typical.
      • If you only had three nodes in a system (Kubernetes’ suggested minimum for a production system), then the control plane would run on all three so that a consensus can be reached.
    • Control-plane components
      • API Server: this is the front-end for the control plane that exposes the Kubernetes API. The main implementation is kube-apiserver.
        • The API itself is a REST API (i.e. it’s over HTTP). There are different popular interfaces like kubectl and kubeadm.
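        • As a quick sketch of the REST nature of the API: you can run kubectl’s local proxy and hit the API over plain HTTP (the namespace and resource path shown are the standard defaults).
# Start a local proxy to the API Server (authentication is handled for you)
kubectl proxy --port=8001
# In another terminal: list pods in the default namespace over plain HTTP
curl http://localhost:8001/api/v1/namespaces/default/pods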
      • etcd: key-value store consistent across the cluster (i.e. a distributed database). It is backed by Raft (reference), a consensus algorithm. Updating this does not require rebuilding images or redeploying; containers will simply see the updated values. etcd stores pretty much everything you see when you do “kubectl get xyz” (reference). Nodes ask etcd for values when they’re needed (as opposed to etcd broadcasting changes to all nodes).
      • Scheduler (e.g. kube-scheduler): control-plane component which determines which node to run new pods on
      • Controller: control loops that try to shift the current state toward the desired state. When this state has to do with your cluster, it’s typically done via the API Server (although not always). For more information about specific controller processes, there’s a small blurb here.
        • Object: the desired state of your application is represented via these objects. They are persistent entities in the Kubernetes system.
      • Controller Manager (e.g. kube-controller-manager): control-plane component which runs controller processes (e.g. the node controller or the replication controller). Logically, these are all separate processes, but to reduce complexity, they’re all run in a single process.
        • Also note that the nodes themselves only talk to the API Server, so if the Node Controller needs to ask for the health status of a node, it will go through the API Server.
      • Cloud Controller Manager (optional): the main point of contact for the cloud platform that you’re running on (e.g. to provision new nodes or load balancers). Just like with the “regular” controller manager, this runs several controllers in a single process.
  • Node components (these run on every node)
    • kubelet: an agent that ensures that containers are running in a pod and that they’re healthy. It’s essentially just the point of contact for the control plane.
    • kube-proxy: a network proxy that implements part of the Kubernetes Service concept. It maintains network rules for traffic both inside and outside of the cluster either via the operating system’s “network primitives” itself or its own forwarding rules.
    • Container runtime: this is what’s needed in order to run the containers themselves, e.g. Docker
  • Strive to be Kubernetes-agnostic: your application should not know or care that it’s running inside of Kubernetes specifically. As such, your application shouldn’t call into the API Server (reference).
  • Not all objects have to have a namespace (reference)
  • Requests themselves can be run in specific namespaces via kubectl (reference)
  • Names of objects only have to be unique per object type within a namespace
  • Namespaces can also be used to divvy up resources from a technical perspective, e.g. that way Namespace A can only use 30% of the resources
  • Prefer the use of labels over namespaces when just differentiating versions of the same software
  • Labels are just key/value pairs that can be anything you want. They’re used for anything from versions to departments to release tracks (reference).
  • They’re not unique.
  • Label selectors are used to identify objects and are a core grouping primitive of Kubernetes. For example, you may say “give me all of the ‘canary’-tagged objects”.
    • AND vs. OR: selectors can have multiple conditional clauses that are AND’d together. There is no OR operator (reference).
    • Definition: you define selectors separately from labels in the YAML/JSON, and the selector doesn’t have to be based on an explicitly defined label.
    • Equality vs. sets: Selectors are either based on equality (equals or not equals) or sets (in, not in, or exists).
    • Field selectors: if you want to select based on other fields in an object (i.e. not labels), you can use field selectors.
    • Don’t update selectors (reference): although you could if you needed to
    • Blank/empty selectors (reference): the semantics of this differ based on the context, so it doesn’t always mean “select every pod”.
  • If you want to attach metadata to an object without ever being able to select the object based on it, then you should use an annotation.
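  • Tying the label, selector, and annotation bullets above together, here’s a minimal sketch (all names and values are made up): a pod carries labels and an annotation, and a workload’s selector matches it with both equality-based and set-based clauses.
# Pod metadata: labels are selectable, the annotation is not
metadata:
  name: my-app-pod
  labels:
    app: my-app
    track: canary
  annotations:
    build-commit: "abc123"

# Selector in a workload spec (e.g. a Deployment); clauses are AND'd together
selector:
  matchLabels:
    app: my-app                 # equality-based
  matchExpressions:
  - key: track
    operator: In                # set-based: In, NotIn, Exists, DoesNotExist
    values: ["canary"]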

There are three ways to manage objects:

  • Purely imperative (reference) - you use a series of commands to alter the state of live objects (this is basically just using kubectl for everything)
  • Imperative + declarative (reference) - you define configuration files per object and run a command to set that as the desired state (this is like using kubectl with the “-f” option to specify a file)
  • Purely declarative (reference) - you define folders of YAML/JSON files that specify the entire desired state (i.e. all objects) and kubectl detects how to reconcile that configuration
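  • As a rough sketch of the three styles with made-up file and object names (real workflows vary):
# Purely imperative: commands mutate live objects directly
kubectl create deployment my-app --image=my-app:1.0
kubectl scale deployment my-app --replicas=3

# Imperative + declarative: per-object config files, applied explicitly
kubectl create -f my-app-deployment.yaml
kubectl replace -f my-app-deployment.yaml

# Purely declarative: a folder of YAML is the desired state; kubectl reconciles
kubectl diff -f configs/
kubectl apply -f configs/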

The benefits of going with declarative management are described here. A quick summary in case that link breaks:

  • Clearly represent intended state: no need for code, HTTP calls, etc.
  • No need for a domain-specific language (DSL): the configuration is just YAML/JSON, so you can generate it from JavaScript, Go, Fortran, or whatever you want
  • Can be statically analyzed/validated
  • Easy to develop CLIs/UIs for a configuration format

My assumption about why tools like kpt and kustomize (and maybe Helm) exist is mostly for simplicity’s sake; the purely declarative style of managing objects was probably not easy enough for people.

This section is about practical usage of Kubernetes.

  • If you’re using the Distroless image (because it’s small and only has exactly what you need), then you won’t have a shell, so debugging can be tough. In early 2022, ephemeral containers were introduced in beta, so “kubectl debug” can let you gain access to the container.
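  • As a sketch (the pod and container names are hypothetical), the ephemeral-container flow looks roughly like this:
# Attach an ephemeral debug container with a shell to a running pod
kubectl debug -it my-distroless-pod --image=busybox:1.28 --target=my-app-container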
  • By default, nodes auto-register themselves with the API Server via the kubelet. However, you can also manually register nodes, probably just for test cases (or potentially some uncommon edge case).
  • Regardless of how a node is registered, if it ever becomes unhealthy, then the control plane will continually check for it to become healthy.
  • A node’s status contains four major pieces of information:
    • Addresses, e.g. external/internal IP addresses
    • Conditions, e.g. whether the node is ready for pods to be scheduled, whether the disk or memory is running out
    • Capacity and allocatable, e.g. total resources (“4 GB RAM”) and how much is left to be allocated (“2 GB remaining”)
    • Info, e.g. the OS version, Kubernetes version, etc.
  • Nodes themselves send heartbeats in two forms: via updating NodeStatus or via the Lease object (which is a way to improve performance as the cluster scales). The status updates happen when there’s a change in status or every 5 minutes (by default).
    • For more information about how the Lease object works, check out this enhancement on GitHub. It used to be that nodes would report NodeStatus regardless of whether there were meaningful changes. Now, they’re only sent when there’s a change, and the Lease object is more of a ping just for healthiness than to include much about the status itself.
  • Nodes in a cluster only ever communicate to the control plane via the API Server. The intent is for your cluster to be able to be run on untrusted, public machines. As such, authentication and authorization is done to the API Server over HTTPS.
  • Spec is desired state: the “spec” field for objects indicates the desired state of the object (reference).
    • The desired state may never be reached (or it could be a moving target). That’s okay as long as controllers are constantly working toward it (reference).
  • Communication with or without API Server: I believe that all built-in controllers only interact with the API Server to converge on the desired state, but controllers in general could make changes by themselves. Regardless, when changing the state of the nodes themselves, the only communication is through the API Server.
  • Small and specific: in general, controllers are intended to be small and specific. A resource created by a controller should only ever be deleted by that same controller.
    • Resources potentially owned by different controllers (reference): when multiple potential controllers could manage a resource, the official manager is typically set via labels, e.g. EndpointSlices get a “endpointslice.kubernetes.io/managed-by” label.
  • Custom controllers can be run outside of the control plane (reference): just like a containerized application, you can run a controller on your nodes, e.g. in a Deployment.
  • Updating images (reference)
    • The imagePullPolicy specifies when to update container images.
      • tomcantcode: this happens a lot for me at work re: imagePullPolicy: 1) the pod is running “myimage:dev” 2) you build and push a new “myimage:dev” tag 3) if imagePullPolicy is Never, then when you delete and restart the pod, you won’t see the new container you just built 4) if imagePullPolicy is Always, it will pull the new “myimage:dev”
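    • A minimal sketch of where imagePullPolicy lives in a container spec (the image name is made up):
containers:
- name: my-app
  image: registry.example.com/myimage:dev   # hypothetical image
  imagePullPolicy: Always                   # Always | IfNotPresent | Never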
  • Images’ architecture (e.g. ARM vs. AMD64) needs to match the container runtime’s architecture since the CPU isn’t virtualized (reference).
  • Container runtime (reference) (AKA “the Kubernetes Container environment”)
    • The runtime provides cluster information to each container in the form of environment variables representing the host and port of each service running (reference).
  • RuntimeClass (reference)
    • This is a mechanism for selecting a configuration for a particular container, e.g. for selecting or modifying a particular container runtime (i.e. “tweaking Docker”).
      • This has Scheduler support (reference), i.e. you can make sure that the scheduler will take RuntimeClass into account.
  • Container lifecycle hooks (reference)
    • This is a way to run code when a particular thing happens to a container, e.g. it just started (PostStart) or it’s about to stop (PreStop). This code can be run either via an executable (or script) or by calling into an HTTP API hosted by the container.
    • The hooks are required to complete before transitioning into a new state. For example, without the PostStart hook finishing, the container cannot enter the “running” state.
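    • A minimal sketch of both hooks on a container (the script and the endpoint are made up):
containers:
- name: my-app
  image: my-app:1.0
  lifecycle:
    postStart:
      exec:
        command: ["/bin/sh", "-c", "echo started > /tmp/started"]
    preStop:
      httpGet:
        path: /shutdown        # hypothetical endpoint served by the container
        port: 8080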
  • Pods instead of containers as a primitive: Kubernetes manages pods, not containers, even though having one container per pod is by far the most common scenario (reference). The other scenario, running multiple containers in a pod, is typically when the containers themselves are tightly coupled. This is why containers within a pod are co-located.

    • Sharing resources: containers within a pod can share resources (e.g. storage, networking, or even semaphores and shared memory space) and dependencies and communicate with one another. This is similar to how docker-compose works (docker-swarm is similar to docker-compose except that it’s across multiple machines).
      • Within a pod, containers can connect to one another using localhost.
  • Horizontal scaling == replication: scaling pods horizontally involves having multiple copies of the same pod running. This is typically referred to as replication.

    • Scaling with no downtime: something that I was curious about before learning Kubernetes: “for my game, if I have a single server container running and traffic spikes, how would Kubernetes scale up with no downtime?” The answer is essentially “replicas” in that you would have multiple pods ready at any given time to handle traffic spikes. This is as opposed to some kind of secret sauce that could potentially let containers start almost instantaneously, although that kind of reduction in start-up time would certainly be beneficial!
  • Hostnames: the hostname of a container in a pod is set by Kubernetes to be the name of the pod (reference). It’s available via the “hostname” command (it lives in /etc/hostname) or via the environment variable “HOSTNAME”. An example hostname is “kubernetes-bootcamp-765bf4c7b4-7rth6”, where “kubernetes-bootcamp” is the name of the deployment, 765bf4c7b4 is the pod-template-hash (identifying the underlying ReplicaSet revision), and 7rth6 is a randomly generated suffix (reference (see step 1)).

  • Many pods on one node: multiple pods can run on a single node. This is where the automatic bin packing comes in (reference).

  • Init vs. app: “init containers” (reference) are containers that run to completion before the pod’s regular “app containers” start up.

    • Rationale for wanting init containers: reference
    • Middleware-esque: each init container runs sequentially and must complete before the next step of the process can run (either the next init container or the app containers).
    • Restarts/failures: init containers are basically just like regular containers in this regard; they’ll follow the restartPolicy (reference). As a result, init containers should be idempotent.
      • restartPolicy affects all containers in a pod.
      • The three restart policies are Always, OnFailure, and Never (reference).
      • When the pod is part of a resource like a Deployment, the restartPolicy can only ever be Always (reference). This should make sense; if the restartPolicy were to be OnFailure or Never, then at start-up time, it’s possible that the Deployment could be held up until it times out. Afterward, the Deployment Controller would have one means of recourse, which would be to restart the whole Pod. That’s why the restartPolicy has to be Always in the first place.
      • If the whole pod restarts (for these reasons), all init containers are run again (this should be obvious since init containers should be required for the app containers to run).
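    • A minimal sketch of an init container that blocks the app container until a dependency resolves (the service and image names are made up):
spec:
  initContainers:
  - name: wait-for-db                      # runs to completion before app containers start
    image: busybox:1.28
    command: ["sh", "-c", "until nslookup my-db-service; do sleep 2; done"]
  containers:
  - name: my-app                           # starts only after the init container succeeds
    image: my-app:1.0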
  • PodTemplates: PodTemplates are specs for creating pods. They are part of workload resources like Deployments, Jobs, and DaemonSets, which are managed by specific controllers.

  • Don’t use naked pods (reference): it’s much more common to use a Deployment or a Job.

  • Pod lifetime: a pod is only scheduled once in its lifetime (reference) and remains on a node until one of the following happens (reference):

    • The pod finishes execution

    • The pod object is deleted

    • The pod is evicted for lack of resources

    • The node fails

    • As such, pods are never “relocated” after being scheduled.

    • Despite this, pods are considered to be more ephemeral than durable (reference).

    • Container restarts: containers can be restarted within a pod by the kubelet. This ownership of responsibilities is sort of used to represent that pods can’t self-heal (reference).

  • Static pods: static pods are pods that are specific to a particular node and are not controlled by the control plane. Instead, the kubelet on the node will supervise the static pod, and a mirror copy will exist in the control plane just for visibility.

  • Status reports: just like with Nodes, Pods report their status as a set of conditions.

  • Pods live on one node: pods are never moved to other nodes (reference)

  • Topology spread constraints (reference): these are used to spread your pods out across failure domains (i.e. regions, zones, and nodes).

    • This feature is for high availability and for effectively utilizing resources.
    • These constraints rely on your nodes being labeled with things like a hostname, a zone, and a region.
  • Pod presets (reference): these presets provide information to pods at creation time, e.g. environment variables, volumes, volume mounts, and secrets. These are applied via label selectors (reference).

  • Ephemeral containers (reference): these are user-injected containers, e.g. for troubleshooting something too hard to see externally. They are not guaranteed resources when it comes to scheduling, so it’s possible that they can’t even start.

  • Pod priority and preemption/eviction (reference): Pods can be given a priority. If the scheduler can’t find a suitable node for a pod to be run on, it will try evicting lower-priority Pods to make room.

    • PriorityClass (reference): these are essentially just names for priority values, e.g. “high-priority” can refer to the value 1000000. There are two built-ins: system-cluster-critical (2000000000) and system-node-critical (2000001000).
      • Non-preempting (reference): a PriorityClass can also specify whether it can preempt other pods. A non-preempting pod means it will never cause an eviction when it fails to be scheduled.
    • Preemption rules (reference): there are a lot of criteria to pod preemption and many caveats that arise from the concept as a whole. The reference page has much more information, but quick examples are: you can’t exceed your PodDisruptionBudget, pods still get their graceful termination period, higher priority pods may preempt a pod that is already trying to preempt other, even-lower-priority pods.
  • Probes are diagnostic actions performed by the kubelet on a container.
  • Types of probes
    • There are three types of probes that can be run on a container regardless of whether the container’s status is officially “Running”:
      • ExecAction: executes a command. An exit code of 0 indicates success.
      • TCPSocketAction: tries to open a TCP connection at the specified port. Opening the port indicates success.
      • HTTPGetAction: performs a GET request on the specified address and port. An HTTP status code from 200 up to (but not including) 400 indicates success.
    • There are three more kinds of probes that can only be run on containers with a “Running” status (reference). When any of these probes fail, either the kubelet or the control plane will take some action to remedy the situation (e.g. restarting the container or taking it out of service).
  • Probes can result in Success, Failure, or Unknown
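  • A sketch of how probes appear on a container (the paths, ports, and file names are made up):
containers:
- name: my-app
  image: my-app:1.0
  livenessProbe:               # HTTPGetAction: 200-399 means healthy
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
  readinessProbe:              # ExecAction: exit code 0 means ready
    exec:
      command: ["cat", "/tmp/ready"]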
  • Wrap ReplicaSets: in the real world, apparently you usually manage a ReplicaSet indirectly through a Deployment (reference, reference2). Each time a new Deployment is observed by the Deployment Controller, a new ReplicaSet is created to bring up the desired pods, and all old ReplicaSets in the Deployment are scaled down to 0 pods (reference).
    • Multiple updates at once (reference): if you make a change while one is already in progress, the in-progress change does not have to run to completion for the new change to start taking effect. The reference link has an example.
  • Pod template (reference): it’s essentially the exact same as a Pod’s “spec” field (barring tiny changes) when used in a workload resource’s spec, where a “workload” is a Deployment, Job, or DaemonSet (reference).
  • Rollout triggering: a rollout is only triggered if parts of the pod’s template are changed, not if the number of replicas is changed (reference).
    • Rollbacks (reference): rollouts can always be reverted via “kubectl rollout undo DEPLOYMENT_NAME” (reference (step 3)).
  • Ability to pause (reference): you can pause deployments so that you don’t trigger unnecessary rollouts.
  • Failures (reference): deployments can fail for any number of reasons. This is why progressDeadlineSeconds can be helpful for a deployment’s spec.
    • Resetting the deadline: I believe, based on this code, that any progress will “reset” that deadline.
    • Resolving failures: the documentation isn’t super clear on this, but I believe that if you resolve a failure yourself (e.g. you didn’t have enough quota, so you provision more nodes), then the Deployment Controller will pick up on this and automatically resume the rollout.
    • ”Failures” after startup: assuming that the Deployment has successfully started, then you may run into a situation where a pod repeatedly fails, and since pods are required to have a restartPolicy of Always when they’re in a Deployment, you may hit CrashLoopBackOff (reference). This is easy to detect with “kubectl get pods”, but there’s no way for Kubernetes to automatically fix this since it already tried restarting the pods some number of times (based on your configuration).
  • Strategy (for replacing old pods with new ones) (reference):
    • Recreate (reference): kill all old pods before new ones can be created.
    • RollingUpdate (reference): create new pods before killing off the old pods.
      • maxUnavailable (reference): the number of pods below the desired number of pods that are allowed to be unavailable. It’s better understood by thinking of this as “100% - maxAvailable” (maxAvailable not being a real Kubernetes concept, though). For example, if maxUnavailable is 30% and you have a desired number of 10 pods, then you will always have 7 pods available between the old and new ReplicaSets.
      • maxSurge (reference): the number of pods beyond the desired number of pods that are allowed. E.g. a desired number of 10 and maxSurge of 30% would allow for 3 more pods.
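    • A sketch of a Deployment using the RollingUpdate numbers above (names are made up):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: "30%"    # at least 7 of 10 pods stay available
      maxSurge: "30%"          # at most 13 pods exist during the rollout
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:1.1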
  • Pod scaling
    • Scaling vs. auto-scaling: scaling is done via ReplicaSets and/or auto-scaling policies.
      • Scaling (← just “regular” scaling for now, not auto-scaling): a Deployment wraps a ReplicaSet and has a “replicas” field that lets you specify how many instances of a particular pod you want (reference).
      • Auto-scaling: auto-scaling will let you scale based on criteria (reference), e.g. “try to maintain 80% CPU utilization across 10 to 15 pods” (see the command sketch below).
    • Scaling to 0: scaling to 0 is possible and will terminate all pods in the specified Deployment (reference)
    • Scaling via kubectl (reference): kubectl scale deployment DEPLOYMENT_NAME --replicas=10
    • Proportional scaling (reference): this is specifically for scaling a RollingUpdate Deployment that’s in the middle of a rollout. When new replicas are available, they’re added in such a way that they favor ReplicaSets with more replicas in them.
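    • The scaling commands above, sketched with a made-up deployment name:
# Manual scaling
kubectl scale deployment my-app --replicas=10
# Scale to zero (terminates all pods in the Deployment)
kubectl scale deployment my-app --replicas=0
# Auto-scaling between 10 and 15 pods, targeting 80% CPU utilization
kubectl autoscale deployment my-app --min=10 --max=15 --cpu-percent=80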
  • Cluster scaling (reference): the cluster can be automatically resized based on your needs. For example, if there are pods that can’t be scheduled and adding a new node would help, then the cloud platform will provision another node. There are configuration options for controlling the minimum/maximum and whether cluster autoscaling is even enabled.
    • Zone-based provisioning: I didn’t find any documentation specific to how exactly these new nodes are spun up (e.g. what if you eventually need 9 new nodes, 3 in each availability zone that you run in?).
  • Use case (reference): you don’t tend to interact with ReplicaSets directly, instead favoring Deployments. Also, ReplicaSets are advised over just bare pods so that the pods’ lifecycle can be managed by Kubernetes.
    • Jobs vs. ReplicaSet (reference): use a Job instead of a ReplicaSet when your pods should terminate when they’re done with their work, e.g. a batch job (e.g. performing a back-up).
    • DaemonSet vs. ReplicaSet (reference): when your pods should be tied to the lifetime of the machine.
  • Acquisition and ownership of pods (reference): ReplicaSets have rules for how to acquire pods, then Kubernetes will automatically set the metadata.ownerReferences field so that it’s clear who owns what.
    • Be careful about which pods you select (reference): ReplicaSets can acquire pods that you manually created if they match the label selector of the ReplicaSet.
    • Changing labels to remove pods (reference): if you change the label of a pod, you can extract the pod from the ReplicaSet, e.g. for debugging or data recovery.
  • Use case (reference): StatefulSets are needed when you require persistence or ordering (where “ordering” means something like service dependencies; you have Service X that relies on Service Y, so you have to shut X down before Y—pods are created and deleted sequentially (reference)). Based on what people in chat said, it sounds like these aren’t often used, since state and ordering are generally good things to avoid when making microservices. Apparently they’re mostly used for databases.
    • Persistence (reference): in order to support persistence, storage is not deleted when pods are scaled down. Also, provisioning that storage is either done by an admin or via PersistentVolume Provisioner.
  • Use case (reference): when you want to ensure that some or all nodes run a particular pod, you’d use a DaemonSet. These pods are typically tied to the lifetime of their underlying machines (since they’re typically for machine-related tasks). Sample use cases include cluster storage, log collection, or node monitoring, all of which would be done on every node.
    • Init scripts as an alternative (reference): DaemonSets can sort of be replaced by init scripts like “init” or “systemd”, but then you’re managing your DaemonSets completely separately from the rest of your application, so you don’t get a declarative style of writing them and likely have to bake them into the machine image that you’re using for your nodes.
  • Use case (reference): use a job when you want to ensure that a work item runs to completion. The success of the underlying pod (or multiple pods) indicates the success of the job.
    • Sample scenarios and how to make jobs for them: see “Job Patterns”
    • Single pod (reference): a common use case is to just have a single pod within the job. The job is complete when the pod terminates successfully.
    • Fixed completion count (reference) (AKA .spec.completions): X pods need to complete successfully for the job itself to complete. I’m not totally sure why you’d ever want this. Maybe it’s if you had X separate sub-tasks that all roll up to the same overall job, e.g.:
      • cube8021: Fixed completion can be used for Smoke testing like create 100 orders and make sure all 100 order complete successfully
    • Parallel jobs with a work queue (reference) (AKA .spec.parallelism): batch processing is an example of this. For example, suppose you have 1 million units of work, so you launch a job of consumer pods to handle that work. The pods would need a way of determining when the overall job is done, which in this case might be that the remaining number of work items is equal to zero. When any pod completes successfully, the job is marked as successful.
  • backoffLimit (reference): this specifies how many times pods can fail within the job before the job is considered a failure.
  • Completion (reference): when a job completes, it doesn’t delete the underlying pods, and the job object will stay around until “kubectl delete” is run on it. Deleting a job in that way will delete the underlying pods. Jobs can be cleaned up via a TTL mechanism (reference), and CronJobs can be cleaned up automatically after finishing (reference).
  • Be careful about pod selectors (reference): by default, Kubernetes will select pods that aren’t used by anything else. However, there’s a mechanism for selecting pods yourself. If you do that, be careful about how you select those pods so that your job doesn’t delete the wrong resources.
  • Basics
    • Each object can have any number of owners (including 0).
    • Some owners are set automatically (e.g. a ReplicaSet sets the pods to be owned by the ReplicaSet)
    • Owners are defined in metadata.ownerReferences.
    • Owned objects are called dependents.
    • Orphans are objects whose defined owners no longer exist. I think an object counts as an orphan if any of its owners no longer exist.
    • You can force orphans by specifying “--cascade=false” when deleting an owner.
  • Foreground cascading deletion (reference): the owner can’t be deleted until all dependents with ownerReference.blockOwnerDeletion=true are deleted. If there are no blocking dependents, then everything is probably deleted all at once.
  • Background cascading deletion (reference): the GC deletes the owner immediately and then deletes dependents.
  • Kubernetes networking: Kubernetes networking addresses four concerns
    • Containers within a pod use networking to communicate via loopback
    • Different pods communicate via cluster networking
    • Services let you expose pods outside of the cluster
    • Services also let you expose pods to one another within the cluster
  • Kubernetes networking vs. Docker networking (reference): by default, Docker uses host-private networking, meaning that containers can only talk to other containers that are on the same machine without allocating ports and forwarding traffic. In contrast, Kubernetes allows pods to communicate with one another regardless of which host they land on.
  • Service discovery: this is built-in to Kubernetes, meaning you don’t need to use an external service.
  • Proxying (reference): there are many kinds of proxies in Kubernetes, e.g. kubectl itself can act as a proxy from the administrator’s computer to the API Server, and the API Server’s built-in proxy connects external users to individual services via cluster IP addresses.
    • kube-proxy proxy modes (reference): these are linked to from the last page
      • User-space proxy: this uses iptables internally just to establish redirects from a virtual IP address to a port. However, the traffic itself is going through kube-proxy rather than just pointing one address at another like the “iptables” option below would do.
        • This is sort of like a reverse proxy in that a different back-end can be tried if the first one fails or refuses the connection for whatever reason.
        • User-space proxies can have logging performed.
      • iptables: this is sort of like a glorified hashmap lookup: given an address and some very basic rules, you’ll end up with the address of the destination node. This is generally very fast since the packets never enter user space, but that also means you lose the user-space proxy’s ability to retry a different back-end.
      • IPVS (reference): used for transport-layer load balancing. This is beneficial over iptables since iptables starts to crumble with a large number of records.
  • PROXY protocol (reference): while I was streaming, I asked “what is the PROXY protocol and why is it good for Kubernetes?” and got this response (only edited for English):

[09:49] atomicnibble: Oh the PROXY protocol is great for k8s.

[09:50] atomicnibble: The main benefit is that you can add info to an incoming connection without having to terminate the SSL. So, for example, you can add the external IP at the LB even though some service inside your cluster will terminate the SSL, so the LB can’t inject HTTP headers, etc.

[09:51] atomicnibble: Yes I use PROXY protocol [the X-Forwarded-For header]

[09:52] atomicnibble: But if you terminate all SSL at LB you have no need for PROXY protocol.

  • Who is doing the load-balancing? (reference) - the load balancing is most typically done by your cloud provider (e.g. ELB on AWS). However, Ingress can provide load balancing (reference), so with a custom Ingress Controller, you could use something like nginx (reference, reference2, reference3). External traffic can then be routed through the load balancer.
    • Performance implication: suppose you get 100k requests per second to your cluster. I’m pretty sure that employing load balancing via an Ingress Controller means that those requests will be funneled through the control plane before they get to your load balancer. I couldn’t find resources about any performance implications here or whether using a cloud provider’s load balancer would be any better (since it presumably wouldn’t have to go through Kubernetes first?). However, since ingress rules define criteria for incoming traffic before routing to the back-end service, it means that there must still be some minimal processing done by the control plane before traffic reaches your load balancer.
  • Rationale (reference): pods themselves are temporary and may spin up or down frequently (bonus word for this: “fungible”). Services exist as an abstraction so that you never need to know exactly which pods are handling something. Instead, you simply refer to the service.
    • Non-pod back-ends (reference): services can abstract resources other than just pods, in which case you wouldn’t use a pod selector to define the service.
      • It sounds like this is mostly for edge cases, in which case you should look at the reference link.
  • Choosing pods for a service: this can be done via a label selector, but you can omit the selector in some specific cases (reference).
  • IP assignment (reference): Kubernetes assigns an IP (AKA the “cluster IP”) to the service itself, which is used by the service proxies. I believe this is specifically done via an Endpoint object.
    • Virtual IP: there is no physical network interface involved in creating a cluster IP, so that makes the cluster IP a virtual IP address.
    • No pod selector → no automatic endpoint: when you don’t provide a pod selector, an Endpoint object is not created automatically (reference). Thus, you must make an Endpoint object with an IP address, then your service will be accessible through the IP that you defined.
    • ExternalName + DNS (reference, reference2): you can map a service to an external DNS name, e.g. you could point to something like example.com which your system doesn’t even have to manage. In this case, the cluster DNS Service would return a CNAME record and redirection would happen at the [non-Kubernetes] DNS level.
      • At this point, since the service is referred to by a CNAME, it doesn’t even need to be managed by Kubernetes itself (e.g. a service without a pod selector (reference)).
  • Named ports (reference): ports can have names (although they don’t have to). Named ports are helpful as another abstraction layer; if you want to change a port number, you can change it in the service’s spec without having to update clients to use a different name. Named ports look like this (reference):
ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 9376
  - name: https
    protocol: TCP
    port: 443
    targetPort: 9377
  • ↑ As hinted at in the example, services can expose multiple ports (reference).

  • DNS

    • What is “regular” DNS? In any DNS, you provide a domain name (“example.com”) and get back an IP address (“1.2.3.4”). When I say “regular” DNS, I’m referring to what your average computer does to resolve something like google.com: it asks its configured DNS server (typically your ISP), and if that server doesn’t know the IP, it has a mechanism to go all the way to an authoritative root server to find out. Then, the various layers can cache the DNS record and [hopefully] abide by the record’s TTL to know when to refresh the record it has of the IP.
    • Kubernetes DNS (reference): Kubernetes has a one-way link to “regular” DNS in that you can still resolve “google.com” from inside a container, but a random person outside of your cluster won’t be able to resolve “my-service” to get to your cluster’s IP address. This is because kubelets tell individual containers to use the DNS Service as the authority, so all requests have a chance to be answered with a cluster-specific IP address.
    • Why not use regular DNS with round robin? (reference) Kubernetes does service discovery via proxying rather than relying on round-robin DNS because regular DNS could cause problems, e.g.:
      • Undue load on DNS when the TTL of records is low
      • Caching results for too long (or forever), thus returning stale IP addresses
    • kube-proxy flow (reference): kube-proxy listens for changes to Services and Endpoints via the API Server, then any incoming connections are proxied to the appropriate back-end pods (the mechanism by which this is accomplished is at the OS level (e.g. iptables)). I.e. the node itself needs to configure where each individual container’s IP should go so that communication runs smoothly.
  • Service discovery mechanisms (reference)

    • Environment variables (reference): pods automatically get environment variables referencing any active Services when they start up. If the Service starts after the pod, then the environment is not updated (nor would it really matter even if it did get updated since processes capture the environment at start-up time (reference)).
    • DNS (reference): this is done with an add-on to Kubernetes, but the Wikipedia article says that DNS is a mandatory feature (reference).
      • Using namespaces: pods within the same namespace can just access something like “my-service”, but pods in different namespaces must also specify the namespace, e.g. “my-service.my-namespace”.
  • ServiceTypes (reference)

    • ClusterIP: for internal-only services, this will expose the service on a cluster-visible IP
    • NodePort (reference): expose the service from each node on the specified port. The service will be accessible externally via <NodeIP>:<NodePort>. This automatically creates a ClusterIP Service underneath.
      • Uses (reference): from the docs: “Using a NodePort gives you the freedom to set up your own load balancing solution, to configure environments that are not fully supported by Kubernetes, or even to just expose one or more nodes’ IPs directly.”
      • --nodeport-addresses (reference): this lets you specify which IP addresses kube-proxy will route to the node [on which kube-proxy is running].
        • [09:40] atomicnibble: say you have a node with two NIC both with different IP but you only want NodePort to be exposed on one.
        • [09:41] atomicnibble: By default all network interfaces will work.
    • LoadBalancer (reference): this uses the cloud provider’s load balancer (meaning that it’s external). This will automatically create a NodePort Service (in most cases), which itself automatically creates a ClusterIP Service, meaning your service is routable from the outside world.
      • Load-balancing features (reference): Kubernetes doesn’t necessarily natively support all of the features of your cloud’s load balancer, e.g. persistent sessions or dynamic weights.
    • ExternalName (reference): maps the service to the specified name via a CNAME record, meaning no proxying is set up. In other words, when a container of yours asks for “my-service”, Kubernetes will simply return a CNAME record like “test.example.com”. The specification of an ExternalName does not result in any “regular” DNS records being updated (slightly clearer example than the docs here).
      • Note that ExternalName on its own, just being a CNAME, means that the address that it points to is always available externally even if the underlying service isn’t. For example, I can go make an A record right now in my DNS that points to 192.168.1.1, and everyone will be able to see that resolution, but no one can contact that IP on my network.
  • External IPs (reference): if you specify an external IP address when defining a service, then Kubernetes will know to route traffic sent to that IP address to the corresponding service.

  • Headless services (reference): when you set a service’s clusterIp to “None”, you get a headless service. The docs don’t do a great job of explaining exactly what these are, but this StackOverflow post does! Since there is no clusterIp assigned, when you query DNS for a headless service, you’ll get all IP addresses of any pods backing the service. This is useful if you want to connect to one, many, or all of them.

    • Normal services (reference): just to be clear, a “normal” service is just not a headless service.
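    • A minimal headless-service sketch (names are made up); a DNS query for it returns the pod IPs directly:
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None          # this is what makes it headless
  selector:
    app: my-app
  ports:
  - port: 80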
  • Overview (reference): this is routing traffic based on the node topology itself. For example, Node A may prefer routing traffic to itself, another node in the same rack, or perhaps another node in the same availability zone.
  • Mechanism (reference): the mechanism by which this works is just defining labels via the topologyKeys field in the spec by the priority you’d like to match them in. The match itself is just an equality check between the originating node’s value and any potential destination node (e.g. if your topologyKeys contains “kubernetes.io/hostname”, then the hostname of the source and destination would have to match).
    • Priority list (reference): the first match in your priority list is considered. If no match is found, the traffic is rejected unless “*” is listed as a possible match.
  • Overview: I don’t think the docs page discussed the “what” and “why” of EndpointSlices all that well, so here’s what I did learn:
    • Rationale (reference): originally (and possibly still to this day?), network endpoints for a Service were all tracked by a single Endpoints resource. This led to performance degradation.
      • [09:01] atomicnibble: @Adam13531 Endpointslices are like a opt in currently but may become default. Mainly because it’s a breaking change for some things so they wait for packages to update etc.
    • One-to-many (reference, reference2): individual Endpoints may translate into multiple EndpointSlices.
  • Overview (reference): Ingress is an API object used to manage external access to a Service, typically through HTTP/HTTPS (although if it’s not those protocols, you’d typically use a different ServiceType from ClusterIP (reference)). It can include load-balancing, SSL termination, and routing (e.g. sending “/users” to ServiceA and “/purchases” to ServiceB).
  • Ingress controller (reference): like practically every resource, an Ingress needs a controller. However, unlike most resources, the ingress controller does not automatically start as part of kube-controller-manager. Also, there are apparently caveats to learn for each ingress controller that you may use.
    • Cloud-specific controllers (reference): it looks like almost every ingress controller is cloud-specific (except for something like nginx, which is pretty popular). E.g. Amazon’s is used to set up an ALB.
  • Default back-end (reference): if incoming traffic doesn’t match host/path rules, then the traffic will be sent to the default back-end.
  • Resources instead of services (reference): you can specify a Resource instead of a Service as a back-end for an Ingress. For example, the reference link shows using a StorageBucket, which is a GCP-specific custom resource that will probably CRUD files for you (e.g. if you hit something like GET /icons, it would retrieve an image).
    • Note that the resource would be configured separately, so in the example given, it’s just routing to a StorageBucket, but what exactly that means would be defined elsewhere.
  • Path types (reference): Exact (the path matches exactly), Prefix (the path matches up to a ”/”, e.g. the “/users” prefix will match “/users/accounts”), ImplementationSpecific (you define something yourself). If traffic would match multiple paths, there are tie-breaker rules (reference).
  • IngressClass (reference): each Ingress should specify an IngressClass, which contains additional configuration for the controller implementing that Ingress.
  • Types of ingress (reference) (a fanout sketch follows after this list):
    • Single service: your Ingress always routes to the same Service
    • Fanout: your Ingress routes to different Services based on path (e.g. /foo vs. /bar) or host name (e.g. bar.foo.com vs. foo.bar.com) (this is name-based virtual hosting).
    • TLS (reference): you can secure traffic from the client to the Ingress using TLS, but then from the Ingress into the rest of your cluster, traffic will be in plaintext.
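    • A minimal fanout sketch pulling together path types and IngressClass (the host, back-end services, and nginx class are assumptions):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: fanout-example
spec:
  ingressClassName: nginx            # assumes an nginx ingress controller
  rules:
  - host: foo.example.com
    http:
      paths:
      - path: /users
        pathType: Prefix
        backend:
          service:
            name: users-service      # hypothetical back-ends
            port:
              number: 80
      - path: /purchases
        pathType: Prefix
        backend:
          service:
            name: purchases-service
            port:
              number: 80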
  • Overview (reference): if you want to control traffic flow (both ingress and egress), use a network policy. The entities between which it can control traffic are: pods, IP blocks (CIDR ranges), or namespaces. Note that you can also control this based on the combination of pod selector and namespace.
  • Pods are open (AKA “non-isolated”) by default (reference): a pod becomes isolated when it has a NetworkPolicy that selects it, and then it will not necessarily be accessible by every other pod thereafter.
  • NetworkPolicy is AND, not OR (reference): within a single from/to entry, the constraints you specify are ANDed together, e.g. an entry with both a namespace selector and a pod selector only allows traffic that matches both. Separate entries in the from/to list are ORed. (A sketch follows after this list.)
    • Default policy is wide open (reference): without changing the default (which is easy), all traffic will be allowed.
  • What you CAN’T do (reference): the bottom of the page includes a bunch of examples of what you can’t do.
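  • A sketch of the AND-within-one-entry behavior mentioned above (the labels and port are made up): traffic to the selected pods must come from a pod labeled app=frontend in a namespace labeled team=web.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend          # the pods this policy isolates
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:    # one entry: namespace AND pod selector
        matchLabels:
          team: web
      podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080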
  • /etc/hosts and HostAliases (reference): if you want to override hostname resolution, use the HostAliases part of the Pod spec rather than directly modifying /etc/hosts since the kubelet “owns” /etc/hosts.
  • IPv6 (reference): I just wanted to link to this without writing anything special since the reference link is short.
  • Overview: storage in containers is ephemeral; a restart/crash will cause you to lose everything that existed in the container. Volumes are essentially just directories that mount at particular points in each container in a pod.
    • Lifetime (reference): a volume’s lifetime is tied to the pod, not individual containers in the pod.
  • Types of volumes (reference): there are ~30 types of volumes in the reference list, e.g. local volumes, ones for AWS/Azure/GCP.
  • subPath (reference): if you want to use a volume across multiple containers in a pod, then you can specify a subPath that will get exposed per container (e.g. “/foo/bar” for container1, “/baz/qux” for container2).
  • External (AKA “Out-of-tree”) volume plugins (reference): container storage is formally specified in the Container Storage Interface (CSI). Any plugins not included in Kubernetes directly are considered out-of-tree and must be based on CSI (or FlexVolume, which preceded CSI).
  • Mount propagation (reference): with mount propagation, you can share volumes between containers in different pods. It can also be used to share them between containers in the same pod, but I believe you could just use mountPath for that scenario.
  • Ephemeral volumes (reference, slightly better reference): volumes that do not persist after a pod ceases to exist. This is typically good for configuration values or secrets.
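  • A minimal sketch of a pod-scoped emptyDir volume shared by two containers (names are made up); it disappears along with the pod, per the lifetime note above:
spec:
  volumes:
  - name: shared-data
    emptyDir: {}
  containers:
  - name: writer
    image: busybox:1.28
    command: ["sh", "-c", "echo hello > /data/msg && sleep 3600"]
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: reader
    image: busybox:1.28
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: shared-data
      mountPath: /input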
  • Overview: it’s an abstraction over volumes that isn’t tied to the lifecycle of an individual pod.
    • PersistentVolumes are API objects (and also apparently they’re volume plugins) that capture the implementation of the storage itself.
    • PersistentVolumeClaims are requests for storage by a user that consume PV resources (meaning they must be matched to a PV). PVCs request a specific size and an access mode (e.g. ReadWriteOnce and ReadOnlyMany).
  • Using a persistent volume (reference): just like with regular volumes, your application code doesn’t know anything about PVs. Instead, your pod spec references a PVC as a volume and mounts it at a particular path; Kubernetes finds a PV that satisfies the claim, binds it, and mounts it into the pod so that the application can simply read and write at that path (see the sketch below).
  • In-use protection (reference): there’s some level of protection to prevent a PVC from being removed from Kubernetes while it’s in active use.
  • Reclaiming (reference): if you’re done with a PV, you can either Retain (until you manually delete it) or you can Delete it. This is controlled by the persistentVolumeReclaimPolicy field in the PV spec.
  • Resizing PVCs (reference): note that the PVC itself is not resizing, but that it can cause the underlying PV to resize if it’s supported.
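  • A sketch of the claim-then-mount flow described above (the names and size are made up):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

# In the pod spec, reference the claim and mount it; the app just sees a path
spec:
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-claim
  containers:
  - name: my-app
    image: my-app:1.0
    volumeMounts:
    - name: data
      mountPath: /var/lib/data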
  • Overview: the page itself lists out the best practices. I don’t think I’d add any value by copying them here.
  • ConfigMap (reference): these are for non-secret key-value pairs. The max size of a ConfigMap is 1 MB (reference).
    • Accessing a ConfigMap from a pod (reference): ConfigMaps can be consumed by Pods as environment variables, command-line arguments, or as files on a volume. A fourth option is to directly interface with the Kubernetes API to read the ConfigMap (reference).
      • Updates to ConfigMap while the Pod is running: if you directly read from the Kubernetes API (reference) or use a mounted volume (reference), you can get updated configuration values. If you use environment variables or command-line arguments, then they will not be automatically updated.
        • Immutable ConfigMaps (reference): if you have tons of ConfigMaps, marking them as immutable can provide a performance benefit so that the API Server doesn’t need to keep watching them.
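    • A minimal ConfigMap sketch consumed as environment variables (the keys and names are made up); remember that environment variables won’t see later updates:
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: debug

# In the pod spec
containers:
- name: my-app
  image: my-app:1.0
  envFrom:
  - configMapRef:
      name: app-config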
  • Secrets (reference): these are for storing sensitive data like passwords, tokens, keys, etc. They function very similarly to ConfigMaps but with extra protections and extra risks.
    • Accessing secrets from a pod (reference): secrets can be made accessible via files on a volume, environment variables, or by the kubelet when pulling images for the Pod.
      • Updates to secrets while the Pod is running (reference): just like with ConfigMap, using a mounted volume will provide automatic updates to the secrets.
        • Immutable secrets (reference): marking secrets immutable can provide a performance benefit since the API Server won’t have to keep watching them.
    • ServiceAccounts (reference): Kubernetes automatically creates secrets for its own API operations for authentication. Sometimes, it makes more sense to give access for something to a service account rather than a controller (reference).
  • Resources (reference):
    • Min + max: resources (typically CPU and memory) can be specified in the form of an optional minimum and optional maximum (reference explaining that the minimum is optional). The scheduler works to find a node with at least the requested amount of resources, then the kubelet and container runtime make sure the pod doesn’t exceed the maximum allowed.
    • CPU resources (reference): the amount of CPU requested is in terms of vCPUs and can be fractional (e.g. “0.5” means “half a vCPU”). These units are absolute, so 0.5 means the same thing on all machines whether they have 1 or 48 cores. An alternative notation to “0.5” is “500m”, which means “500 millicpu” or “500 millicores”. The “m” notation is preferred (see the sketch below).
    • Extended resources (reference): compute resources (memory and CPU) are the most common, but you can extend the system to advertise/request other resources. Examples include nodes with a TPU or GPU (reference) (e.g. you could make a resource named “nvidia.com/gpu” (reference)).
    • Resource quotas (reference): a cluster administrator can control resource usage per namespace, so that, for example, team A can’t just use all of the compute resources and prevent team B from running their own pods.
      • Limiting standard resources (reference): you can set a count on how many services, secrets, jobs, pods, PVCs, etc. are allowed via the quota system.
      • Quotas are in absolute units (reference): this means that if you were to add more resources to your cluster, it wouldn’t affect your quotas unless you had a custom controller running that would adjust those quotas for you.
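    • A sketch of requests (the minimum the scheduler looks for) and limits (the maximum that gets enforced); the numbers are made up:
containers:
- name: my-app
  image: my-app:1.0
  resources:
    requests:
      cpu: 500m            # half a vCPU
      memory: 256Mi
    limits:
      cpu: "1"
      memory: 512Mi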
  • kubeconfig files (reference): any file used to configure access to clusters is called a kubeconfig file (it’s just a term, not a special file name (by default, it’s ~/.kube/config)). I believe this feature is used either when you have multiple clusters or you have completely different setups (e.g. a staging cluster that can eventually act as a production cluster).
  • The 4 Cs (reference): cloud → cluster → container → code. Layers on the left must be secure in order for layers on the right to be secure (e.g. if your cloud is compromised, then it doesn’t matter if your code is rock-solid). For specifics on each layer, check the reference link. I’ve highlighted some here just for the sake of example:
    • Cloud security (reference): make sure you don’t publicly allow access to the control plane, make sure etcd is encrypted at rest and communication to/from it is secured with TLS, make sure the Cloud Controller Manager is given the least set of privileges that it needs to the cloud platform itself, etc.
    • Container security (reference): ensure you trust everything inside the container (e.g. the OS itself, any programs used like MySQL, etc.), make sure the users running inside the container have the least privilege, etc.
  • SecurityContext (reference): a SecurityContext is for containers to specify which user to run as, which group to run as, whether or not to run as privileged, whether to filter a process’s system calls, etc. SecurityContext is part of the pod’s manifest.
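A minimal sketch of where these settings live; the specific users, groups, and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:            # pod level: applies to every container in the pod
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
    - name: app
      image: nginx
      securityContext:        # container level: overrides/extends the pod level
        runAsNonRoot: true
        privileged: false
        allowPrivilegeEscalation: false
        seccompProfile:       # filters the process's system calls
          type: RuntimeDefault
        capabilities:
          drop: ["ALL"]
```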
  • PodSecurityPolicy (reference): these are control-plane mechanisms used to enforce specific settings in a SecurityContext. Because they live in the control plane, they’re a cluster-level resource. Pods cannot run in the cluster without meeting the set of conditions defined in the policy (reference); see the sketch after this list.
    • AdmissionController (reference): this is an optional controller that acts on the creation or modification of pods to see if they should be admitted into the cluster based on the security context and policies.
      • Admission controllers beyond pods (reference): these are like gatekeepers that intercept API requests and can change or deny the requests altogether.
      • Dynamic admission control (reference): this concept of admission can be extended via webhooks either for mutation or validation of any resource (including custom ones), not just Pods (reference). These are different from controllers just in how you access them (reference).
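A minimal sketch of a restrictive policy, assuming the PodSecurityPolicy admission plugin is enabled in the cluster; the name and the allowed volume types are illustrative:

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false                  # reject pods that ask for privileged containers
  allowPrivilegeEscalation: false
  runAsUser:
    rule: MustRunAsNonRoot           # force a non-root SecurityContext
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:                           # only these volume types are allowed
    - configMap
    - secret
    - emptyDir
    - persistentVolumeClaim
```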
  • Overview (reference): the scheduler is in charge of finding the best node for each Pod that gets created. It does so by filtering each feasible node, scoring them, then binding the Pod to the node with the highest score.
    • Scheduling criteria (reference): various factors are taken into account for scheduling, e.g. individual and collective resource requirements, policy constraints, affinity, anti-affinity, data locality, etc.
  • The scheduler itself is in a container (reference): the scheduler is a system component that runs in a container.
  • Replaceability (reference, reference2): you can always write your own scheduler if you need something custom. As shown in this link, you can extend the scheduler at various points like PreFilter, Filter, PostFilter, Reserve, Bind, PostBind, etc.
  • Taints and tolerations (reference): node affinity is a property of Pods that attracts them to nodes; taints are the opposite: they let a node repel pods. Tolerations are applied to pods, while taints are applied to the nodes themselves. A pod cannot be scheduled on a tainted node unless it has a specific toleration for that taint. (Note that taints and tolerations are not synonyms for affinities/anti-affinities; see the notes below and the sketch after this list.)
    • Format (reference): taints and tolerations are both triplets of key, value, and an effect. The key and value are just strings, e.g. key: “animal” value: “cat”. You can compare equality of values (“only schedule on giraffes”), or you can just check the existence (“only schedule if ‘animal’ is present”). The effect is NoSchedule, NoExecute, or PreferNoSchedule (see this SO post for differentiations).
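A minimal sketch using the “animal”/“cat” example above; in practice you’d usually add the taint with “kubectl taint nodes” rather than editing the Node object directly:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1
spec:
  taints:
    - key: "animal"
      value: "cat"
      effect: "NoSchedule"     # pods without a matching toleration won't be scheduled here
---
apiVersion: v1
kind: Pod
metadata:
  name: cat-friendly-pod
spec:
  tolerations:
    - key: "animal"
      operator: "Equal"        # or "Exists" to only check that the key is present
      value: "cat"
      effect: "NoSchedule"
  containers:
    - name: app
      image: nginx
```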
  • Assigning pods to nodes (reference): if, for example, you just want a pod to always run on a node with an SSD, then you’d put a label on the node and then use nodeSelector in the Pod spec to choose only those labeled nodes. I.e. you don’t need to use taints/tolerations.
    • Assigning pods to nodes by nodename (reference): you can assign a pod directly to a node by the node’s name, but this isn’t recommended for a bunch of reasons listed at the reference link.
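A minimal sketch of the SSD example, assuming you’ve already labeled a node (e.g. with “kubectl label nodes <node-name> disktype=ssd”):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd        # only nodes carrying this label are considered
  containers:
    - name: app
      image: nginx
```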
  • Affinities / anti-affinities (reference): again, these are not the same as taints/tolerations. These are properties of a Pod’s spec that can attract or repel them to/from either nodes or other pods.
    • Inter-pod affinity/anti-affinity example (reference): suppose you have a web server and a redis cache. You want the redis cache to be co-located with the web servers as much as possible so that caching is quick, but you don’t want multiple web servers or redis caches to be on the same node in case that node goes down. To solve this, you would set up an inter-pod affinity between the web server and the redis cache, and you’d set up inter-pod anti-affinities within each group (i.e. the web server pods can’t be placed with other web server pods). A sketch follows below.
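Here’s a hedged sketch of the web-server half of that setup; the labels (“app: web-server”, “app: redis-cache”) are made up, and kubernetes.io/hostname as the topology key means “per node”:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      affinity:
        podAntiAffinity:        # never put two web servers on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["web-server"]
              topologyKey: kubernetes.io/hostname
        podAffinity:            # do co-locate each web server with a redis cache
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["redis-cache"]
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: nginx
```

The redis Deployment would carry the mirror-image anti-affinity against “app: redis-cache”.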
  • Pod Overhead (reference): the containers within your pod request a total of X CPU and Y memory, but the Pod infrastructure itself also consumes resources (let’s call it Z). As of 1.18, Pod Overhead is on by default, so the scheduler will consider Z in the scheduling equation.
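Overhead is declared on a RuntimeClass, and pods opt in by setting runtimeClassName in their spec. A minimal sketch (the handler name and the numbers are illustrative):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-containers
handler: kata               # must match a handler configured in the container runtime
overhead:
  podFixed:                 # the "Z" the scheduler adds on top of container requests
    cpu: "250m"
    memory: "120Mi"
```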
  • What happens during an eviction? (reference): the kubelet marks the pod as failed during an eviction, so if it was part of a Deployment, the Deployment will create a replacement Pod to be scheduled.
  • Performance tuning (reference): when clusters become very large, you may want to tune performance more toward low latency (i.e. placing new Pods quickly) or accuracy (make stronger placement decisions). You can configure this via percentageOfNodesToScore.
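A minimal sketch of that knob, assuming the file is passed to kube-scheduler via --config; the apiVersion varies with your Kubernetes version:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
# Score only half of the feasible nodes before binding: faster placement,
# slightly less optimal decisions. 0 lets Kubernetes pick a sensible default.
percentageOfNodesToScore: 50
```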
  • Overview (reference): cluster administration covers everything from planning to managing a Kubernetes cluster.
  • Certificates (reference): for certificate authentication, you’ll want to generate and self-sign certs using easyrsa, openssl, or cfssl. The reference link contains examples for each tool. If you can’t use self-signed certificates (e.g. because the client node refuses to recognize them), then you’ll need a proper certificate authority like AWS, DigiCert, etc.
  • Managing resources (reference): this page talks about a few simple concepts:
    • Manage Kubernetes through files: this page about management techniques talks about how you can manage Kubernetes imperatively or declaratively and that it’s probably best for large production clusters to do things declaratively. The reference link touches on that again briefly.
    • Bulk-manage through kubectl (reference): you can do more than just create resources in bulk via kubectl—you can extract resource names in order to perform other operations, e.g. mass-deletions.
    • Properly label resources (reference): make sure you’re labeling practically everything, and not just with a unique label every time, but with helpful categories like “tier: backend”, “role: primary”, etc. Together, a set of labels doesn’t have to uniquely identify a single resource.
  • Pod-to-pod networking (reference): this page talks about many different networking implementations (typically third-party) for pod-to-pod communication. It also talks about the core requirements of the Kubernetes network model, e.g. each Pod gets its own IP address, and Pods can communicate with all other Pods in the cluster without using NAT.
  • Logging (reference): the container runtime typically has its own way of collecting logs, but that’s usually not sufficient for cluster-level logging since you want your logging to be resilient across crashed containers, evicted pods, dead nodes, etc.
    • Kubernetes does not handle log rotation (reference): you’ll need to make sure this is handled (e.g. by Docker itself) so that you don’t consume all available storage on a node.
    • Cluster-level logging solutions (reference): this page describes three high-level approaches to logging:
      • A node-level logging agent on every node (reference): the logging agent would typically be a container on every node that has access to a directory with log files from all application containers on that node. This would typically be deployed as a DaemonSet so that a copy runs on each node.
        • This is the most common approach to implementing cluster-level logging.
      • Using a sidecar container (reference):
        • Streaming to a sidecar container (reference): a sidecar container would read logs from somewhere (a shared file, a socket, journald, etc.) and then print the logs to stdout/stderr. Since stdout/stderr are handled by the kubelet, you can use “kubectl logs”. You would then leave rotation and retention to the kubelet itself (see the sketch after this list).
        • Sidecar container with a logging agent (reference): I don’t see how this is significantly different from a node-level logging agent running on every node.
      • Exposing logs directly from the application (reference): you can push logs directly to a back-end (e.g. a database or AWS-S3-like storage) from within the application.
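A minimal sketch of the streaming-sidecar approach mentioned above: the app writes to a file on a shared emptyDir volume and the sidecar tails it to stdout, so the kubelet (and “kubectl logs”) takes over from there. Image, paths, and names are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  containers:
    - name: app                      # writes its log to a file, not to stdout
      image: busybox
      command: ["/bin/sh", "-c"]
      args:
        - 'while true; do echo "$(date) doing work" >> /var/log/app/app.log; sleep 1; done'
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
    - name: log-streamer             # the sidecar: file -> stdout
      image: busybox
      command: ["/bin/sh", "-c"]
      args:
        - 'tail -n+1 -F /var/log/app/app.log'
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  volumes:
    - name: logs
      emptyDir: {}
```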
    • System logs (reference): related to cluster-level logging, the system components themselves record events happening in the cluster. This is done with klog.
  • Metrics (reference): metrics are typically exposed by system components on the /metrics HTTP endpoint. The reference link goes into more detail about the metrics lifecycle and the Prometheus format.
  • Flow control (AKA “API Priority and Fairness” AKA “APF”) (reference): when you face a potential overload situation, you have to make sure a flood of inbound requests doesn’t crash the API Server. APF helps via priority levels, queuing (which is in addition to priority levels), and exemptions (some requests will always make it to the API Server).
  • Overview (reference): CRDs on their own are just data (reference). For example, a CRD can define a resource type like “SpacePod”, at which point Kubernetes will let you make and delete resources of “kind: SpacePod”. However, that data doesn’t really have any meaning or function until you make a controller to do something with the resources. With a controller monitoring custom resources, you gain a declarative API since you can then describe your custom resource via a YAML file, create it, and the custom controller will manage whatever it means to have that custom resource (e.g. it launches a SpacePod into space for you).
    • Example CRD: see this page. The definition is really pretty simple.
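A hedged sketch of what a CRD for the made-up “SpacePod” type could look like; the group (“example.com”) and the spec fields are invented for illustration:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: spacepods.example.com    # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: spacepods
    singular: spacepod
    kind: SpacePod
    shortNames: ["sp"]
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                destination:
                  type: string
                crewSize:
                  type: integer
```

Once this is applied, “kubectl get spacepods” works, but as noted above nothing actually happens with those objects until a custom controller watches them.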
  • Controllers without CRDs: controllers don’t have to define a CRD to be useful; they can work with built-in resource types. For example, HAProxy made a Helm chart to install the HAProxy Kubernetes ingress controller (reference), and their GitHub doesn’t contain “CustomResourceDefinition” anywhere.
  • Overview (reference): this is a way of extending Kubernetes with more APIs than what the API Server provides by itself.
    • Implementation (reference): you add an APIService object to “claim” a URL. Then, anything sent to that URL will be proxied to the APIService, which is typically run in Pods just like your application itself would be.
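A minimal sketch of that registration, assuming a made-up metrics.example.com group served by Pods behind a Service named custom-metrics-apiserver:

```yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.metrics.example.com   # claims /apis/metrics.example.com/v1alpha1
spec:
  group: metrics.example.com
  version: v1alpha1
  service:                             # where the API Server proxies matching requests
    name: custom-metrics-apiserver
    namespace: custom-metrics
    port: 443
  insecureSkipTLSVerify: true          # fine for a sketch; use caBundle in real setups
  groupPriorityMinimum: 100
  versionPriority: 100
```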
  • Standalone APIs (reference): these refer to APIs that you define and expose yourself. For a contrived example, maybe you’re running a forums site and you want to turn off account creation via PATCH /account_creation. The API or web layer of your overall application could be the one to handle that, meaning administrators would route traffic through Kubernetes’ DNS into the server that handles the API.
  • Aggregation layer vs. CRDs (reference): both concepts are for extending Kubernetes, and they’re not mutually exclusive. The reference link has some breakdowns of specific features, so I’ll stick to a high level here:
    • CRDs are just data: remember that on their own, CRDs don’t really do anything; you need to add custom controllers as well.
    • Custom controllers run in the control plane: as such, they’re managed by Kubernetes, which can probably save you some headache when it comes to updates, bug fixes, and versioning.
    • Aggregated APIs are more flexible: there’s a host of features that you can only do through aggregated APIs, although most sound like edge cases.
    • Commonalities: there are many things you get from either choice (CRDs or AAs) (reference).
  • Overview (reference): note that an operator itself is basically a synonym for a controller (reference). The operator pattern refers to what controllers do: they run a control loop that constantly monitors some chunk of your cluster’s state, then they can make changes based on that state. It’s just like a human operator sitting at a control panel.