Welcome read-cache-after-write consistency!
TL;DR: In version 5.3.0 we introduced strong consistency guarantees for updates with a new API. You can now update resources (both your custom resource and managed resources) and the framework will guarantee that these updates will be instantly visible when accessing resources from caches, and naturally also for subsequent reconciliations.
I briefly talked about this topic at KubeCon last year.
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {
ConfigMap managedConfigMap = prepareConfigMap(webPage);
// apply the resource with new API
context.resourceOperations().serverSideApply(managedConfigMap);
// fresh resource instantly available from our update in the caches
var upToDateResource = context.getSecondaryResource(ConfigMap.class);
// from now on built-in update methods by default use this feature;
// it is guaranteed that resource changes will be visible for next reconciliation
return UpdateControl.patchStatus(alterStatusObject(webPage));
}
In addition to that, the framework will automatically filter events for your own updates, so they don’t trigger the reconciliation again.
This should significantly simplify controller development, and will make reconciliation much simpler to reason about!
This post will deep dive into this topic, exploring the details and rationale behind it.
See the related umbrella issue on GitHub.
Informers and eventual consistency
First, we have to understand a fundamental building block of Kubernetes operators: Informers. Since there is plentiful accessible information about this topic, here’s a brief summary. Informers:
- Watch Kubernetes resources — the K8S API sends events if a resource changes to the client through a websocket. An event usually contains the whole resource. (There are some exceptions, see Bookmarks.) See details about watch as a K8S API concept in the official docs.
- Cache the latest state of the resource.
- If an informer receives an event in which the metadata.resourceVersion is different from the version in the cached resource, it calls the event handler, thus in our case triggering the reconciliation.
A controller is usually composed of multiple informers: one tracking the primary resource, and additional informers registered for each (secondary) resource we manage. Informers are great since we don’t have to poll the Kubernetes API — it is push-based. They also provide a cache, so reconciliations are very fast since they work on top of cached resources.
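The dispatch logic from the last bullet can be sketched in a few lines of plain Java. This is a simplified model with hypothetical names (MiniInformer, Resource), not the actual fabric8 informer implementation: the cache always stores the latest state, but the handler only fires when the resourceVersion differs from the cached one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Simplified model of an informer's event dispatch: the incoming resource is
// always stored, but the event handler is only invoked when the incoming
// resourceVersion differs from the cached one.
class MiniInformer {
  record Resource(String name, String resourceVersion) {}

  private final Map<String, Resource> cache = new ConcurrentHashMap<>();
  private final Consumer<Resource> eventHandler;

  MiniInformer(Consumer<Resource> eventHandler) {
    this.eventHandler = eventHandler;
  }

  void onWatchEvent(Resource incoming) {
    Resource cached = cache.get(incoming.name());
    cache.put(incoming.name(), incoming); // always cache the latest state
    if (cached == null || !cached.resourceVersion().equals(incoming.resourceVersion())) {
      eventHandler.accept(incoming); // in a controller: trigger reconciliation
    }
  }

  Resource get(String name) {
    return cache.get(name);
  }
}
```

Note that a repeated event with the same resourceVersion (as can happen after a re-list) updates the cache but does not re-trigger the handler.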
Now let’s take a look at the flow when we update a resource:
graph LR
subgraph Controller
Informer:::informer
Cache[(Cache)]:::teal
Reconciler:::reconciler
Informer -->|stores| Cache
Reconciler -->|reads| Cache
end
K8S[⎈ Kubernetes API Server]:::k8s
Informer -->|watches| K8S
Reconciler -->|updates| K8S
classDef informer fill:#C0527A,stroke:#8C3057,color:#fff
classDef reconciler fill:#E8873A,stroke:#B05E1F,color:#fff
classDef teal fill:#3AAFA9,stroke:#2B807B,color:#fff
classDef k8s fill:#326CE5,stroke:#1A4AAF,color:#fff

It is easy to see that the cache of the informer is eventually consistent with the update we sent from the reconciler. It usually takes only a very short time (a few milliseconds) to sync the caches and everything is fine. Well, sometimes it isn’t. The websocket can be disconnected (which actually happens on purpose sometimes), the API Server can be slow, etc.
The problem(s) we try to solve
Let’s consider an operator with the following requirements:
- we have a custom resource PrefixedPod where the spec contains only one field: podNamePrefix
- the goal of the operator is to create a Pod with a name that has the prefix and a random suffix
- it should never run two Pods at once; if the podNamePrefix changes, it should delete the current Pod and then create a new one
- the status of the custom resource should contain the generatedPodName
How the code would look in 5.2.x:
public UpdateControl<PrefixedPod> reconcile(PrefixedPod primary, Context<PrefixedPod> context) {
Optional<Pod> currentPod = context.getSecondaryResource(Pod.class);
if (currentPod.isPresent()) {
if (podNameHasPrefix(primary.getSpec().getPodNamePrefix(), currentPod.get())) {
// all ok we can return
return UpdateControl.noUpdate();
} else {
// deletes the current pod with different name pattern
context.getClient().resource(currentPod.get()).delete();
// return; pod delete event will trigger the reconciliation
return UpdateControl.noUpdate();
}
} else {
// creates new pod
var newPod = context.getClient().resource(createPodWithOwnerReference(primary)).serverSideApply();
return UpdateControl.patchStatus(setGeneratedPodNameToStatus(primary, newPod));
}
}
@Override
public List<EventSource<?, PrefixedPod>> prepareEventSources(EventSourceContext<PrefixedPod> context) {
// Code omitted for adding InformerEventSource for the Pod
}
That is quite simple: if there is a Pod with a different name prefix we delete it, otherwise we create the Pod and update the status. The Pod is created with an owner reference, so any update on the Pod will trigger the reconciliation.
Now consider the following sequence of events:
1. We create a PrefixedPod with spec.podNamePrefix: first-pod-prefix.
2. Concurrently:
   - The reconciliation logic runs and creates a Pod with a generated name suffix: “first-pod-prefix-a3j3ka”; it also sets this in the status and updates the custom resource status.
   - While the reconciliation is running, we update the custom resource to have the value second-pod-prefix.
3. The update of the custom resource triggers the reconciliation.
- The update of the custom resource triggers the reconciliation.
When the spec change triggers the reconciliation in point 3, there is absolutely no guarantee that:
- the created Pod will already be visible — currentPod might simply be empty
- the status.generatedPodName will be visible
Since both are backed by an informer and the caches of those informers are only eventually consistent with our updates, the next reconciliation would create a new Pod, violating the requirement to not have two Pods running at the same time. In addition, the controller would override the status. Although in the case of a Kubernetes resource we can still find the existing Pods later via owner references, if we were managing a non-Kubernetes (external) resource we would not notice that we had already created one.
So can we have stronger guarantees regarding caches? It turns out we can now…
Achieving read-cache-after-write consistency
When we send an update (this also applies to various create and patch requests) to the Kubernetes API, in the response
we receive the up-to-date resource with the resource version that is the most recent at that point.
The idea is that we can cache this response in a cache on top of the Informer’s cache.
We call this cache TemporaryResourceCache (TRC), and besides caching such responses, it also plays a role in event filtering
as we will see later.
Note that the challenge in the past was knowing when to evict this response from the TRC. Eventually,
we will receive an event in the informer and the informer cache will be populated with an up-to-date resource.
But it was not possible to reliably tell whether an event contained a resource that was the result
of an update before or after our own update. The reason is that the Kubernetes documentation stated that
metadata.resourceVersion should be treated as an opaque string and matched only with equality.
With optimistic locking, though, we were able to overcome this issue — see this blog post.
This changed in the Kubernetes guidelines. Now, if we can parse the resourceVersion as an integer,
we can use numerical comparison. See the related KEP.
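A defensive comparison along these lines (a hypothetical helper, not JOSDK's actual code) can fall back to the old opaque-string semantics when a version does not parse as a number:

```java
import java.util.Optional;

// Compares two resourceVersions numerically when both parse as integers, as
// the newer Kubernetes guidelines allow. Returns an empty Optional when either
// is non-numeric; in that case only equality checks are safe.
class ResourceVersions {
  static Optional<Integer> compareNumerically(String a, String b) {
    try {
      return Optional.of(Long.compare(Long.parseLong(a), Long.parseLong(b)));
    } catch (NumberFormatException e) {
      return Optional.empty(); // opaque versions: only equality is meaningful
    }
  }
}
```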
From this point the idea of the algorithm is very simple:
- After updating a Kubernetes resource, cache the response in the TRC.
- When the informer propagates an event, check if its resource version is greater than or equal to the one in the TRC. If yes, evict the resource from the TRC.
- When the controller reads a resource from cache, it checks the TRC first, then falls back to the Informer’s cache.
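The three steps above can be modeled in plain Java. This is a deliberately simplified sketch with hypothetical names (CachingModel, Resource); the real TemporaryResourceCache also participates in event filtering and handles many more edge cases:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the TRC algorithm: cache update responses, evict once
// the informer catches up, and read through the TRC before the informer cache.
class CachingModel {
  record Resource(String name, long resourceVersion) {}

  final Map<String, Resource> informerCache = new ConcurrentHashMap<>();
  final Map<String, Resource> trc = new ConcurrentHashMap<>();

  // 1. after a successful update, cache the API server's response in the TRC
  void onUpdateResponse(Resource updated) {
    trc.put(updated.name(), updated);
  }

  // 2. on an informer event, evict the TRC entry once the event caught up
  void onInformerEvent(Resource fromEvent) {
    informerCache.put(fromEvent.name(), fromEvent);
    trc.computeIfPresent(fromEvent.name(),
        (n, cached) -> fromEvent.resourceVersion() >= cached.resourceVersion() ? null : cached);
  }

  // 3. reads check the TRC first, then fall back to the informer cache
  Optional<Resource> get(String name) {
    var cached = trc.get(name);
    return cached != null ? Optional.of(cached) : Optional.ofNullable(informerCache.get(name));
  }
}
```

With this shape, a read issued right after an update always sees at least the version the API server returned, even while the informer cache is still stale.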
sequenceDiagram
box rgba(50,108,229,0.1)
participant K8S as ⎈ Kubernetes API Server
end
box rgba(232,135,58,0.1)
participant R as Reconciler
end
box rgba(58,175,169,0.1)
participant I as Informer
participant IC as Informer Cache
participant TRC as Temporary Resource Cache
end
R->>K8S: 1. Update resource
K8S-->>R: Updated resource (with new resourceVersion)
R->>TRC: 2. Cache updated resource in TRC
I-)K8S: 3. Watch event (resource updated)
I->>TRC: On event: event resourceVersion ≥ TRC version?
alt Yes: event is up-to-date
I-->>TRC: Evict resource from TRC
else No: stale event
Note over TRC: TRC entry retained
end
R->>TRC: 4. Read resource from cache
alt Resource found in TRC
TRC-->>R: Return cached resource
else Not in TRC
R->>IC: Read from Informer Cache
IC-->>R: Return resource
end

Filtering events for our own updates
When we update a resource, eventually the informer will propagate an event that would trigger a reconciliation. However, this is usually not desired: since we already have the up-to-date resource at that point, we would like to be notified only if the resource changes after our own update. Therefore, in addition to caching the resource, we also filter out events that carry a resource version older than or equal to our cached resource version.
Note that the implementation of this is relatively complex, since while performing the update we want to record all the events received in the meantime and decide whether to propagate them further once the update request is complete.
However, this way we significantly reduce the number of reconciliations, making the whole process much more efficient.
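The filtering decision itself reduces to a resource-version comparison against our last own update. The following is a minimal sketch with hypothetical names; the real implementation additionally buffers events that arrive while an update request is in flight, as described above:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Decides whether an informer event should trigger a reconciliation: events
// carrying a resourceVersion at or below our own last update are swallowed,
// anything newer (i.e. a change made by someone else afterwards) propagates.
class OwnUpdateFilter {
  private final Map<String, Long> lastOwnUpdate = new ConcurrentHashMap<>();

  void recordOwnUpdate(String name, long resourceVersion) {
    lastOwnUpdate.merge(name, resourceVersion, Math::max);
  }

  boolean shouldPropagate(String name, long eventResourceVersion) {
    Long own = lastOwnUpdate.get(name);
    return own == null || eventResourceVersion > own;
  }
}
```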
The case for instant reschedule
We realize that some of our users might rely on the fact that reconciliation is triggered by their own updates. To support backwards compatibility, or rather a migration path, we now provide a way to instruct the framework to queue an instant reconciliation:
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {
// omitted reconciliation logic
return UpdateControl.<WebPage>noUpdate().reschedule();
}
Additional considerations and alternatives
An alternative approach would be to not trigger the next reconciliation until the target resource appears in the Informer’s cache. The upside is that we don’t have to maintain an additional cache of the resource, just the target resource version; therefore this approach might have a smaller memory footprint, but not necessarily. See the related KEP that takes this approach.
On the other hand, when we make a request, the response object is always deserialized regardless of whether we are going to cache it or not. This object in most cases will be cached for a very short time and later garbage collected. Therefore, the memory overhead should be minimal.
Having the TRC has an additional advantage: since we have the resource instantly in our caches, we can elegantly continue the reconciliation in the same pass and reconcile resources that depend on the latest state. More concretely, this also helps with our Dependent Resources / Workflows which rely on up-to-date caches. In this sense, this approach is much more optimal regarding throughput.
Conclusion
I personally worked on a prototype of an operator that depended on an unreleased version of JOSDK already implementing these features. The most obvious gain was how much simpler the reasoning became in some cases and how it reduced the corner cases that we would otherwise have to solve with the expectation pattern or other facilities.
Special thanks
I would like to thank all the contributors who directly or indirectly contributed, including metacosm, manusa, and xstefank.
Last but certainly not least, special thanks to Steven Hawkins, who maintains the Informer implementation in the fabric8 Kubernetes client and implemented the first version of the algorithms. We then iterated on it together multiple times. Covering all the edge cases was quite an effort. Just as a highlight, I’ll mention the last one.
Thank you!
Related
- Same initiative in golang controller-runtime
- Comparable Resource Versions in Kubernetes
- Stale Controller Handling KEP
How to guarantee allocated values for next reconciliation
The read-cache-after-write consistency feature (introduced in version 5.3.0) replaces this functionality.
It also covers secondary resources, and optimistic locking is no longer required. See the docs and related blog post for details.
We recently released v5.1 of Java Operator SDK (JOSDK). One of the highlights of this release is related to the topic of so-called allocated values.
To describe the problem, let’s say that our controller needs to create a resource that has a generated identifier, i.e.
a resource whose identifier cannot be directly derived from the custom resource’s desired state as specified in its
spec field. To record the fact that the resource was successfully created, and to avoid attempting to
recreate it in subsequent reconciliations, it is typical for this type of controller to store the
generated identifier in the custom resource’s status field.
The Java Operator SDK relies on the informers’ cache to retrieve resources. These caches, however, are only guaranteed to be eventually consistent. If some other event triggers a new reconciliation before the update made to our resource status has had a chance to propagate first to the cluster and then back to the informer cache, the resource in the informer cache will not yet contain the latest version as modified by the reconciler. That reconciliation would then see a status without the generated identifier and would therefore attempt to create the resource again, which is not what we’d like.
Java Operator SDK now provides a utility class PrimaryUpdateAndCacheUtils
to handle this particular use case. Using that overlay cache, your reconciler is guaranteed to see the most up-to-date
version of the resource on the next reconciliation:
@Override
public UpdateControl<StatusPatchCacheCustomResource> reconcile(
StatusPatchCacheCustomResource resource,
Context<StatusPatchCacheCustomResource> context) {
// omitted code
var freshCopy = createFreshCopy(resource); // need fresh copy just because we use the SSA version of update
freshCopy
.getStatus()
.setValue(statusWithAllocatedValue());
// using the utility instead of update control to patch the resource status
var updated =
PrimaryUpdateAndCacheUtils.ssaPatchStatusAndCacheResource(resource, freshCopy, context);
return UpdateControl.noUpdate();
}
How does PrimaryUpdateAndCacheUtils work?
There are multiple ways to solve this problem, but ultimately, we only provide the solution described below. If you
want to dig deeper into the alternatives, see
this PR.
The trick is to intercept the resource that the reconciler updated and cache that version in an additional cache on top of the informer’s cache. Subsequently, if the reconciler needs to read the resource, the SDK will first check if it is in the overlay cache and read it from there if present, otherwise read it from the informer’s cache. If the informer receives an event with a fresh resource, we always remove the resource from the overlay cache, since that is a more recent resource.
This works only if the reconciler updates the resource using optimistic locking. If the update fails on conflict, because the resource has already been updated on the cluster before our update got through, we simply wait and poll the informer cache until the new resource version from the server appears there, and then try to apply our changes again on top of that version, again with optimistic locking.
So why is optimistic locking required? We hinted at it above, but the gist is that if another party updates the resource before we get a chance to, we wouldn’t be able to handle the resulting situation correctly in all cases. The informer would receive that new event before our own update would get a chance to propagate. Without optimistic locking, there wouldn’t be a fail-proof way to determine which update should prevail (i.e. which occurred first), in particular in the event of the informer losing the connection to the cluster or other edge cases (the joys of distributed computing!).
Optimistic locking simplifies the situation and provides us with stronger guarantees: if the update succeeds, then we can be sure we have the proper resource version in our caches. The next event will contain our update in all cases. Because we know that, we can also be sure that we can evict the cached resource in the overlay cache whenever we receive a new event. The overlay cache is only used if the SDK detects that the original resource (i.e. the one before we applied our status update in the example above) is still in the informer’s cache.
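The retry loop at the heart of this can be sketched roughly as follows, in plain Java with injected functions standing in for the Kubernetes client and the informer cache. All names here are illustrative and not the actual PrimaryUpdateAndCacheUtils internals; in particular, the real implementation polls the informer cache until a newer version actually appears, which this sketch abstracts into the read supplier:

```java
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

class OptimisticLockingUpdater {
  // stands in for the client's conflict error on a resourceVersion mismatch (HTTP 409)
  static class ConflictException extends RuntimeException {}

  // Repeatedly applies `mutation` to the freshest known resource and submits it
  // with optimistic locking; on conflict, re-reads from the informer cache
  // (which will eventually contain the conflicting update) and retries.
  static <R> R updateWithRetry(
      Supplier<R> informerCacheRead,   // latest version from the informer cache
      UnaryOperator<R> mutation,       // our desired change, e.g. a status update
      Function<R, R> updateWithLock,   // sends the update; throws ConflictException on 409
      int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      R fresh = informerCacheRead.get();
      try {
        return updateWithLock.apply(mutation.apply(fresh));
      } catch (ConflictException e) {
        // someone else updated the resource first: re-read and retry
      }
    }
    throw new IllegalStateException("update did not succeed after " + maxAttempts + " attempts");
  }
}
```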
The following diagram sums up the process:
flowchart TD
A["Update Resource with Lock"] --> B{"Is Successful"}
B -- Fails on conflict --> D["Poll the Informer cache until resource updated"]
D --> A
B -- Yes --> n2{"Original resource still in informer cache?"}
n2 -- Yes --> C["Cache the resource in overlay cache"]
n2 -- No --> n3["Informer cache already contains up-to-date version, do not use overlay cache"]

From legacy approach to server-side apply
From version 5 of Java Operator SDK, server-side apply is a first-class feature and is used by default to update resources. As we will see, unfortunately (or fortunately), using it requires changes to your reconciler implementation.
For this reason, we prepared a feature flag, which you can flip if you are not prepared to migrate yet:
ConfigurationService.useSSAToPatchPrimaryResource
Setting this flag to false will make the operations done by UpdateControl use the former (non-SSA) approach.
Similarly, the finalizer handling won’t utilize SSA.
The plan is to keep this flag and allow the use of the former approach (non-SSA) also in future releases.
For dependent resources, a separate flag exists (this was true also before v5) to use SSA or not:
ConfigurationService.ssaBasedCreateUpdateMatchForDependentResources
Resource handling without and with SSA
Until version 5, changing primary resources through UpdateControl did not use server-side apply.
So usually, the implementation of the reconciler looked something like this:
@Override
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {
reconcileLogicForManagedResources(webPage);
webPage.setStatus(updatedStatusForWebPage(webPage));
return UpdateControl.patchStatus(webPage);
}
In other words, after the reconciliation of managed resources, the reconciler updates the status of the primary resource passed as an argument to the reconciler. Such changes on the primary are fine since we don’t work directly with the cached object; the argument is already cloned.
So, how does this change with SSA? For SSA, the updates should contain (only) the “fully specified intent”. In other words, we should only fill in the values we care about. In practice, it means creating a fresh copy of the resource and setting only what is necessary:
@Override
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {
reconcileLogicForManagedResources(webPage);
WebPage statusPatch = new WebPage();
statusPatch.setMetadata(new ObjectMetaBuilder()
.withName(webPage.getMetadata().getName())
.withNamespace(webPage.getMetadata().getNamespace())
.build());
statusPatch.setStatus(updatedStatusForWebPage(webPage));
return UpdateControl.patchStatus(statusPatch);
}
Note that we just filled out the status here since we patched the status (not the resource spec). Since the status is a sub-resource in Kubernetes, it will only update the status part.
Every controller you register will have its default field manager.
You can override the field manager name using ControllerConfiguration.fieldManager.
That will set the field manager for the primary resource and dependent resources as well.
Migrating to SSA
Using either the legacy or the new SSA-based resource management works well. However, migrating existing resources to SSA might be a challenge. We strongly recommend testing the migration, i.e. implementing an integration test where a custom resource is created using the legacy approach and then managed by the new approach.
We prepared an integration test to demonstrate how such migration, even in a simple case, can go wrong, and how to fix it.
To fix some cases, you might need to strip managed fields from the custom resource.
See StatusPatchSSAMigrationIT for details.
Feel free to report common issues, so we can prepare some utilities to handle them.
Optimistic concurrency control
When you update a resource with SSA as mentioned above, the framework will apply changes even if the underlying resource or status subresource changed while the reconciliation was running. First, it always forces conflicts in the background, as advised in the Kubernetes docs. In addition, since the resource version is not set, it won’t do optimistic locking. If you still want optimistic locking for the patch, use the resource version of the original resource:
@Override
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {
reconcileLogicForManagedResources(webPage);
WebPage statusPatch = new WebPage();
statusPatch.setMetadata(new ObjectMetaBuilder()
.withName(webPage.getMetadata().getName())
.withNamespace(webPage.getMetadata().getNamespace())
.withResourceVersion(webPage.getMetadata().getResourceVersion())
.build());
statusPatch.setStatus(updatedStatusForWebPage(webPage));
return UpdateControl.patchStatus(statusPatch);
}
Using k8s' ETCD as your application DB
FAQ: Is Kubernetes’ ETCD the Right Database for My Application?
Answer
While the idea of moving your application data to Custom Resources (CRs) aligns with the “Cloud Native” philosophy, it often introduces more challenges than benefits. Let’s break it down:
Top Reasons Why Storing Data in ETCD Through CRs Looks Appealing
- Storing application data as CRs enables treating your application’s data like infrastructure:
- GitOps compatibility: Declarative content can be stored in Git repositories, ensuring reproducibility.
- Infrastructure alignment: Application data can follow the same workflow as other infrastructure components.
Challenges of Using Kubernetes’ ETCD as Your Application’s Database
Technical Limitations:
Data Size Limitations 🔴:
- Each CR is capped at 1.5 MB by default. Raising this limit is possible but impacts cluster performance.
- Kubernetes ETCD has a storage cap of 2 GB by default. Adjusting this limit affects the cluster globally, with potential performance degradation.
API Server Load Considerations 🟡:
- The Kubernetes API server is designed to handle infrastructure-level requests.
- Storing application data in CRs might add significant load to the API server, requiring it to be scaled appropriately to handle both infrastructure and application demands.
- This added load can impact cluster performance and increase operational complexity.
Guarantees 🟡:
- There is no built-in support for efficient queries beyond label and field selectors, so they are hard to implement.
- ACID properties are hard to leverage; transactional guarantees mostly hold only in read-only mode.
Operational Impact:
Lost Flexibility 🟡:
- Modifying application data requires complex YAML editing and full redeployment.
- This contrasts with traditional databases that often feature user-friendly web UIs or APIs for real-time updates.
Infrastructure Complexity 🟠:
- Backup, restore, and lifecycle management for application data are typically separate from deployment workflows.
- Storing both in ETCD mixes these concerns, complicating operations and standardization.
Security:
- Governance and Security 🔴:
- Sensitive data stored in plain YAML may lack adequate encryption or access controls.
- Applying governance policies over text-based files can become a significant challenge.
When Might Using CRs Make Sense?
For small, safe subsets of data—such as application configurations—using CRs might be appropriate. However, this approach requires a detailed evaluation of the trade-offs.
Conclusion
While it’s tempting to unify application data with infrastructure control via CRs, this introduces risks that can outweigh the benefits. For most applications, separating concerns by using a dedicated database is the more robust, scalable, and manageable solution.
A Practical Example
A typical “user” described in JSON:
{
"username": "myname",
"enabled": true,
"email": "myname@test.com",
"firstName": "MyFirstName",
"lastName": "MyLastName",
"credentials": [
{
"type": "password",
"value": "test"
},
{
"type": "token",
"value": "oidc"
}
],
"realmRoles": [
"user",
"viewer",
"admin"
],
"clientRoles": {
"account": [
"view-profile",
"change-group",
"manage-account"
]
}
}
This example represents about 0.5 KB of data, meaning (with standard settings) a maximum of ~2000 users can be defined in the same CR. Additionally:
- It contains sensitive information, which should be securely stored.
- Regulatory rules (like GDPR) apply.