This is the blog section. It has two categories: News and Releases.
TL;DR: In version 5.3.0 we introduced strong consistency guarantees for updates with a new API. You can now update resources (both your custom resource and managed resources) and the framework will guarantee that these updates will be instantly visible when accessing resources from caches, and naturally also for subsequent reconciliations.
I briefly talked about this topic at KubeCon last year.
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {
    ConfigMap managedConfigMap = prepareConfigMap(webPage);
    // apply the resource with new API
    context.resourceOperations().serverSideApply(managedConfigMap);
    // fresh resource instantly available from our update in the caches
    var upToDateResource = context.getSecondaryResource(ConfigMap.class);
    // from now on built-in update methods by default use this feature;
    // it is guaranteed that resource changes will be visible for next reconciliation
    return UpdateControl.patchStatus(alterStatusObject(webPage));
}
In addition to that, the framework will automatically filter events for your own updates, so they don’t trigger the reconciliation again.
This should significantly simplify controller development, and will make reconciliation much simpler to reason about!
This post will deep dive into this topic, exploring the details and rationale behind it.
See the related umbrella issue on GitHub.
First, we have to understand a fundamental building block of Kubernetes operators: Informers. Since there is plentiful accessible information about this topic, here’s a brief summary. An informer watches a resource type through the Kubernetes API and maintains a local cache of those resources. When it receives a watch event whose metadata.resourceVersion is different from the version in the cached resource, it calls the event handler, thus in our case triggering the reconciliation.

A controller is usually composed of multiple informers: one tracking the primary resource, and additional informers registered for each (secondary) resource we manage. Informers are great since we don’t have to poll the Kubernetes API — it is push-based. They also provide a cache, so reconciliations are very fast since they work on top of cached resources.
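The event-dispatch behavior described above can be sketched in plain Java. This is purely illustrative (the names and shape are not the actual fabric8 or JOSDK implementation): the cache is updated on every watch event, and the handler fires only when the resourceVersion actually changed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

class InformerSketch<R> {
    record Cached<R>(String resourceVersion, R resource) {}

    private final Map<String, Cached<R>> cache = new HashMap<>();
    private final Consumer<R> eventHandler; // in our case, triggers the reconciliation

    InformerSketch(Consumer<R> eventHandler) {
        this.eventHandler = eventHandler;
    }

    void onWatchEvent(String id, String resourceVersion, R resource) {
        Cached<R> existing = cache.get(id);
        // the cache is always refreshed with the resource from the event
        cache.put(id, new Cached<>(resourceVersion, resource));
        // the handler only fires if the resourceVersion differs from the cached one
        if (existing == null || !existing.resourceVersion().equals(resourceVersion)) {
            eventHandler.accept(resource);
        }
    }

    R get(String id) {
        Cached<R> c = cache.get(id);
        return c == null ? null : c.resource();
    }
}
```

Note how a repeated event with the same resourceVersion updates the cache but does not re-trigger the handler.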
Now let’s take a look at the flow when we update a resource:
graph LR
subgraph Controller
Informer:::informer
Cache[(Cache)]:::teal
Reconciler:::reconciler
Informer -->|stores| Cache
Reconciler -->|reads| Cache
end
K8S[⎈ Kubernetes API Server]:::k8s
Informer -->|watches| K8S
Reconciler -->|updates| K8S
classDef informer fill:#C0527A,stroke:#8C3057,color:#fff
classDef reconciler fill:#E8873A,stroke:#B05E1F,color:#fff
classDef teal fill:#3AAFA9,stroke:#2B807B,color:#fff
classDef k8s fill:#326CE5,stroke:#1A4AAF,color:#fff

It is easy to see that the cache of the informer is eventually consistent with the update we sent from the reconciler. It usually takes only a very short time (a few milliseconds) to sync the caches and everything is fine. Well, sometimes it isn’t. The websocket can be disconnected (which actually happens on purpose sometimes), the API Server can be slow, etc.
Let’s consider an operator with the following requirements:
- It manages a custom resource PrefixedPod where the spec contains only one field: podNamePrefix.
- When podNamePrefix changes, it should delete the current Pod and then create a new one; there must never be two managed Pods running at the same time.
- The name of the created Pod is recorded in the status field generatedPodName.

How the code would look in 5.2.x:
public UpdateControl<PrefixedPod> reconcile(PrefixedPod primary, Context<PrefixedPod> context) {
    Optional<Pod> currentPod = context.getSecondaryResource(Pod.class);
    if (currentPod.isPresent()) {
        if (podNameHasPrefix(primary.getSpec().getPodNamePrefix(), currentPod.get())) {
            // all ok, we can return
            return UpdateControl.noUpdate();
        } else {
            // delete the current pod with the different name pattern
            context.getClient().resource(currentPod.get()).delete();
            // return; the pod delete event will trigger the reconciliation
            return UpdateControl.noUpdate();
        }
    } else {
        // create a new pod
        var newPod = context.getClient().resource(createPodWithOwnerReference(primary)).serverSideApply();
        return UpdateControl.patchStatus(setGeneratedPodNameToStatus(primary, newPod));
    }
}
@Override
public List<EventSource<?, PrefixedPod>> prepareEventSources(EventSourceContext<PrefixedPod> context) {
    // Code omitted for adding InformerEventSource for the Pod
}
That is quite simple: if there is a Pod with a different name prefix we delete it, otherwise we create the Pod and update the status. The Pod is created with an owner reference, so any update on the Pod will trigger the reconciliation.
Now consider the following sequence of events:
1. We create a PrefixedPod with spec.podNamePrefix: first-pod-prefix.
2. The reconciler creates the Pod and patches the status with the generated Pod name.
3. We change spec.podNamePrefix to second-pod-prefix.

When the spec change triggers the reconciliation in point 3, there is absolutely no guarantee that:

- the previously created Pod is already in the informer’s cache (currentPod might simply be empty)
- our patch of status.generatedPodName will be visible

Since both are backed by an informer and the caches of those informers are only eventually consistent with our updates, the next reconciliation would create a new Pod, violating the requirement to not have two Pods running at the same time. In addition, the controller would override the status. Although in the case of a Kubernetes resource we can still find the existing Pods later via owner references, if we were managing a non-Kubernetes (external) resource we would not notice that we had already created one.
So can we have stronger guarantees regarding caches? It turns out we can now…
When we send an update (this also applies to various create and patch requests) to the Kubernetes API, in the response
we receive the up-to-date resource with the resource version that is the most recent at that point.
The idea is that we can cache this response in a cache on top of the Informer’s cache.
We call this cache TemporaryResourceCache (TRC), and besides caching such responses, it also plays a role in event filtering
as we will see later.
Note that the challenge in the past was knowing when to evict this response from the TRC. Eventually,
we will receive an event in the informer and the informer cache will be populated with an up-to-date resource.
But it was not possible to reliably tell whether an event contained a resource that was the result
of an update before or after our own update. The reason is that the Kubernetes documentation stated that
metadata.resourceVersion should be treated as an opaque string and matched only with equality.
Although with optimistic locking we were able to overcome this issue — see this blog post.
This changed in the Kubernetes guidelines. Now, if we can parse the resourceVersion as an integer,
we can use numerical comparison. See the related KEP.
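Assuming the comparison rule described in the KEP, the freshness check can be sketched like this (illustrative names, not the framework’s actual code): parse both resourceVersions as integers if possible and compare numerically, otherwise fall back to treating them as opaque strings that only support equality.

```java
import java.util.OptionalLong;

class ResourceVersionSketch {

    static OptionalLong parse(String resourceVersion) {
        try {
            return OptionalLong.of(Long.parseLong(resourceVersion));
        } catch (NumberFormatException e) {
            return OptionalLong.empty(); // not an integer: must be treated as opaque
        }
    }

    // Returns true if `eventRv` is at least as fresh as `cachedRv`.
    static boolean isUpToDate(String eventRv, String cachedRv) {
        var e = parse(eventRv);
        var c = parse(cachedRv);
        if (e.isPresent() && c.isPresent()) {
            return e.getAsLong() >= c.getAsLong(); // numerical comparison
        }
        return eventRv.equals(cachedRv); // opaque versions: equality only
    }
}
```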
From this point the idea of the algorithm is very simple:
sequenceDiagram
box rgba(50,108,229,0.1)
participant K8S as ⎈ Kubernetes API Server
end
box rgba(232,135,58,0.1)
participant R as Reconciler
end
box rgba(58,175,169,0.1)
participant I as Informer
participant IC as Informer Cache
participant TRC as Temporary Resource Cache
end
R->>K8S: 1. Update resource
K8S-->>R: Updated resource (with new resourceVersion)
R->>TRC: 2. Cache updated resource in TRC
I-)K8S: 3. Watch event (resource updated)
I->>TRC: On event: event resourceVersion ≥ TRC version?
alt Yes: event is up-to-date
I-->>TRC: Evict resource from TRC
else No: stale event
Note over TRC: TRC entry retained
end
R->>TRC: 4. Read resource from cache
alt Resource found in TRC
TRC-->>R: Return cached resource
else Not in TRC
R->>IC: Read from Informer Cache
IC-->>R: Return resource
end

When we update a resource, eventually the informer will propagate an event that would trigger a reconciliation. However, this is mostly not desired. Since we already have the up-to-date resource at that point, we would like to be notified only if the resource is changed after our change. Therefore, in addition to caching the resource, we also filter out events that contain a resource version older than or equal to our cached resource version.
Note that the implementation of this is relatively complex, since while performing the update we want to record all the events received in the meantime and decide whether to propagate them further once the update request is complete.
However, this way we significantly reduce the number of reconciliations, making the whole process much more efficient.
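Combining the caching and the event filtering, a minimal self-contained sketch could look like the following. The names are illustrative (this is not the actual TemporaryResourceCache, which handles far more edge cases, including events arriving while the update request is still in flight):

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

class TemporaryResourceCacheSketch<R> {
    record Versioned<R>(long resourceVersion, R resource) {}

    private final Map<String, Versioned<R>> temporary = new ConcurrentHashMap<>();
    private final Map<String, Versioned<R>> informerCache = new ConcurrentHashMap<>();

    // Called with the response of our own update request.
    void onOwnUpdateResponse(String id, long resourceVersion, R resource) {
        temporary.put(id, new Versioned<>(resourceVersion, resource));
    }

    // Called for every informer event; returns true if the event should be
    // propagated (i.e. it is not the echo of our own update, nor older).
    boolean onInformerEvent(String id, long resourceVersion, R resource) {
        informerCache.put(id, new Versioned<>(resourceVersion, resource));
        Versioned<R> cached = temporary.get(id);
        if (cached == null) return true;
        if (resourceVersion >= cached.resourceVersion()) {
            temporary.remove(id); // informer caught up: evict from the TRC
            return resourceVersion > cached.resourceVersion(); // equal RV = our own echo
        }
        return false; // stale event: filter out, keep the TRC entry
    }

    // Reads check the temporary cache first, then fall back to the informer cache.
    Optional<R> get(String id) {
        Versioned<R> t = temporary.get(id);
        if (t != null) return Optional.of(t.resource());
        return Optional.ofNullable(informerCache.get(id)).map(Versioned::resource);
    }
}
```

Note how the echo of our own update evicts the cached entry without being propagated, while a genuinely newer event both evicts it and triggers a reconciliation.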
We realize that some of our users might rely on the fact that reconciliation is triggered by their own updates. To support backwards compatibility, or rather a migration path, we now provide a way to instruct the framework to queue an instant reconciliation:
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {
    // omitted reconciliation logic
    return UpdateControl.<WebPage>noUpdate().reschedule();
}
An alternative approach would be to not trigger the next reconciliation until the target resource appears in the Informer’s cache. The upside is that we don’t have to maintain an additional cache of the resource, just the target resource version; therefore this approach might have a smaller memory footprint, but not necessarily. See the related KEP that takes this approach.
On the other hand, when we make a request, the response object is always deserialized regardless of whether we are going to cache it or not. This object in most cases will be cached for a very short time and later garbage collected. Therefore, the memory overhead should be minimal.
Having the TRC has an additional advantage: since we have the resource instantly in our caches, we can elegantly continue the reconciliation in the same pass and reconcile resources that depend on the latest state. More concretely, this also helps with our Dependent Resources / Workflows, which rely on up-to-date caches. In this sense, this approach is much better in terms of throughput.
I personally worked on a prototype of an operator that depended on an unreleased version of JOSDK already implementing these features. The most obvious gain was how much simpler the reasoning became in some cases and how it reduced the corner cases that we would otherwise have to solve with the expectation pattern or other facilities.
I would like to thank all the contributors who directly or indirectly contributed, including metacosm, manusa, and xstefank.
Last but certainly not least, special thanks to Steven Hawkins, who maintains the Informer implementation in the fabric8 Kubernetes client and implemented the first version of the algorithms. We then iterated on it together multiple times. Covering all the edge cases was quite an effort. Just as a highlight, I’ll mention the last one.
Thank you!
The read-cache-after-write consistency feature replaces this functionality (since version 5.3.0). It also covers secondary resources, and optimistic locking is no longer required. See the docs and related blog post for details.
We recently released v5.1 of Java Operator SDK (JOSDK). One of the highlights of this release is related to a topic of so-called allocated values.
To describe the problem, let’s say that our controller needs to create a resource that has a generated identifier, i.e.
a resource which identifier cannot be directly derived from the custom resource’s desired state as specified in its
spec field. To record the fact that the resource was successfully created, and to avoid attempting to
recreate the resource again in subsequent reconciliations, it is typical for this type of controller to store the
generated identifier in the custom resource’s status field.
The Java Operator SDK relies on the informers’ cache to retrieve resources. These caches, however, are only guaranteed to be eventually consistent. If another event triggers a new reconciliation before the update made to our resource status has had the chance to propagate to the cluster and back to the informer cache, the resource in that cache will not contain the latest version as modified by the reconciler. The new reconciliation would then not see the generated identifier in the resource status and would therefore attempt to create the resource again, which is not what we’d like.
Java Operator SDK now provides a utility class PrimaryUpdateAndCacheUtils
to handle this particular use case. It maintains an overlay cache on top of the informer’s cache; using it, your reconciler is guaranteed to see the most up-to-date
version of the resource on the next reconciliation:
@Override
public UpdateControl<StatusPatchCacheCustomResource> reconcile(
    StatusPatchCacheCustomResource resource,
    Context<StatusPatchCacheCustomResource> context) {
  // omitted code

  var freshCopy = createFreshCopy(resource); // need fresh copy just because we use the SSA version of update
  freshCopy.getStatus().setValue(statusWithAllocatedValue());

  // using the utility instead of update control to patch the resource status
  var updated =
      PrimaryUpdateAndCacheUtils.ssaPatchStatusAndCacheResource(resource, freshCopy, context);
  return UpdateControl.noUpdate();
}
How does PrimaryUpdateAndCacheUtils work?
There are multiple ways to solve this problem, but ultimately we only provide the solution described below. If you want to dig deeper into the alternatives, see this PR.
The trick is to intercept the resource that the reconciler updated and cache that version in an additional cache on top of the informer’s cache. Subsequently, if the reconciler needs to read the resource, the SDK will first check if it is in the overlay cache and read it from there if present, otherwise read it from the informer’s cache. If the informer receives an event with a fresh resource, we always remove the resource from the overlay cache, since that is a more recent resource. But this works only if the reconciler updates the resource using optimistic locking. If the update fails on conflict, because the resource has already been updated on the cluster before we got the chance to get our update in, we simply wait and poll the informer cache until the new resource version from the server appears in the informer’s cache, and then try to apply our updates to the resource again using the updated version from the server, again with optimistic locking.
So why is optimistic locking required? We hinted at it above, but the gist of it is that if another party updates the resource before we do, we wouldn’t be able to handle the resulting situation correctly in all cases. The informer would receive that other party’s event before our own update got a chance to propagate. Without optimistic locking, there would be no fail-proof way to determine which update should prevail (i.e. which occurred first), in particular if the informer loses its connection to the cluster, or in other edge cases (the joys of distributed computing!).
Optimistic locking simplifies the situation and provides us with stronger guarantees: if the update succeeds, then we can be sure we have the proper resource version in our caches. The next event will contain our update in all cases. Because we know that, we can also be sure that we can evict the cached resource in the overlay cache whenever we receive a new event. The overlay cache is only used if the SDK detects that the original resource (i.e. the one before we applied our status update in the example above) is still in the informer’s cache.
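The update-with-lock and retry-on-conflict loop described above can be sketched as follows. This is a simplified, self-contained simulation with illustrative names; the real PrimaryUpdateAndCacheUtils handles many more edge cases and does not look like this internally.

```java
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

class OptimisticUpdateSketch {
    record Resource(long resourceVersion, String status) {}

    interface Api { Resource updateWithLock(Resource r) throws ConflictException; }

    static class ConflictException extends Exception {}

    static Resource updateAndCache(
            Supplier<Resource> informerCache,   // latest version seen by the informer
            UnaryOperator<Resource> modify,     // applies our status change
            Api api) throws InterruptedException {
        Resource base = informerCache.get();
        while (true) {
            try {
                // optimistic locking: the server rejects the update if `base` is stale
                return api.updateWithLock(modify.apply(base));
            } catch (ConflictException e) {
                // conflict: poll the informer cache until a newer version shows up,
                // then re-apply our modification on top of it and retry
                long staleVersion = base.resourceVersion();
                do {
                    Thread.sleep(10);
                    base = informerCache.get();
                } while (base.resourceVersion() <= staleVersion);
            }
        }
    }
}
```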
The following diagram sums up the process:
flowchart TD
A["Update Resource with Lock"] --> B{"Is Successful"}
B -- Fails on conflict --> D["Poll the Informer cache until resource updated"]
D --> A
B -- Yes --> n2{"Original resource still in informer cache?"}
n2 -- Yes --> C["Cache the resource in overlay cache"]
n2 -- No --> n3["Informer cache already contains up-to-date version, do not use overlay cache"]

From version 5 of Java Operator SDK, server-side apply is a first-class feature and is used by default to update resources. As we will see, unfortunately (or fortunately), using it requires changes in your reconciler implementation.
For this reason, we prepared a feature flag, which you can flip if you are not prepared to migrate yet:
ConfigurationService.useSSAToPatchPrimaryResource
Setting this flag to false will make the operations done by UpdateControl use the former (non-SSA) approach.
Similarly, the finalizer handling won’t use SSA.
The plan is to keep this flag and allow the use of the former approach (non-SSA) also in future releases.
For dependent resources, a separate flag exists (this was true also before v5) to use SSA or not:
ConfigurationService.ssaBasedCreateUpdateMatchForDependentResources
Until version 5, changing primary resources through UpdateControl did not use server-side apply.
So usually, the implementation of the reconciler looked something like this:
@Override
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {
    reconcileLogicForManagedResources(webPage);
    webPage.setStatus(updatedStatusForWebPage(webPage));
    return UpdateControl.patchStatus(webPage);
}
In other words, after the reconciliation of managed resources, the reconciler updates the status of the primary resource passed as an argument to the reconciler. Such changes on the primary are fine since we don’t work directly with the cached object, the argument is already cloned.
So, how does this change with SSA? For SSA, the updates should contain (only) the “fully specified intent”. In other words, we should only fill in the values we care about. In practice, it means creating a fresh copy of the resource and setting only what is necessary:
@Override
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {
    reconcileLogicForManagedResources(webPage);
    WebPage statusPatch = new WebPage();
    statusPatch.setMetadata(new ObjectMetaBuilder()
        .withName(webPage.getMetadata().getName())
        .withNamespace(webPage.getMetadata().getNamespace())
        .build());
    statusPatch.setStatus(updatedStatusForWebPage(webPage));
    return UpdateControl.patchStatus(statusPatch);
}
Note that we only filled in the status here, since we are patching the status (not the resource spec). Because the status is a subresource in Kubernetes, the patch will only update the status part.
Every controller you register will have its default field manager.
You can override the field manager name using ControllerConfiguration.fieldManager.
That will set the field manager for the primary resource and dependent resources as well.
Using either the legacy or the new SSA way of resource management works well. However, migrating existing resources to SSA might be a challenge. We strongly recommend testing the migration by implementing an integration test where a custom resource is created using the legacy approach and is then managed by the new approach.
We prepared an integration test to demonstrate how such migration, even in a simple case, can go wrong, and how to fix it.
To fix some cases, you might need to strip managed fields from the custom resource.
See StatusPatchSSAMigrationIT for details.
Feel free to report common issues, so we can prepare some utilities to handle them.
When you create a resource for SSA as mentioned above, the framework will apply changes even if the underlying resource or status subresource changed while the reconciliation was running. First, it always forces conflicts, as advised in the Kubernetes docs. In addition, since the resource version is not set, it won’t do optimistic locking. If you still want optimistic locking for the patch, use the resource version of the original resource:
@Override
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {
    reconcileLogicForManagedResources(webPage);
    WebPage statusPatch = new WebPage();
    statusPatch.setMetadata(new ObjectMetaBuilder()
        .withName(webPage.getMetadata().getName())
        .withNamespace(webPage.getMetadata().getNamespace())
        .withResourceVersion(webPage.getMetadata().getResourceVersion())
        .build());
    statusPatch.setStatus(updatedStatusForWebPage(webPage));
    return UpdateControl.patchStatus(statusPatch);
}
While the idea of moving your application data to Custom Resources (CRs) aligns with the “Cloud Native” philosophy, it often introduces more challenges than benefits. Let’s break it down:
Data Size Limitations 🔴:
API Server Load Considerations 🟡:
Guarantees 🟡:
Lost Flexibility 🟡:
Infrastructure Complexity 🟠:
For small, safe subsets of data—such as application configurations—using CRs might be appropriate. However, this approach requires a detailed evaluation of the trade-offs.
While it’s tempting to unify application data with infrastructure control via CRs, this introduces risks that can outweigh the benefits. For most applications, separating concerns by using a dedicated database is the more robust, scalable, and manageable solution.
A typical “user” described in JSON:
{
  "username": "myname",
  "enabled": true,
  "email": "myname@test.com",
  "firstName": "MyFirstName",
  "lastName": "MyLastName",
  "credentials": [
    {
      "type": "password",
      "value": "test"
    },
    {
      "type": "token",
      "value": "oidc"
    }
  ],
  "realmRoles": [
    "user",
    "viewer",
    "admin"
  ],
  "clientRoles": {
    "account": [
      "view-profile",
      "change-group",
      "manage-account"
    ]
  }
}
This example represents about 0.5 KB of data, meaning (with standard settings) a maximum of ~2000 users can be defined in the same CR.
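As a back-of-the-envelope check of that "~2000 users" figure, we can divide an assumed per-object budget by the per-user size. The ~1 MB budget is an assumption standing in for the default object size limits (etcd's default request limit is about 1.5 MiB, and in practice some headroom is needed):

```java
public class CrCapacity {

    // how many ~bytesPerUser entries fit into a budgetBytes-sized object
    static long maxUsers(long budgetBytes, long bytesPerUser) {
        return budgetBytes / bytesPerUser;
    }

    public static void main(String[] args) {
        // ~0.5 KB per user, as in the JSON example above; ~1 MB assumed budget
        System.out.println(maxUsers(1_000_000L, 512L)); // roughly 2000
    }
}
```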
We’re pleased to announce the release of Java Operator SDK v5.3.0! This minor version brings two headline features — read-cache-after-write consistency and a new metrics implementation — along with a configuration adapter system, MDC improvements, and a number of smaller improvements and cleanups.
This is the headline feature of 5.3. Informer caches are inherently eventually consistent: after your reconciler updates a resource, there is a window of time before the change is visible in the cache. This can cause subtle bugs, particularly when storing allocated values in the status sub-resource and reading them back in the next reconciliation.
From 5.3.0, the framework provides two guarantees when you use
ResourceOperations
(accessible from Context):

- updates made through it are instantly visible in the caches for subsequent reads and for the next reconciliation
- events resulting from your own updates are filtered out, so they don’t trigger the reconciliation again

UpdateControl and ErrorStatusUpdateControl use this automatically. Secondary resources benefit
via context.resourceOperations():
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {
    ConfigMap managedConfigMap = prepareConfigMap(webPage);
    // update is cached and will suppress the resulting event
    context.resourceOperations().serverSideApply(managedConfigMap);
    // fresh resource instantly available from the cache
    var upToDateResource = context.getSecondaryResource(ConfigMap.class);
    makeStatusChanges(webPage);
    // UpdateControl also uses this by default
    return UpdateControl.patchStatus(webPage);
}
If your reconciler relied on being re-triggered by its own writes, a new reschedule() method on
UpdateControl lets you explicitly request an immediate re-queue.
Note:
InformerEventSource.list(..) bypasses the additional caches and will not reflect in-flight updates. Use context.getSecondaryResources(..) or InformerEventSource.get(ResourceID) instead.
See the related blog post and reconciler docs for details.
A new micrometer-based Metrics implementation designed with low cardinality in mind. All meters
are scoped to the controller, not to individual resources, avoiding unbounded cardinality growth as
resources come and go.
MeterRegistry registry; // initialize your registry
Metrics metrics = MicrometerMetricsV2.newBuilder(registry).build();
Operator operator = new Operator(client, o -> o.withMetrics(metrics));
Optionally attach a namespace tag to per-reconciliation counters (disabled by default):
Metrics metrics = MicrometerMetricsV2.newBuilder(registry)
.withNamespaceAsTag()
.build();
The full list of meters:
| Meter | Type | Description |
|---|---|---|
| reconciliations.active | gauge | Reconciler executions currently running |
| reconciliations.queue | gauge | Resources queued for reconciliation |
| custom_resources | gauge | Resources tracked by the controller |
| reconciliations.execution.duration | timer | Execution duration with explicit histogram buckets |
| reconciliations.started.total | counter | Reconciliations started |
| reconciliations.success.total | counter | Successful reconciliations |
| reconciliations.failure.total | counter | Failed reconciliations |
| reconciliations.retries.total | counter | Retry attempts |
| events.received | counter | Kubernetes events received |
The execution timer uses explicit bucket boundaries (10ms–30s) to ensure compatibility with
histogram_quantile() in both PrometheusMeterRegistry and OtlpMeterRegistry.
A ready-to-use Grafana dashboard is included at
observability/josdk-operator-metrics-dashboard.json.
The
metrics-processing sample operator
provides a complete end-to-end setup with Prometheus, Grafana, and an OpenTelemetry Collector,
installable via observability/install-observability.sh. This is a good starting point for
verifying metrics in a real cluster.
Deprecated: The original MicrometerMetrics (V1) is deprecated as of 5.3.0. It attaches resource-specific metadata as tags to every meter, causing unbounded cardinality. Migrate to MicrometerMetricsV2.
See the observability docs for the full reference.
A new ConfigLoader bridges any key-value configuration source to the JOSDK operator and
controller configuration APIs. This lets you drive operator behaviour from environment variables,
system properties, YAML files, or any config library without writing glue code by hand.
The default instance stacks environment variables over system properties out of the box:
Operator operator = new Operator(ConfigLoader.getDefault().applyConfigs());
Built-in providers: EnvVarConfigProvider, PropertiesConfigProvider, YamlConfigProvider,
and AggregatePriorityListConfigProvider for explicit priority ordering.
ConfigProvider is a single-method interface, so adapting any config library (MicroProfile Config,
SmallRye Config, etc.) takes only a few lines:
public class SmallRyeConfigProvider implements ConfigProvider {

    private final SmallRyeConfig config;

    @Override
    public <T> Optional<T> getValue(String key, Class<T> type) {
        return config.getOptionalValue(key, type);
    }
}
Pass the results when constructing the operator and registering reconcilers:
var configLoader = new ConfigLoader(new SmallRyeConfigProvider(smallRyeConfig));
Operator operator = new Operator(configLoader.applyConfigs());
operator.register(new MyReconciler(), configLoader.applyControllerConfigs(MyReconciler.NAME));
See the configuration docs for the full list of supported keys.
Note: This new configuration mechanism is useful when using the SDK by itself. Framework integrations (Spring Boot, Quarkus, …) usually provide their own configuration mechanisms that should be used instead.
MDC in workflow execution: MDC context is now propagated through workflow (dependent resource graph) execution threads, not just the top-level reconciler thread. Logging from dependent resources now carries the same contextual fields as the primary reconciliation.
NO_NAMESPACE for cluster-scoped resources: Instead of omitting the resource.namespace MDC
key for cluster-scoped resources, the framework now emits MDCUtils.NO_NAMESPACE. This makes log
queries for cluster-scoped resources reliable.
When multiple event sources manage the same resource type, context.getSecondaryResources(..) now
returns a de-duplicated stream. When the same resource appears from more than one source, only the
copy with the highest resource version is returned.
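The de-duplication rule can be sketched as follows: group the resources coming from all sources by ID and, when the same ID appears more than once, keep the copy with the highest resource version. The names are illustrative, not the framework's internal code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class DedupSketch {
    record Res(String id, long resourceVersion) {}

    static List<Res> dedupe(List<Res> fromAllSources) {
        // merge duplicates per ID, preferring the higher resourceVersion
        Map<String, Res> byId = fromAllSources.stream()
            .collect(Collectors.toMap(
                Res::id,
                r -> r,
                (a, b) -> a.resourceVersion() >= b.resourceVersion() ? a : b));
        return List.copyOf(byId.values());
    }
}
```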
Dependent resources now record their desired state in the Context during reconciliation. This allows reconcilers and
downstream dependents in a workflow to inspect what a dependent resource computed as its desired state and guarantees
that the desired state is computed only once per reconciliation.
Informer health checks no longer rely on isWatching. For readiness and startup probes, you should
primarily use hasSynced. Once an informer has started, isWatching is not suitable for liveness
checks.
- createOrReplace is deprecated; a locking-based createOrUpdate avoids conflicts under concurrent updates.
- KubernetesDependentResource uses ResourceOperations directly, removing an indirection layer and automatically benefiting from the read-after-write guarantees.
- ManagedInformerEventSource.getCachedValue() is deprecated: use context.getSecondaryResource(..) instead.
- exitOnStopLeading is being prepared for removal from the public API.

The JUnit extension artifact has been renamed:

<!-- before -->
<artifactId>operator-framework-junit-5</artifactId>
<!-- after -->
<artifactId>operator-framework-junit</artifactId>
Metrics interface renames:

| v5.2 | v5.3 |
|---|---|
| reconcileCustomResource | reconciliationSubmitted |
| reconciliationExecutionStarted | reconciliationStarted |
| reconciliationExecutionFinished | reconciliationSucceeded |
| failedReconciliation | reconciliationFailed |
| finishedReconciliation | reconciliationFinished |
| cleanupDoneFor | cleanupDone |
| receivedEvent | eventReceived |
reconciliationFinished(..) is extended with RetryInfo. monitorSizeOf(..) is removed.
ResourceAction relocated: ResourceAction in io.javaoperatorsdk.operator.processing.event.source.controller has been removed. Use io.javaoperatorsdk.operator.processing.event.source.ResourceAction instead.
See the full migration guide for details.
<dependency>
  <groupId>io.javaoperatorsdk</groupId>
  <artifactId>operator-framework</artifactId>
  <version>5.3.0</version>
</dependency>
See the comparison view for the full list of changes.
Please report issues or suggest improvements on our GitHub repository.
Happy operator building! 🚀
We’re pleased to announce the release of Java Operator SDK v5.2! This minor version brings several powerful new features and improvements that enhance the framework’s capabilities for building Kubernetes operators. This release focuses on flexibility, external resource management, and advanced reconciliation patterns.
One of the most significant improvements in 5.2 is the introduction of a unified approach to working with custom ID types
across the framework through ResourceIDMapper
and ResourceIDProvider.
Previously, when working with external resources (non-Kubernetes resources), the framework assumed resource IDs could always be represented as strings. This limitation made it challenging to work with external systems that use complex ID types.
Now, you can define custom ID types for your external resources by implementing the ResourceIDProvider interface:
public class MyExternalResource implements ResourceIDProvider<MyCustomID> {

    @Override
    public MyCustomID getResourceID() {
        return new MyCustomID(this.id);
    }
}
This capability is integrated across multiple components:
- ExternalResourceCachingEventSource
- ExternalBulkDependentResource
- AbstractExternalDependentResource and its subclasses

If you cannot modify the external resource class (e.g., it’s generated or final), you can provide a custom
ResourceIDMapper to the components above.
See the migration guide for detailed migration instructions.
Version 5.2 introduces a new execution mode that provides finer control over when reconciliation occurs. By setting
triggerReconcilerOnAllEvents
to true, your reconcile method will be called for every event, including Delete events.
This is particularly useful when:
When enabled:
- your reconcile method receives the last known state of the resource, even if it has been deleted
- you can check whether the primary resource was deleted via Context.isPrimaryResourceDeleted()
- PrimaryUpdateAndCacheUtils

Example:
@ControllerConfiguration(triggerReconcilerOnAllEvents = true)
public class MyReconciler implements Reconciler<MyResource> {

    @Override
    public UpdateControl<MyResource> reconcile(MyResource resource, Context<MyResource> context) {
        if (context.isPrimaryResourceDeleted()) {
            // Handle deletion
            cleanupCache(resource);
            return UpdateControl.noUpdate();
        }
        // Normal reconciliation
        return UpdateControl.patchStatus(resource);
    }
}
See the detailed documentation and integration test.
The framework now provides built-in support for the expectations pattern, a common Kubernetes controller design pattern that ensures secondary resources are in an expected state before proceeding.
The expectation pattern helps avoid race conditions and ensures your controller makes decisions based on the most current
state of your resources. The implementation is available in the
io.javaoperatorsdk.operator.processing.expectation
package.
Example usage:
public class MyReconciler implements Reconciler<MyResource> {
private final ExpectationManager<MyResource> expectationManager = new ExpectationManager<>();
@Override
public UpdateControl<MyResource> reconcile(MyResource primary, Context<MyResource> context) {
// Exit early if expectation is not yet fulfilled or timed out
if (expectationManager.ongoingExpectationPresent(primary, context)) {
return UpdateControl.noUpdate();
}
var deployment = context.getSecondaryResource(Deployment.class);
if (deployment.isEmpty()) {
createDeployment(primary, context);
expectationManager.setExpectation(
primary, Duration.ofSeconds(30), deploymentReadyExpectation());
return UpdateControl.noUpdate();
}
// Check if expectation is fulfilled
var result = expectationManager.checkExpectation("deploymentReady", primary, context);
if (result.isFulfilled()) {
return updateStatusReady(primary);
} else if (result.isTimedOut()) {
return updateStatusTimeout(primary);
}
return UpdateControl.noUpdate();
}
}
This feature is marked as @Experimental as we gather feedback and may refine the API based on user experience. Future
versions may integrate this pattern directly into Dependent Resources and Workflows.
See the documentation and integration test.
You can now use field selectors when configuring InformerEventSource, allowing you to filter resources at the server
side before they’re cached locally. This reduces memory usage and network traffic by only watching resources that match
your criteria.
Field selectors work similarly to label selectors but filter on resource fields like metadata.name or status.phase:
@Informer(
fieldSelector = @FieldSelector(
fields = @Field(key = "status.phase", value = "Running")
)
)
This is particularly useful when your controller only cares about a subset of resources, for example those in a specific phase or with a specific name.
See the integration test for examples.
The new AggregatedMetrics class implements the composite pattern, allowing you to combine multiple metrics
implementations. This is useful when you need to send metrics to different monitoring systems simultaneously.
// Create individual metrics instances
Metrics micrometerMetrics = MicrometerMetrics.withoutPerResourceMetrics(registry);
Metrics customMetrics = new MyCustomMetrics();
Metrics loggingMetrics = new LoggingMetrics();
// Combine them into a single aggregated instance
Metrics aggregatedMetrics = new AggregatedMetrics(List.of(
micrometerMetrics,
customMetrics,
loggingMetrics
));
// Use with your operator
Operator operator = new Operator(client, o -> o.withMetrics(aggregatedMetrics));
This enables hybrid monitoring strategies, such as sending metrics to both Prometheus and a custom logging system.
See the observability documentation for more details.
- GenericRetry no longer provides a mutable singleton instance, improving thread safety
- Updated to Fabric8 Kubernetes Client 7.4.0, bringing the latest features and bug fixes from the client library
Starting with this release, new features marked as experimental will be annotated with @Experimental. This annotation
indicates that while we intend to support the feature, the API may evolve based on user feedback.
For most users, upgrading to 5.2 should be straightforward. The main breaking change involves the introduction of
ResourceIDMapper for external resources. If you’re using external dependent resources or bulk dependents with custom
ID types, please refer to the migration guide.
Update your dependency to version 5.2.0:
<dependency>
<groupId>io.javaoperatorsdk</groupId>
<artifactId>operator-framework</artifactId>
<version>5.2.0</version>
</dependency>
You can see all changes in the comparison view.
As always, we welcome your feedback! Please report issues or suggest improvements on our GitHub repository.
Happy operator building! 🚀
We are excited to announce that Java Operator SDK v5 has been released. This significant effort contains various features and enhancements accumulated since the last major release and required changes in our APIs. Within this post, we will go through all the main changes and help you upgrade to this new version, and provide a rationale behind the changes if necessary.
We will omit descriptions of changes that should only require simple code updates; please do contact us if you encounter issues nonetheless.
You can see an introduction and some important changes and rationale behind them from KubeCon.
You can see all changes here.
Server Side Apply is now a first-class citizen in
the framework and
the default approach for patching the status resource. This means that patching a resource or its status through
UpdateControl and adding
the finalizer in the background will both use SSA.
Migration from non-SSA-based patching to SSA-based patching can be problematic. Make sure you test the transition when you migrate from an older version of the framework.
To continue using the non-SSA-based approach, set ConfigurationService.useSSAToPatchPrimaryResource to false.
See some identified problematic migration cases and how to handle them in StatusPatchSSAMigrationIT.
For a more detailed description, see our blog post on SSA.
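As a sketch, opting out globally could look like the following when building the operator. The withUseSSAToPatchPrimaryResource overrider method name is an assumption derived from the ConfigurationService property mentioned above:

```java
// Sketch: disable SSA-based patching of the primary resource globally.
// The overrider method name is assumed from the ConfigurationService property.
Operator operator = new Operator(
    override -> override.withUseSSAToPatchPrimaryResource(false));
```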
InformerEventSource now supports watching remote clusters. You can simply pass a KubernetesClient instance
initialized to connect to a different cluster from the one where the controller runs when configuring your event source.
See InformerEventSourceConfiguration.withKubernetesClient
Such an informer behaves exactly as a regular one. Owner references won’t work in this situation, though, so you have to
specify a SecondaryToPrimaryMapper (probably based on labels or annotations).
See related integration test here
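A sketch of wiring this up; the remoteKubeconfigContent variable, the annotation keys, and the primary/secondary class names are illustrative assumptions:

```java
// Sketch: watch ConfigMaps in a remote cluster with a dedicated client.
KubernetesClient remoteClient = new KubernetesClientBuilder()
    .withConfig(Config.fromKubeconfig(remoteKubeconfigContent))
    .build();

InformerEventSource<ConfigMap, MyResource> remoteConfigMaps =
    new InformerEventSource<>(
        InformerEventSourceConfiguration.from(ConfigMap.class, MyResource.class)
            .withKubernetesClient(remoteClient)
            // owner references do not work across clusters, so map the
            // secondary back to its primary via annotations (assumed keys)
            .withSecondaryToPrimaryMapper(configMap -> Set.of(new ResourceID(
                configMap.getMetadata().getAnnotations().get("example.com/owner-name"),
                configMap.getMetadata().getAnnotations().get("example.com/owner-namespace"))))
            .build(),
        context);
```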
The owner-reference-based mappers now check the type (kind and apiVersion) of the resource when resolving the mapping. This is important since a resource may have owner references to resources of different types that share the same name.
See implementation details here
There are multiple smaller changes to InformerEventSource and related classes:
- InformerConfiguration is renamed to InformerEventSourceConfiguration
- InformerEventSourceConfiguration no longer requires an EventSourceContext to be initialized

The EventSource abstraction is now always aware of the resources it manages and handles accessing the (cached) resources, filtering, and additional capabilities. Before v5, such capabilities were present only in a subclass called ResourceEventSource, but we decided to merge the two and remove ResourceEventSource, since this has a nice impact on other parts of the system in terms of architecture.
If you still need to create an EventSource that only supports triggering of your reconciler,
see TimerEventSource
for an example of how this can be accomplished.
EventSources are now named. This removes the ambiguity that could arise when trying to refer to a particular EventSource.
You no longer have to annotate the reconciler with the @ControllerConfiguration annotation.
This annotation is (one) way to override the default properties of a controller.
If the annotation is not present, the defaults of the annotation's attributes are used.
PR: https://github.com/operator-framework/java-operator-sdk/pull/2203
In addition to that, the informer-related configurations are now extracted into
a separate @Informer
annotation within @ControllerConfiguration.
Hopefully this makes it explicit which part of the configuration affects the informer associated with the primary resource.
Similarly, the same @Informer annotation is used when configuring the informer associated with a managed
KubernetesDependentResource via the
KubernetesDependent
annotation.
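For example, a label selector now lives in the nested @Informer annotation in both places; the class names and selector below are illustrative:

```java
// Sketch: informer configuration for the primary resource...
@ControllerConfiguration(informer = @Informer(labelSelector = "app=webapp"))
public class WebAppReconciler implements Reconciler<WebApp> { /* ... */ }

// ...and for a managed Kubernetes dependent resource
@KubernetesDependent(informer = @Informer(labelSelector = "app=webapp"))
class DeploymentDependent
    extends CRUDKubernetesDependentResource<Deployment, WebApp> { /* ... */ }
```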
Both the EventSourceInitializer and ErrorStatusHandler interfaces are removed, and their methods moved directly
under Reconciler.
If possible, we try to avoid such marker interfaces since it is hard to deduce related usage just by looking at the
source code.
You can now simply override those methods when implementing the Reconciler interface.
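A sketch of how this looks in practice, registering an informer for Deployments directly from the Reconciler (class names are illustrative, and the exact generic signature may differ slightly):

```java
// Sketch: event sources are now prepared by overriding a Reconciler method
// instead of implementing the removed EventSourceInitializer interface.
public class WebAppReconciler implements Reconciler<WebApp> {

  @Override
  public List<EventSource<?, WebApp>> prepareEventSources(EventSourceContext<WebApp> context) {
    return List.of(new InformerEventSource<>(
        InformerEventSourceConfiguration.from(Deployment.class, WebApp.class).build(),
        context));
  }

  @Override
  public UpdateControl<WebApp> reconcile(WebApp resource, Context<WebApp> context) {
    // ... reconciliation logic ...
    return UpdateControl.noUpdate();
  }
}
```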
When accessing secondary resources using Context.getSecondaryResource(s)(...), the resources are no longer cloned by default, since cloning could have an impact on performance. This means that any changes you make are now applied directly to the underlying cached resource. This should be avoided, since the same resource instance may be used in other reconciliation cycles and would no longer represent the state on the server.
If you want to still clone resources by default,
set ConfigurationService.cloneSecondaryResourcesWhenGettingFromCache
to true.
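The hazard, and the defensive-copy discipline, can be shown with a plain-Java sketch (the class below is a stand-in for a cached Fabric8 resource, not an actual type):

```java
import java.util.HashMap;
import java.util.Map;

public class CachedResourceCopySketch {
  // Stand-in for a cached Kubernetes resource
  static class CachedConfigMap {
    final Map<String, String> data;
    CachedConfigMap(Map<String, String> data) { this.data = new HashMap<>(data); }
    // copy constructor: the equivalent of cloning or using a Fabric8 builder
    CachedConfigMap(CachedConfigMap other) { this(other.data); }
  }

  public static void main(String[] args) {
    var cached = new CachedConfigMap(Map.of("color", "blue"));

    // WRONG: mutating `cached` directly would leak into later reconciliations
    // RIGHT: mutate a copy and leave the cached instance untouched
    var copy = new CachedConfigMap(cached);
    copy.data.put("color", "red");

    if (!cached.data.get("color").equals("blue")) throw new AssertionError();
    if (!copy.data.get("color").equals("red")) throw new AssertionError();
    System.out.println("cache untouched: " + cached.data);
  }
}
```

For real Fabric8 resources the same effect is achieved with the generated builders (e.g., constructing a new object from the cached one) before applying modifications.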
The automatic observed generation handling feature was removed. It is easy to implement inside the reconciler, yet it made the framework implementation much more complex, especially since the framework would have had to support it for both server-side apply and client-side apply.
You can check a sample implementation of how to do it manually in this integration test.
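The manual pattern boils down to comparing metadata.generation with a field recorded in the status. A self-contained sketch with stand-in classes (not the actual Kubernetes types):

```java
public class ObservedGenerationSketch {
  // Stand-in for the relevant metadata/status fields of a custom resource
  static class Resource {
    long generation;         // bumped by the API server on spec changes
    Long observedGeneration; // recorded in status by the reconciler
  }

  // Skip work when the current spec generation was already processed
  static boolean needsReconciliation(Resource r) {
    return r.observedGeneration == null || r.observedGeneration != r.generation;
  }

  static void reconcile(Resource r) {
    if (!needsReconciliation(r)) return;
    // ... actual reconciliation logic ...
    r.observedGeneration = r.generation; // record in the status patch
  }

  public static void main(String[] args) {
    var r = new Resource();
    r.generation = 2;
    reconcile(r);
    if (r.observedGeneration != 2) throw new AssertionError();
    if (needsReconciliation(r)) throw new AssertionError();
    r.generation = 3; // spec changed again
    if (!needsReconciliation(r)) throw new AssertionError();
    System.out.println("observedGeneration=" + r.observedGeneration);
  }
}
```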
The primary reason ResourceDiscriminator was introduced was to cover the case when there are
more than one dependent resources of a given type associated with a given primary resource. In this situation, JOSDK
needed a generic mechanism to
identify which resources on the cluster should be associated with which dependent resource implementation.
We improved this association mechanism, thus rendering ResourceDiscriminator obsolete.
As a replacement, the dependent resource will select the target resource based on the desired state.
See the generic implementation in AbstractDependentResource.
Calculating the desired state can be costly and might depend on other resources. For KubernetesDependentResource it is usually enough to provide the name and namespace (if namespace-scoped) of the target resource, which is what the KubernetesDependentResource implementation does by default. If you can determine the ResourceID of the target secondary resource without computing the desired state, we encourage you to override the targetSecondaryResourceID() method, as shown in this example
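A sketch of such an override; the dependent and primary class names, and the derived ConfigMap name, are illustrative assumptions:

```java
// Sketch: locate the target ConfigMap from the primary resource alone,
// without computing the (potentially costly) desired state.
public class ConfigMapDependent
    extends CRUDKubernetesDependentResource<ConfigMap, WebPage> {

  @Override
  protected ResourceID targetSecondaryResourceID(WebPage primary, Context<WebPage> context) {
    // name and namespace are derivable directly from the primary
    return new ResourceID(
        primary.getMetadata().getName() + "-config",
        primary.getMetadata().getNamespace());
  }
}
```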
Read-only bulk dependent resources are now supported; this was a request from multiple users, but it required changes to the underlying APIs. Please check the documentation for further details.
See also the related integration test.
Until now, activation conditions had a limitation: only one condition was allowed per resource type. For example, it was not possible to have two ConfigMap dependent resources that both had activation conditions. The underlying issue was the informer registration process: when an activation condition is evaluated as “met”, the informer for the target resource type is registered dynamically in the background. However, we need to avoid registering multiple informers of the same kind; to prevent this, the dependent resource must specify the name of the informer to use.
See the complete example here.
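A sketch of two ConfigMap dependents that share one named event source; the class names below are illustrative assumptions, while useEventSourceWithName is the attribute referred to above:

```java
// Sketch: both ConfigMap dependents are gated by activation conditions and
// point at the same named event source, so only one informer is registered.
@Workflow(dependents = {
    @Dependent(type = ConfigMapDependentA.class,
        activationCondition = FeatureAEnabledCondition.class,
        useEventSourceWithName = "configMapInformer"),
    @Dependent(type = ConfigMapDependentB.class,
        activationCondition = FeatureBEnabledCondition.class,
        useEventSourceWithName = "configMapInformer")})
@ControllerConfiguration
public class MyReconciler implements Reconciler<MyResource> { /* ... */ }
```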
getSecondaryResource is activation condition aware
When an activation condition for a resource type is not met, no associated informer might be registered for that resource type. In this situation, however, calling Context.getSecondaryResource and its alternatives would previously throw an exception. This was rather confusing, and a better user experience is to return an empty value instead of throwing an error. We changed this behavior in v5 to be more user-friendly: attempting to retrieve a secondary resource that is gated by an unmet activation condition now returns an empty value, as if the associated informer existed.
See related issue for details.
@Workflow annotation
The managed workflow definition is now a separate @Workflow annotation; it is no longer part of @ControllerConfiguration.
See sample usage here
Before v5, the managed dependents part of a workflow would always be reconciled before the primary Reconciler's reconcile or cleanup methods were called. It is now possible to explicitly ask for a workflow reconciliation in your primary Reconciler, thus allowing you to control when the workflow is reconciled. This means you can perform all kinds of operations, typically validations, before executing the workflow, as shown in the sample below:
@Workflow(explicitInvocation = true,
dependents = @Dependent(type = ConfigMapDependent.class))
@ControllerConfiguration
public class WorkflowExplicitCleanupReconciler
implements Reconciler<WorkflowExplicitCleanupCustomResource>,
Cleaner<WorkflowExplicitCleanupCustomResource> {
@Override
public UpdateControl<WorkflowExplicitCleanupCustomResource> reconcile(
WorkflowExplicitCleanupCustomResource resource,
Context<WorkflowExplicitCleanupCustomResource> context) {
context.managedWorkflowAndDependentResourceContext().reconcileManagedWorkflow();
return UpdateControl.noUpdate();
}
@Override
public DeleteControl cleanup(WorkflowExplicitCleanupCustomResource resource,
Context<WorkflowExplicitCleanupCustomResource> context) {
context.managedWorkflowAndDependentResourceContext().cleanupManageWorkflow();
// the cleanup result can be inspected via:
// context.managedWorkflowAndDependentResourceContext().getWorkflowCleanupResult()
return DeleteControl.defaultDelete();
}
}
To turn on this mode of execution, set the explicitInvocation flag to true in the managed workflow definition.
See the following integration tests
for invocation
and cleanup.
If an exception happens during a workflow reconciliation, the framework automatically rethrows it.
You can now set handleExceptionsInReconciler
to true for a workflow and check the thrown exceptions explicitly
in the execution results.
@Workflow(handleExceptionsInReconciler = true,
dependents = @Dependent(type = ConfigMapDependent.class))
@ControllerConfiguration
public class HandleWorkflowExceptionsInReconcilerReconciler
implements Reconciler<HandleWorkflowExceptionsInReconcilerCustomResource>,
Cleaner<HandleWorkflowExceptionsInReconcilerCustomResource> {
private volatile boolean errorsFoundInReconcilerResult = false;
private volatile boolean errorsFoundInCleanupResult = false;
@Override
public UpdateControl<HandleWorkflowExceptionsInReconcilerCustomResource> reconcile(
HandleWorkflowExceptionsInReconcilerCustomResource resource,
Context<HandleWorkflowExceptionsInReconcilerCustomResource> context) {
errorsFoundInReconcilerResult = context.managedWorkflowAndDependentResourceContext()
.getWorkflowReconcileResult().erroredDependentsExist();
// check errors here:
Map<DependentResource, Exception> errors = context.getErroredDependents();
return UpdateControl.noUpdate();
}
}
See integration test here.
Activation conditions are typically used to check if the cluster has specific capabilities (e.g., is cert-manager
available).
Such a check can be done by verifying whether a particular custom resource definition (CRD) is present on the cluster. You can now use the generic CRDPresentActivationCondition for this purpose; it checks whether the CRD of the dependent resource's target resource type exists on the cluster.
See usage in integration test here.
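Usage is a one-liner on the dependent definition; the Certificate dependent and reconciler class names are illustrative assumptions:

```java
// Sketch: only activate the cert-manager Certificate dependent when the
// Certificate CRD is installed on the cluster.
@Workflow(dependents = @Dependent(
    type = CertificateDependentResource.class,
    activationCondition = CRDPresentActivationCondition.class))
@ControllerConfiguration
public class WebPageReconciler implements Reconciler<WebPage> { /* ... */ }
```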
The Fabric8 client has been updated to version 7.0.0. This is a new major version which implies that some API might have changed. Please take a look at the Fabric8 client 7.0.0 migration guide.
Starting with v5.0 (in accordance with changes made to the Fabric8 client in version 7.0.0), the CRD generator will use the maven plugin instead of the annotation processor as was previously the case. In many instances, you can simply configure the plugin by adding the following stanza to your project’s POM build configuration:
<plugin>
<groupId>io.fabric8</groupId>
<artifactId>crd-generator-maven-plugin</artifactId>
<version>${fabric8-client.version}</version>
<executions>
<execution>
<goals>
<goal>generate</goal>
</goals>
</execution>
</executions>
</plugin>
NOTE: If you use the SDK’s JUnit extension for your tests, you might also need to configure the CRD generator plugin to access your test CustomResource implementations as follows:
<plugin>
<groupId>io.fabric8</groupId>
<artifactId>crd-generator-maven-plugin</artifactId>
<version>${fabric8-client.version}</version>
<executions>
<execution>
<goals>
<goal>generate</goal>
</goals>
<phase>process-test-classes</phase>
<configuration>
<classesToScan>${project.build.testOutputDirectory}</classesToScan>
<classpath>WITH_ALL_DEPENDENCIES_AND_TESTS</classpath>
</configuration>
</execution>
</executions>
</plugin>
Please refer to the CRD generator documentation for more details.
You can now check whether a subsequent reconciliation will happen right after the current one because the SDK has already received an event that will trigger a new reconciliation. This information is available from the Context.
This could be useful, for example, when a heavy task would otherwise be repeated in the follow-up reconciliation: in the current reconciliation, you can check this flag and return early to avoid unneeded processing. Note that this is a semi-experimental feature, so please let us know if you find it helpful.
@Override
public UpdateControl<MyCustomResource> reconcile(MyCustomResource resource, Context<MyCustomResource> context) {
    if (context.isNextReconciliationImminent()) {
        // a follow-up reconciliation is already queued; skip the heavy work
        return UpdateControl.noUpdate();
    }
    // ... heavy processing ...
    return UpdateControl.patchStatus(resource);
}
See related integration test.
See release notes here.