Observability

Runtime Info

RuntimeInfo is used mainly to check the actual health of event sources. Based on this information it is easy to implement custom liveness probes.

stopOnInformerErrorDuringStartup setting, where this flag usually needs to be set to false, in order to control the exact liveness properties.

See also an example implementation in the WebPage sample

Contextual Info for Logging with MDC

Logging is enhanced with additional contextual information using MDC. The following attributes are available in most parts of reconciliation logic and during the execution of the controller:

MDC Key	Value added from primary resource
`resource.apiVersion`	`.apiVersion`
`resource.kind`	`.kind`
`resource.name`	`.metadata.name`
`resource.namespace`	`.metadata.namespace`
`resource.resourceVersion`	`.metadata.resourceVersion`
`resource.generation`	`.metadata.generation`
`resource.uid`	`.metadata.uid`

For more information about MDC see this link.

Metrics

JOSDK provides built-in support for metrics reporting on what is happening with your reconcilers in the form of the Metrics interface which can be implemented to connect to your metrics provider of choice, JOSDK calling the methods as it goes about reconciling resources. By default, a no-operation implementation is provided thus providing a no-cost sane default. A micrometer-based implementation is also provided.

You can use a different implementation by overriding the default one provided by the default ConfigurationService, as follows:

Metrics metrics; // initialize your metrics implementation
Operator operator = new Operator(client, o -> o.withMetrics(metrics));

Micrometer implementation

The micrometer implementation is typically created using one of the provided factory methods which, depending on which is used, will return either a ready to use instance or a builder allowing users to customized how the implementation behaves, in particular when it comes to the granularity of collected metrics. It is, for example, possible to collect metrics on a per-resource basis via tags that are associated with meters. This is the default, historical behavior but this will change in a future version of JOSDK because this dramatically increases the cardinality of metrics, which could lead to performance issues.

To create a MicrometerMetrics implementation that behaves how it has historically behaved, you can just create an instance via:

MeterRegistry registry; // initialize your registry implementation
Metrics metrics = new MicrometerMetrics(registry);

Note, however, that this constructor is deprecated and we encourage you to use the factory methods instead, which either return a fully pre-configured instance or a builder object that will allow you to configure more easily how the instance will behave. You can, for example, configure whether or not the implementation should collect metrics on a per-resource basis, whether or not associated meters should be removed when a resource is deleted and how the clean-up is performed. See the relevant classes documentation for more details.

For example, the following will create a MicrometerMetrics instance configured to collect metrics on a per-resource basis, deleting the associated meters after 5 seconds when a resource is deleted, using up to 2 threads to do so.

MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
        .withCleanUpDelayInSeconds(5)
        .withCleaningThreadNumber(2)
        .build();

Operator SDK metrics

The micrometer implementation records the following metrics:

Meter name	Type	Tag names	Description
operator.sdk.reconciliations.executions.`<reconciler name>`	gauge	group, version, kind	Number of executions of the named reconciler
operator.sdk.reconciliations.queue.size.`<reconciler name>`	gauge	group, version, kind	How many resources are queued to get reconciled by named reconciler
operator.sdk.`<map name>`.size	gauge map size		Gauge tracking the size of a specified map (currently unused but could be used to monitor caches size)
operator.sdk.events.received	counter	`<resource metadata>`, event, action	Number of received Kubernetes events
operator.sdk.events.delete	counter	`<resource metadata>`	Number of received Kubernetes delete events
operator.sdk.reconciliations.started	counter	`<resource metadata>`, reconciliations.retries.last, reconciliations.retries.number	Number of started reconciliations per resource type
operator.sdk.reconciliations.failed	counter	`<resource metadata>`, exception	Number of failed reconciliations per resource type
operator.sdk.reconciliations.success	counter	`<resource metadata>`	Number of successful reconciliations per resource type
operator.sdk.controllers.execution.reconcile	timer	`<resource metadata>`, controller	Time taken for reconciliations per controller
operator.sdk.controllers.execution.cleanup	timer	`<resource metadata>`, controller	Time taken for cleanups per controller
operator.sdk.controllers.execution.reconcile.success	counter	controller, type	Number of successful reconciliations per controller
operator.sdk.controllers.execution.reconcile.failure	counter	controller, exception	Number of failed reconciliations per controller
operator.sdk.controllers.execution.cleanup.success	counter	controller, type	Number of successful cleanups per controller
operator.sdk.controllers.execution.cleanup.failure	counter	controller, exception	Number of failed cleanups per controller

As you can see all the recorded metrics start with the operator.sdk prefix. <resource metadata>, in the table above, refers to resource-specific metadata and depends on the considered metric and how the implementation is configured and could be summed up as follows: group?, version, kind, [name, namespace?], scope where the tags in square brackets ([]) won’t be present when per-resource collection is disabled and tags followed by a question mark are omitted if the associated value is empty. Of note, when in the context of controllers’ execution metrics, these tag names are prefixed with resource.. This prefix might be removed in a future version for greater consistency.

Feedback

Was this page helpful?

Glad to hear it! Please tell us how we can improve.

Sorry to hear that. Please tell us how we can improve.

Last modified March 21, 2025: docs: new structure of docs (#2737) (cc3d0cc1)