Reading APIM v2 Gateway Logs: KQL Recipes for the Three Things That Actually Break
An APIM gateway returns 500 to a client, or 401, or 503. The response body is either empty or a generic one-liner. The client logs tell you nothing useful. The backend logs (if you can reach them) show either no request received or a different error than what the client saw. You are in the gap between what the gateway did and what the client perceived.
That gap is filled by the gateway logs, in Log Analytics, reachable by KQL. If you have diagnostic settings enabled on your APIM instance routing GatewayLogs to a workspace, every request the gateway handled is already there with enough context to answer “what happened.” This post is the three queries I reach for most often, because they cover the failure modes that actually show up in production.
Before the recipes: what gateway logs look like
APIM v2 routes diagnostics through the standard AzureDiagnostics table. The category to filter on is GatewayLogs. A useful base query:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| project TimeGenerated, ApiId_s, OperationName_s, Method_s, Url_s,
ResponseCode_d, BackendResponseCode_d, ErrorReason_s,
LastErrorMessage_s, LastErrorSource_s, CorrelationId_g
| order by TimeGenerated desc
Every row is one request the gateway handled. ResponseCode_d is what the client saw. BackendResponseCode_d is what the backend returned to the gateway (blank if the request never got to a backend). ErrorReason_s and LastErrorMessage_s are where the useful string lives when something went wrong. LastErrorSource_s tells you whether the gateway, the backend, or a policy produced the error.
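When a client can hand you a specific failing request, CorrelationId_g is the fastest way in. A minimal lookup sketch, assuming you have the correlation id for the request (the guid below is a placeholder to substitute):
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where CorrelationId_g == guid(00000000-0000-0000-0000-000000000000) // substitute
| project TimeGenerated, ResponseCode_d, BackendResponseCode_d,
          ErrorReason_s, LastErrorSource_s, LastErrorMessage_s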
Three signatures to recognise.
Recipe 1: Backend TLS rejection
The client gets a 500. The gateway log shows ResponseCode_d == 500, BackendResponseCode_d is empty, and LastErrorMessage_s contains the string:
The remote certificate was rejected by the provided RemoteCertificateValidationCallback
What the gateway is telling you: the backend did present a certificate, the gateway received it, and the gateway rejected it during its TLS validation callback. This happens when the backend certificate is signed by a CA that the gateway’s trust store does not include, most commonly an internal or corporate CA.
On classic APIM, you solved this by uploading the CA to the service-level global certificate store. On v2, that store is not consulted for backend validation. The replacement is serverX509Names pinning on each backend entity. The backend CA trust post walks through the full fix.
The query to find all backends currently hitting this:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where LastErrorMessage_s contains "RemoteCertificateValidationCallback"
| summarize
failures = count(),
first_seen = min(TimeGenerated),
last_seen = max(TimeGenerated),
sample_url = any(Url_s)
by ApiId_s, BackendUrl_s
| order by failures desc
Run this during an incident and you get a list of affected backends ranked by blast radius. Run it periodically and you catch silent certificate-rotation drift before someone pages you.
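If you want the periodic version as a scheduled alert rather than an ad-hoc query, a sketch that only returns rows once a backend crosses a failure threshold in the last hour (the threshold of 10 is an arbitrary starting point, not a recommendation; tune it to your traffic):
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where TimeGenerated > ago(1h)
| where LastErrorMessage_s contains "RemoteCertificateValidationCallback"
| summarize failures = count() by ApiId_s, BackendUrl_s
| where failures > 10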
Two secondary signatures in the same family are worth recognising because they look similar but mean different things:
- "A connection attempt failed because the connected party did not properly respond" is a TCP-level timeout, not a TLS issue. The backend is unreachable or slow to accept the connection.
- "The handshake failed due to an unexpected packet format" means the TLS handshake started but broke at the protocol level, often because the backend is not actually serving TLS on the port you configured, or because an intermediate device is terminating the connection.
Filter those out explicitly if you are triaging specifically for CA trust problems.
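A sketch of that wider triage: start from 500s that never reached a backend, exclude the two signatures above, and see what remains. The message fragments and the assumption that a blank backend response code shows up as null are both things to verify against your own workspace:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where ResponseCode_d == 500 and isnull(BackendResponseCode_d)
| where LastErrorMessage_s !contains "connection attempt failed"
| where LastErrorMessage_s !contains "unexpected packet format"
| summarize count() by LastErrorMessage_s
| order by count_ desc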
Recipe 2: Subscription-key-present-but-still-401
The client sent Ocp-Apim-Subscription-Key on the request. The gateway returned 401 anyway. The client’s first instinct is “the key is wrong.” The key is usually fine; the request is being rejected for reasons that look like an auth failure from the client’s side but are actually policy or subscription scoping issues.
In practice, APIM evaluates subscription keys against the specific product or API the request is targeting. If a client holds a subscription key for Product A and hits an API that only belongs to Product B, the key is technically valid but not authorised for this endpoint. The response is still 401, but LastErrorMessage_s in the gateway log disambiguates the reason.
The query to break down 401s by reason:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where ResponseCode_d == 401
| summarize count() by ErrorReason_s, LastErrorMessage_s
| order by count_ desc
The distinct values of LastErrorMessage_s on 401 rows are the useful signal. Common ones and what each means:
- "Access denied due to missing subscription key" means the client sent no key at all. Either a client bug or a misrouted request.
- "Access denied due to invalid subscription key" means the key was sent but does not match any active subscription on this product. Usually a wrong-environment mistake (a prod key against a dev gateway, or vice versa) or a revoked subscription.
- "Access denied. Reason: Subscription not found" means the key matches a record but the subscription is tied to a product the requested API is not part of. This is the silent product-scoping miss.
- "Unable to parse and validate JWT token" is not strictly a subscription issue, but it often coexists on APIs that require both subscription keys and OAuth/JWT. The validate-jwt policy ran before anything else and failed before the subscription check even got a chance.
Policy ordering matters for the last one. If <validate-jwt> runs before <check-header> or the implicit subscription-key check in the inbound policy, a missing or invalid JWT produces a 401 that looks like a subscription problem from the gateway log unless you read LastErrorMessage_s carefully.
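To make that distinction hard to miss during triage, a sketch that buckets 401s into coarse classes with case(). The match strings are assumptions based on the messages above; adjust them to what your workspace actually contains:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where ResponseCode_d == 401
| extend reason_class = case(
    LastErrorMessage_s has "JWT", "jwt-validation",
    LastErrorMessage_s has "missing subscription key", "key-missing",
    LastErrorMessage_s has "invalid subscription key", "key-invalid",
    LastErrorMessage_s has "Subscription not found", "product-scoping",
    "other")
| summarize count() by reason_class
| order by count_ desc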
To catch the product-scoping case specifically:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where ResponseCode_d == 401
| where LastErrorMessage_s has "Subscription not found" or LastErrorMessage_s has "invalid subscription key"
| project TimeGenerated, CallerIpAddress_s, ApiId_s, OperationName_s, LastErrorMessage_s, SubscriptionId_g
| order by TimeGenerated desc
| take 50
SubscriptionId_g on those rows is the subscription entity on APIM, not the Azure subscription. It tells you which APIM subscription the caller’s key mapped to; if that ID corresponds to a product that does not include the API being called, you have found the misconfiguration.
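To see which subscription entities generate the most scoping misses, and therefore which client to chase first, a small aggregation on the same filter (a sketch, with the same caveats about message text as above):
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where ResponseCode_d == 401
| where LastErrorMessage_s has "Subscription not found"
| summarize misses = count(), apis = make_set(ApiId_s) by SubscriptionId_g
| order by misses desc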
Recipe 3: Circuit breaker trips
Clients suddenly start getting 503 responses. Backend is healthy (you can call it directly, it responds). BackendResponseCode_d on the failed requests is empty, meaning the gateway never sent the request to the backend. The gateway short-circuited.
You configured a circuit breaker rule on the backend entity, it tripped based on prior failures, and it is now rejecting requests for the configured trip duration. This is by design: the circuit breaker exists to stop hammering a backend that is clearly unhealthy. But if the breaker trips unexpectedly, or stays tripped longer than you meant it to, you need to know fast.
Query to find tripped breakers in the recent window:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where ResponseCode_d == 503
| where LastErrorMessage_s has "CircuitBreaker" or ErrorReason_s has "CircuitBreaker"
| summarize
trips = count(),
first_trip = min(TimeGenerated),
last_trip = max(TimeGenerated),
sample = any(LastErrorMessage_s)
by ApiId_s, BackendUrl_s
| order by last_trip desc
The sample column shows one message picked from the matched rows (any() returns an arbitrary match, not necessarily the first), which is usually enough to tell you which rule tripped.
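If you run multiple breaker rules on the same backend and want every distinct message rather than a single sample, swap any() for make_set() — a sketch:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where ResponseCode_d == 503
| where LastErrorMessage_s has "CircuitBreaker" or ErrorReason_s has "CircuitBreaker"
| summarize trips = count(), messages = make_set(LastErrorMessage_s) by ApiId_s, BackendUrl_s
| order by trips desc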
To correlate a trip with what happened before it:
let trip_time = datetime(2026-04-20T14:32:00Z); // substitute
let backend = "my-backend";
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where BackendUrl_s endswith backend
| where TimeGenerated between (trip_time - 10m .. trip_time + 1m)
| summarize count() by bin(TimeGenerated, 30s), ResponseCode_d, BackendResponseCode_d
| order by TimeGenerated asc
That gives you the response-code distribution in the 10 minutes leading up to the trip, broken down by both gateway response and backend response. If you see a cluster of 500s coming back from the backend, the breaker did its job. If you see nothing abnormal and it still tripped, the breaker threshold is probably too tight.
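To put a number on "nothing abnormal", the same window can be expressed as a backend failure percentage per bin, which you can compare directly against the breaker rule's threshold. A sketch using the same placeholders as above:
let trip_time = datetime(2026-04-20T14:32:00Z); // substitute
let backend = "my-backend";
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where BackendUrl_s endswith backend
| where TimeGenerated between (trip_time - 10m .. trip_time)
| summarize total = count(), backend_5xx = countif(BackendResponseCode_d >= 500) by bin(TimeGenerated, 30s)
| extend failure_pct = round(100.0 * backend_5xx / total, 1)
| order by TimeGenerated asc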
A fourth recipe that is really just a triage query
When you get paged with “something is broken” and you do not yet know which of the above it is, start here:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where TimeGenerated > ago(15m)
| where ResponseCode_d >= 400
| summarize count() by ResponseCode_d, ErrorReason_s, LastErrorSource_s
| order by count_ desc
The distribution of (ResponseCode_d, ErrorReason_s, LastErrorSource_s) in the last 15 minutes tells you which of the three recipes applies before you drill into any of them. If most failures are ResponseCode_d=500, LastErrorSource_s=backend, you are in recipe 1 or a similar backend issue. If they are ResponseCode_d=401, LastErrorSource_s=configuration, recipe 2. If ResponseCode_d=503, LastErrorSource_s=backend with empty backend response codes, recipe 3.
This is the KQL equivalent of looking at the stack trace before reading the code. It keeps you from investigating the wrong failure mode for ten minutes.
Practical notes on running these queries
Diagnostic settings must be enabled on the APIM instance with GatewayLogs category routed to the workspace you are querying. Verify with az monitor diagnostic-settings list --resource <apim resource id>. A common cause of “my query returns nothing” is that logs were never being sent to the workspace you think they were.
Log ingestion has latency. Budget 2 to 10 minutes between an event happening and it appearing in AzureDiagnostics. During an incident this matters: the first query will often be empty not because nothing is broken but because the logs have not landed yet.
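You can measure the actual lag in your workspace rather than guessing, using ingestion_time() — a quick sketch:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where Category == "GatewayLogs"
| where TimeGenerated > ago(1h)
| extend lag = ingestion_time() - TimeGenerated
| summarize p50 = percentile(lag, 50), p95 = percentile(lag, 95)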
Column names use the KQL _s/_d/_g suffix convention for string, double, and guid. That is noise when you are writing queries but it is how the ingestion maps schemaless JSON fields to typed columns. Tab-completion in the portal query editor will show you the right names.
Save the queries. The portal query editor has a workspace-scoped query library; the first two of these live there for us along with workbook-wrapped versions for dashboards. Incident response is faster when the query is one click away instead of one retype away.
Azure docs: Monitor published APIs · APIM diagnostic logs reference · Circuit breaker on backends