This guide explains how CPUs, threads, NUMA, and SQL Server scheduling work; what the defaults are; which knobs you can turn; and what to measure when you do. It includes lightweight ASCII diagrams for NUMA and affinity masks, plus ready-to-run T-SQL.
1) What is a CPU?
A CPU executes instructions, performs arithmetic/logic, and orchestrates memory access. Modern servers ship with multiple cores; a core runs one thread at a time (with Hyper-Threading, typically two logical threads share resources). Performance depends both on compute (CPU-bound work) and I/O latency (waiting on disk/network).
- Core: physical execution unit.
- Logical processor: virtual execution unit via Hyper-Threading.
- Cache hierarchy: L1/L2/L3 caches keep data close to the core to cut memory latency.
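A quick way to see how the hardware looks from inside SQL Server is sys.dm_os_sys_info; a minimal check of the logical CPU count and the logical-per-socket ratio:

SELECT cpu_count,          -- logical processors visible to SQL Server
       hyperthread_ratio   -- logical processors exposed per physical socket
FROM sys.dm_os_sys_info;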
2) How Windows & SQL Server schedule threads
2.1 Windows
- The OS scheduler decides which thread runs on which core and when.
- A context switch occurs when the CPU pauses one thread and resumes another (saving/restoring registers; potential cache evictions).
2.2 SQL Server
- SQL Server runs its own user-mode scheduler (SQLOS, often called the SOS scheduler): one scheduler per logical processor, with schedulers grouped and managed per NUMA node.
- It manages SQL Workers (logical execution units). In the normal model there’s 1 Worker ↔ 1 OS thread.
- The OS ultimately runs the thread on a core.
[SQL Server SOS Scheduler]
 ├─ Worker A ↔ OS Thread #1
 ├─ Worker B ↔ OS Thread #2
 └─ Worker C ↔ OS Thread #3
[Windows Scheduler]
 ├─ OS Thread #1 → Core 0
 ├─ OS Thread #2 → Core 1
 └─ OS Thread #3 → Core 2
Fiber mode (Lightweight Pooling) is legacy: multiple fibers (Workers) time-slice on a single OS thread. It carries significant feature limitations and is not recommended.
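The Worker ↔ OS thread ↔ scheduler mapping sketched above can be observed directly from the DMVs; for example (user schedulers only):

-- One row per Worker: its scheduler (and CPU) plus the backing OS thread
SELECT s.scheduler_id, s.cpu_id, w.state AS worker_state, t.os_thread_id
FROM sys.dm_os_workers    AS w
JOIN sys.dm_os_schedulers AS s ON w.scheduler_address = s.scheduler_address
JOIN sys.dm_os_threads    AS t ON w.thread_address    = t.thread_address
WHERE s.scheduler_id < 255;   -- exclude hidden/internal schedulers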
3) SQL Server’s default CPU behavior
- No fixed split of cores into compute vs. I/O by default. Workers and I/O completion threads can run on any available core.
- SQL Server honors NUMA locality via per-node schedulers and worker pools.
- Modern Windows + SQL scheduling is sophisticated; manual core pinning rarely helps without special constraints.
4) NUMA & Soft-NUMA (with diagrams)
NUMA groups cores into nodes with local memory. Local memory access is faster than remote access across the interconnect.
SQL Server places schedulers, worker pools, and I/O completion threads per node to keep work local.
+-------------------------+        +-------------------------+
| NUMA Node 0             |        | NUMA Node 1             |
| Cores: 0 1 2 3          |  link  | Cores: 4 5 6 7          |
| Local Memory: [M0]      | <----> | Local Memory: [M1]      |
|   ^ (fast local)        |        |   ^ (fast local)        |
| remote → [M1] (slower)  |        | remote → [M0] (slower)  |
+-------------------------+        +-------------------------+
Soft-NUMA subdivides a physical node into logical nodes to scale scheduling and improve cache locality on large core counts. Since SQL Server 2016, it is enabled automatically at startup when a NUMA node or socket exposes more than eight physical cores.
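You can verify the node layout and whether (automatic) soft-NUMA is in effect from the DMVs; softnuma_configuration_desc is available in recent versions:

-- Online schedulers per node
SELECT node_id, memory_node_id, online_scheduler_count
FROM sys.dm_os_nodes
WHERE node_state_desc <> N'ONLINE DAC';   -- ignore the dedicated admin connection node

-- OFF / ON (automatic) / MANUAL
SELECT softnuma_configuration_desc
FROM sys.dm_os_sys_info;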
5) CPU-related settings you can change
| Category | Setting | Purpose | Typical use |
|---|---|---|---|
| Placement | ALTER SERVER CONFIGURATION SET PROCESS AFFINITY (legacy: affinity mask, affinity I/O mask) | Restrict which CPUs/NUMA nodes SQL may use | Co-tenant isolation; explicit NUMA pinning; rare I/O-heavy isolation |
| Parallelism | max degree of parallelism (MAXDOP) | Cap threads per parallel plan | Lower for OLTP; higher for DW/analytics |
| Parallelism | cost threshold for parallelism | Decide which queries may go parallel | Raise so only expensive queries parallelize |
| Resource control | Resource Governor (MAX_CPU_PERCENT, etc.) | Limit CPU share per workload group | Multi-tenant / mixed workloads |
| Threads | max worker threads | Upper bound for Workers | Usually default; adjust only if starvation is proven |
| Legacy (avoid) | Lightweight Pooling / Priority Boost | Reduce OS switches / raise process priority | Not recommended on modern Windows/SQL |
6) Settings: details & recommended starting points
6.1 Process affinity (and legacy masks)
Goal: explicitly pin SQL to specific CPUs or NUMA nodes. Prefer PROCESS AFFINITY
over legacy bitmasks. Consider only for co-tenant isolation, licensing boundaries, or validated interrupt/compute collisions.
-- Safer, modern approach
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 15;

-- or pin by NUMA node
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0, 1;
Rules of thumb: if you split compute vs. I/O, avoid overlap between CPU sets; don’t starve Workers by over-restricting cores; always A/B test.
6.2 MAXDOP
What it does: caps threads per parallel plan. Parallelism helps big scans/joins/aggregations but can hurt OLTP latency if overused.
Scope options: You can set the degree of parallelism at the instance, database, and query level.
- Instance level (server-wide default):
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'max degree of parallelism', 4;
RECONFIGURE;
- Database level (overrides instance for that DB):
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;
- Query level (overrides both):
SELECT ... FROM ... OPTION (MAXDOP 4);
- Starting points: OLTP: 1–4; mixed: 4–8; DW/analytics: align with cores within a NUMA node (often 8–16).
- Watch waits like CXCONSUMER/CXPACKET and SOS_SCHEDULER_YIELD when tuning (see the query below).
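A quick way to watch those waits between tuning iterations (values are cumulative since the last restart or wait-stats clear):

SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'CXPACKET', N'CXCONSUMER', N'SOS_SCHEDULER_YIELD')
ORDER BY wait_time_ms DESC;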
6.3 Cost threshold for parallelism
What it does: estimated cost above which a query may go parallel. The default (5) is usually too low for modern hardware.
- Starting point: 50–100 on busy OLTP; lower if mid-sized analytics truly need parallelism.
-- 'cost threshold for parallelism' is an advanced option; enable 'show advanced options' first (see 6.2)
EXEC sys.sp_configure 'cost threshold for parallelism', 80;
RECONFIGURE;
6.4 Resource Governor (CPU)
Use RG to keep noisy neighbors from hogging CPU. Classify sessions and apply limits per group.
-- Illustrative, minimal example
CREATE WORKLOAD GROUP wg_limited WITH (MAX_CPU_PERCENT = 30);
ALTER RESOURCE GOVERNOR RECONFIGURE;
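A workload group only receives sessions once a classifier function routes them there. A minimal sketch follows; the function name dbo.fn_rg_classifier and the login reporting_user are illustrative assumptions, and the function must live in master:

USE master;
GO
-- Hypothetical classifier: route one reporting login into wg_limited, everything else to default
CREATE FUNCTION dbo.fn_rg_classifier()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @grp sysname = N'default';
    IF SUSER_SNAME() = N'reporting_user'   -- assumed login name
        SET @grp = N'wg_limited';
    RETURN @grp;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;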
6.5 Max worker threads
Auto-sized by SQL based on CPU count. Don’t change unless you’ve confirmed worker starvation (many requests waiting with no workers) and understand the memory impact. Measure first with DMVs.
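A sketch of that measurement: look for queued tasks per scheduler and for tasks waiting on THREADPOOL.

-- Sustained work_queue_count > 0 on user schedulers suggests worker starvation
SELECT scheduler_id, current_tasks_count, runnable_tasks_count,
       active_workers_count, work_queue_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255;

-- Tasks currently waiting for a worker thread
SELECT session_id, wait_type, wait_duration_ms
FROM sys.dm_os_waiting_tasks
WHERE wait_type = N'THREADPOOL';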
6.6 Legacy (avoid): Lightweight Pooling / Priority Boost
- Lightweight Pooling (fiber mode): user-mode fiber switching to reduce OS-level context switches; strong feature limitations; rarely beneficial now.
- Priority Boost: raises process priority at the OS level; risks destabilizing the system. Avoid.
7) Affinity masks: visual guide
7.1 Split compute vs. I/O without overlap
CPU row:  CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
          -----------------------------------------------
Split:    [W ] [W ] [W ] [W ] [IO] [IO] [  ] [  ]
Overlap:  [W+IO] on the same CPU → contention (don't do this)
Tip: I/O completion threads are few and bursty. Default co-existence is fine for most systems. Isolation only pays off in extreme I/O + high CPU regimes—measure first.
7.2 Bitmask arithmetic (legacy masks)
CPU index:     7  6  5  4  3  2  1  0
Bit position:  7  6  5  4  3  2  1  0   (1 = use, 0 = ignore)

Example A (Workers on CPU0–3): 0b00001111 = 0x0F = 15
Example B (I/O on CPU4–5):     0b00110000 = 0x30 = 48

-- On 33+ logical CPUs, also set the *64 variants (affinity64 mask / affinity64 I/O mask) for the upper bits.
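If you must use the legacy masks, the same two examples would look like this via sp_configure (a sketch only; prefer PROCESS AFFINITY on current versions):

EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'affinity mask', 15;       -- 0x0F: Workers on CPU0–3
EXEC sys.sp_configure 'affinity I/O mask', 48;   -- 0x30: I/O completion on CPU4–5
RECONFIGURE;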
7.3 Prefer NUMA pinning over bitmasks
-- Pins SQL Server to NUMA nodes 0 and 1 (and all CPUs within them)
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0, 1;
8) Context switches & performance
Context switches are normal, but excessive switching increases cache misses and kernel transitions.
Overlapping compute & I/O on the same cores can raise switches; over-restricting cores can also backfire by reducing parallel capacity.
9) Diagnostics & benchmarking
9.1 DMVs
-- Schedulers / nodes / system info
SELECT scheduler_id, cpu_id, status, is_online, is_idle
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255;

SELECT node_id, memory_node_id, online_scheduler_count
FROM sys.dm_os_nodes;

SELECT cpu_count, scheduler_count, numa_node_count
FROM sys.dm_os_sys_info;

-- Workers / Threads
SELECT * FROM sys.dm_os_workers;
SELECT * FROM sys.dm_os_threads;

-- Waits (top)
SELECT TOP 50 * FROM sys.dm_os_wait_stats ORDER BY wait_time_ms DESC;

-- Current requests (check DOP)
SELECT session_id, request_id, scheduler_id, dop, status
FROM sys.dm_exec_requests;
9.2 PerfMon counters
- Process(sqlservr): % Processor Time, Context Switches/sec
- Processor(_Total): % Processor Time, Interrupts/sec
- System: Context Switches/sec, Processor Queue Length
- SQLServer:SQL Statistics: Batch Requests/sec, SQL Compilations/sec
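The SQL Statistics counters are also visible from inside the engine via sys.dm_os_performance_counters; note that the "/sec" counters are cumulative, so sample twice and take the difference to get a rate:

SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Batch Requests/sec', N'SQL Compilations/sec');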
10) Tuning playbook (step-by-step)
- Know the ground truth: cores, NUMA layout, workload type (OLTP/DW/mixed), SLAs.
- Parallelism first: raise cost threshold (e.g., 50–100); set MAXDOP for the workload (OLTP low, DW higher within a NUMA node).
- Resource isolation: on shared servers, use Resource Governor to cap noisy neighbors.
- Placement last: keep defaults unless you have hard constraints; if you split, avoid W/IO overlap and don’t over-constrain cores.
- Measure & iterate: baseline → change → re-measure throughput, latency, waits, and context switches. Roll back if gains are unclear.
- Avoid legacy switches: keep Lightweight Pooling and Priority Boost off.
11) FAQ
Should I use affinity to “reserve” cores for I/O?
Usually no. I/O completion threads are few and short-lived. Consider isolation only if you can demonstrate contention under extreme I/O and high CPU—and test carefully.
What MAXDOP is correct?
It depends. Start low for OLTP (1–4), higher for DW/analytics (often up to the cores within a NUMA node). Validate with waits and end-to-end latency.
Should I hand-tune Soft-NUMA?
Generally, let SQL Server manage it. Only adjust for special cases with A/B testing.
Will Priority Boost make SQL faster?
No—raising process priority can starve critical OS components (including I/O), often hurting SQL. Keep it disabled.
12) Appendix: DMV queries & counters
DMV snapshot
-- Core/NUMA layout & schedulers
SELECT * FROM sys.dm_os_schedulers WHERE scheduler_id < 255;
SELECT * FROM sys.dm_os_nodes;
SELECT * FROM sys.dm_os_sys_info;

-- Top waits
SELECT TOP 50 * FROM sys.dm_os_wait_stats ORDER BY wait_time_ms DESC;

-- Requests & DOP
SELECT session_id, request_id, scheduler_id, dop, status
FROM sys.dm_exec_requests;
PerfMon quick list
- Process(sqlservr) → % Processor Time, Context Switches/sec
- Processor(_Total) → % Processor Time, Interrupts/sec
- System → Context Switches/sec, Processor Queue Length
- SQLServer:SQL Statistics → Batch Requests/sec, SQL Compilations/sec