Performance Tuning
Comprehensive guide to tuning QuicD for maximum performance. Learn how to optimize for throughput, latency, CPU usage, and memory consumption.
Performance Overview
QuicD is designed for high performance with:
- Zero-copy I/O: Direct buffer passing via io_uring
- CPU affinity: Pin workers to specific cores
- NUMA awareness: Allocate memory on correct NUMA node
- Multi-threading: Scale across all CPU cores
- eBPF routing: Hardware-accelerated connection steering
Typical Performance:
- Throughput: 10+ Gbps per worker on modern hardware
- Latency: Sub-millisecond connection setup
- Connections: 100K+ concurrent connections per instance
- CPU: Near-linear scaling up to core count
Quick Wins
Start with these high-impact optimizations:
```toml
# config.toml - High-performance baseline

[runtime]
worker_count = 8            # Match physical CPU cores
enable_cpu_affinity = true  # Pin workers to cores

[netio]
recv_buffer_size = 2097152  # 2MB receive buffers
send_buffer_size = 2097152  # 2MB send buffers
batch_size = 64             # Process 64 packets per batch

[quic]
max_streams_bidi = 1000     # Increase for concurrent streams
initial_max_data = 10485760 # 10MB initial flow control
max_ack_delay = 10          # Reduce for low latency
```

Expected impact: 2-3x throughput improvement over the defaults.
Worker Thread Tuning
Worker Count
Rule of thumb: One worker per physical CPU core (not hyperthreads).
```toml
[runtime]
worker_count = 8  # For an 8-core CPU
```

Finding the optimal count:
```sh
# Test different worker counts
for workers in 4 8 16; do
  quicd --worker-count $workers &
  PID=$!
  # Run your benchmark
  kill $PID
done
```

Guidelines:
- < Physical cores: Underutilization, lower throughput
- = Physical cores: Optimal for most workloads
- > Physical cores: Contention, context switching overhead
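To pick a starting `worker_count` programmatically, note that the standard library reports *logical* CPUs, so on SMT/hyperthreaded hosts you typically halve it. A minimal sketch; `suggested_worker_count` is a hypothetical helper, not a QuicD API, and the halving heuristic is an assumption you should verify against `lscpu`:

```rust
use std::thread;

// Derive a starting worker_count from the host's CPU count.
// available_parallelism() reports *logical* CPUs, so on SMT machines
// dividing by 2 approximates physical cores (assumption - verify with lscpu).
fn suggested_worker_count(smt_enabled: bool) -> usize {
    let logical = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    if smt_enabled { (logical / 2).max(1) } else { logical }
}

fn main() {
    println!("suggested worker_count = {}", suggested_worker_count(true));
}
```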
CPU Affinity
Pin workers to specific cores to improve cache locality:
```toml
[runtime]
enable_cpu_affinity = true
```

Benefits:
- Better L1/L2 cache hit rates
- Reduced context switching
- More predictable latency
Verify affinity:
```sh
# Check worker thread CPU binding
ps -eLo pid,tid,psr,comm | grep quicd
```

NUMA Configuration
For multi-socket systems, ensure memory is allocated on the correct NUMA node:
```toml
[netio]
numa_aware = true
```

Check NUMA topology:
```sh
numactl --hardware
```

Manual NUMA binding (if needed):
```sh
# Bind to NUMA node 0
numactl --cpunodebind=0 --membind=0 quicd --config config.toml
```

Network I/O Tuning
Buffer Sizes
Larger buffers reduce syscalls but increase memory usage:
```toml
[netio]
recv_buffer_size = 2097152  # 2MB
send_buffer_size = 2097152  # 2MB
```

Tuning:
- Throughput: Increase to 4-8MB for bulk transfers
- Latency: Decrease to 256-512KB for interactive traffic
- Memory: Each connection uses these buffers
System limits:
```sh
# Increase system UDP buffer limits
sudo sysctl -w net.core.rmem_max=8388608
sudo sysctl -w net.core.wmem_max=8388608
```

Batch Processing
Process multiple packets per syscall:
```toml
[netio]
batch_size = 64  # Process 64 packets at once
```

Guidelines:
- High throughput: 64-128
- Low latency: 16-32
- CPU-bound: Lower values reduce latency spikes
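The syscall savings from batching are easy to quantify: at a fixed packet rate, syscalls per second shrink roughly linearly with the batch size. A back-of-envelope sketch (`syscalls_per_sec` is an illustrative helper, not a QuicD API):

```rust
// How batch_size trades syscall rate against latency: each syscall moves
// up to `batch_size` packets, so round up for partial batches.
fn syscalls_per_sec(packets_per_sec: u64, batch_size: u64) -> u64 {
    (packets_per_sec + batch_size - 1) / batch_size
}

fn main() {
    // At 1M pps, batch_size 64 -> 15625 syscalls/s instead of 1,000,000
    println!("{}", syscalls_per_sec(1_000_000, 64));
    println!("{}", syscalls_per_sec(1_000_000, 1));
}
```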
io_uring Configuration
Tune io_uring submission/completion queue sizes:
```toml
[netio]
io_uring_entries = 4096  # SQ/CQ size
```

Guidelines:
- High load: 4096-8192 entries
- Low load: 1024-2048 entries
- Memory: Each entry uses ~64 bytes
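Using the ~64 bytes/entry figure above, the ring memory cost stays modest even at large sizes. A rough estimate, assuming one ring per worker (`ring_memory_bytes` is an illustrative helper):

```rust
// Rough io_uring queue memory, per the ~64 bytes/entry figure.
// Assumes one ring per worker; treat as an order-of-magnitude estimate.
fn ring_memory_bytes(workers: u64, entries: u64) -> u64 {
    workers * entries * 64
}

fn main() {
    // 8 workers × 4096 entries × 64 B = 2,097,152 bytes (2 MiB) total
    println!("{}", ring_memory_bytes(8, 4096));
}
```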
QUIC Protocol Tuning
Connection Limits
Control maximum concurrent connections:
```toml
[quic]
max_connections = 100000
```

Memory impact: each connection uses ~10KB baseline plus its buffer sizes.
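That per-connection figure makes memory budgets easy to sanity-check. A sketch, assuming the `[netio]` buffers are allocated per connection (`memory_budget_bytes` is an illustrative helper, not a QuicD API):

```rust
// Budget check using the ~10KB-per-connection baseline, assuming the
// recv/send buffers are allocated per connection.
const BASELINE: u64 = 10 * 1024; // ~10KB baseline per connection

fn memory_budget_bytes(connections: u64, recv_buf: u64, send_buf: u64) -> u64 {
    connections * (BASELINE + recv_buf + send_buf)
}

fn main() {
    // 100K connections with the 2MB quick-win buffers ≈ 420 GB:
    // large buffers and high connection counts multiply quickly,
    // so high-concurrency deployments need smaller buffers.
    println!("{}", memory_budget_bytes(100_000, 2_097_152, 2_097_152));
}
```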
Per-connection limits:
```toml
max_streams_bidi = 1000  # Bidirectional streams per connection
max_streams_uni = 1000   # Unidirectional streams per connection
```

Flow Control
Tune initial flow control windows:
```toml
[quic]
initial_max_data = 10485760             # 10MB connection window
initial_max_stream_data_bidi = 1048576  # 1MB per bidi stream
initial_max_stream_data_uni = 1048576   # 1MB per uni stream
```

Guidelines:
- High BDP networks: Increase to RTT × bandwidth
- Many small streams: Decrease per-stream, increase connection
- Few large streams: Increase per-stream
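The RTT × bandwidth guideline above is the bandwidth-delay product; a small sketch of the calculation (`bdp_window_bytes` is an illustrative helper, not a QuicD API):

```rust
// Size the flow-control window from the bandwidth-delay product
// (RTT × bandwidth), per the high-BDP guideline.
fn bdp_window_bytes(rtt_ms: u64, bandwidth_bits_per_sec: u64) -> u64 {
    // bits/s -> bytes/s, then scale by RTT in seconds
    (bandwidth_bits_per_sec / 8) * rtt_ms / 1000
}

fn main() {
    // 50 ms at 1 Gbps -> 6,250,000 bytes (~6.25 MB)
    println!("{}", bdp_window_bytes(50, 1_000_000_000));
}
```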
Calculate optimal window:
Window = RTT × Bandwidth

Example: 50ms × 1Gbps = 50ms × 125MB/s = 6.25MB

Congestion Control
Choose a congestion control algorithm:
```toml
[quic]
congestion_control = "cubic"  # or "bbr"
```

Options:
- CUBIC: Default, good for most use cases
- BBR: Optimized for high-latency or loss-prone networks
ACK Handling
Tune ACK frequency and delay:
```toml
[quic]
max_ack_delay = 10  # milliseconds
```

Guidelines:
- Low latency: 5-10ms
- High throughput: 25ms (default)
- Satellite/high RTT: 50-100ms
Application-Level Optimization
Stream Management
Use unidirectional streams when possible:
```rust
// Faster - unidirectional
let mut send = handle.open_unidirectional_stream().await?;
send.write_all(&data).await?;

// Slower - bidirectional
let (mut send, _recv) = handle.open_bidirectional_stream().await?;
send.write_all(&data).await?;
```

Reuse streams instead of opening many:
```rust
// Bad - many streams
for item in items {
    let mut stream = handle.open_unidirectional_stream().await?;
    stream.write_all(&item).await?;
}

// Good - one stream
let mut stream = handle.open_unidirectional_stream().await?;
for item in items {
    stream.write_all(&item).await?;
}
```

Batching
Batch small writes to reduce overhead:
```rust
// Bad - many small writes
for byte in data {
    stream.write_all(&[byte]).await?;
}

// Good - batch writes
stream.write_all(&data).await?;
```

Async Task Spawning
Spawn tasks for concurrent streams:
```rust
while let Some(AppEvent::NewStream { stream_id, .. }) = events.next().await {
    let (send, recv) = handle.accept_bidirectional_stream().await?;

    // Spawn task to handle concurrently
    tokio::spawn(async move {
        handle_stream(send, recv).await
    });
}
```

Memory Management
Use zero-copy where possible:
```rust
// Avoid copying
let data = recv.read_to_end().await?;
send.write_all(&data).await?; // Copies data

// Better - stream directly
let mut buf = [0u8; 4096];
loop {
    let n = recv.read(&mut buf).await?;
    if n == 0 { break; }
    send.write_all(&buf[..n]).await?;
}
```

Monitoring Performance
Section titled “Monitoring Performance”Built-in Metrics
Enable telemetry for performance monitoring:
```toml
[telemetry]
enabled = true
metrics_port = 9090
```

Key metrics:
- quicd_packets_sent_total - Packet send rate
- quicd_bytes_sent_total - Throughput
- quicd_connections_active - Concurrent connections
- quicd_connection_handshake_duration_seconds - Handshake latency
Connection Stats
Query per-connection statistics:
```rust
let stats = handle.stats();
println!("RTT: {:?}", stats.rtt);
println!("Bytes sent: {}", stats.bytes_sent); // cumulative, not a rate
println!("Packet loss: {}%", (stats.packets_lost as f64 / stats.packets_sent as f64) * 100.0);
```

System Monitoring
Monitor CPU usage:
```sh
top -H -p $(pgrep quicd)
```

Monitor network utilization:
```sh
iftop -i eth0
```

Monitor memory:
```sh
pmap -x $(pgrep quicd)
```

Benchmarking
Section titled “Benchmarking”Connection Throughput
Test maximum throughput per connection:
```sh
# Server
quicd --config high-throughput.toml

# Client (using example client)
cargo run --release --example h3_client -- \
  --url https://server:4433/large-file \
  --connections 1
```

Concurrent Connections
Test maximum concurrent connections:
```sh
# Load testing with many connections
wrk2 -t 8 -c 10000 -d 60s --latency https://server:4433/
```
Measure request latency distribution:
```sh
# Histogram of request latencies
hey -n 100000 -c 100 https://server:4433/
```

Performance Profiles
Section titled “Performance Profiles”High-Throughput Profile
Optimize for maximum data transfer rate:
```toml
[runtime]
worker_count = 16
enable_cpu_affinity = true

[netio]
recv_buffer_size = 8388608  # 8MB
send_buffer_size = 8388608
batch_size = 128
io_uring_entries = 8192

[quic]
initial_max_data = 52428800             # 50MB
initial_max_stream_data_bidi = 10485760 # 10MB
max_ack_delay = 25
congestion_control = "bbr"
```

Use case: Large file transfers, video streaming, bulk data
Low-Latency Profile
Optimize for minimum request latency:
```toml
[runtime]
worker_count = 8
enable_cpu_affinity = true

[netio]
recv_buffer_size = 524288  # 512KB
send_buffer_size = 524288
batch_size = 16
io_uring_entries = 1024

[quic]
initial_max_data = 1048576            # 1MB
initial_max_stream_data_bidi = 262144 # 256KB
max_ack_delay = 5
congestion_control = "cubic"
```

Use case: Interactive applications, gaming, API servers
High-Concurrency Profile
Optimize for maximum concurrent connections:
```toml
[runtime]
worker_count = 32
enable_cpu_affinity = true

[netio]
recv_buffer_size = 262144  # 256KB
send_buffer_size = 262144
batch_size = 32

[quic]
max_connections = 500000
max_streams_bidi = 100
initial_max_data = 1048576
initial_max_stream_data_bidi = 131072  # 128KB
```

Use case: IoT gateways, connection brokers, massive multiplexing
System-Level Tuning
Linux Kernel Parameters
Optimize system limits (e.g. in /etc/sysctl.conf):
```
# Increase file descriptor limit
fs.file-max = 2097152

# Increase network buffer limits
net.core.rmem_max = 134217728     # 128MB
net.core.wmem_max = 134217728
net.core.rmem_default = 16777216  # 16MB
net.core.wmem_default = 16777216

# Increase connection tracking table
net.netfilter.nf_conntrack_max = 1048576

# Enable TCP/UDP tuning
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.udp_mem = 102400 873800 16777216

# Disable IPv6 if not needed
net.ipv6.conf.all.disable_ipv6 = 1
```

Apply changes:
```sh
sudo sysctl -p
```

Resource Limits
Increase process limits (in /etc/security/limits.conf):
```
* soft nofile 1048576
* hard nofile 1048576
* soft memlock unlimited
* hard memlock unlimited
```

CPU Governor
Set CPU governor to performance mode:
```sh
# Set all CPUs to performance mode
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance | sudo tee $cpu
done
```

Troubleshooting Performance Issues
Low Throughput
Section titled “Low Throughput”Symptoms: Below expected Gbps throughput
Diagnosis:
- Check CPU usage: Should be 70-90% per worker
- Check network utilization: iftop -i eth0
- Check packet loss: Look at connection stats
Solutions:
- Increase buffer sizes
- Enable CPU affinity
- Increase batch_size
- Check for packet loss (tune congestion control)
High Latency
Symptoms: Slow request response times
Diagnosis:
- Check RTT in connection stats
- Look for high ACK delays
- Check for CPU saturation
Solutions:
- Reduce max_ack_delay
- Reduce batch_size
- Use low-latency profile
- Check system load
Memory Issues
Symptoms: High memory usage or OOM
Diagnosis:
- Check connection count
- Check buffer sizes
- Look for memory leaks (use valgrind)
Solutions:
- Reduce buffer sizes
- Limit max_connections
- Check application for leaks
CPU Saturation
Symptoms: 100% CPU usage, low throughput
Diagnosis:
- Profile with perf: perf record -g --call-graph dwarf -- quicd
- Check for lock contention
- Look for hot loops
Solutions:
- Increase worker_count
- Enable CPU affinity
- Reduce per-worker load
Related Documentation
- Configuration Reference - All configuration options
- Architecture - System design and threading
- Errors - Troubleshooting errors
- API Reference - Application API
Performance Checklist
Before deploying to production:
- Worker count matches physical CPU cores
- CPU affinity enabled
- Buffer sizes tuned for workload
- System limits increased (ulimit, sysctl)
- Telemetry enabled for monitoring
- Benchmarked under load
- Tested with expected traffic patterns
- CPU governor set to performance
- NUMA configured for multi-socket
- Application uses async/concurrent patterns