SDK architecture

Understanding the W&B SDK’s event-driven architecture and how it handles logging internally

This guide explains how the W&B SDK handles logging internally through its event-driven architecture.

Overview

The W&B SDK uses an event-driven architecture designed to minimize impact on your training loops:

Non-blocking operations: Logging calls return immediately, and the slower work runs in a separate process
Asynchronous data handling: The SDK buffers logging data and sends it asynchronously to avoid blocking your training
Background synchronization: A dedicated service handles uploading data to W&B servers without interrupting your code

Event-driven architecture

How W&B handles logging

When you call wandb.log(), here’s what happens under the hood:

Data buffering: Your metrics are first written to an in-memory buffer
File streaming: Data is periodically flushed from the buffer to local files in the wandb directory
Background syncing: A separate process (wandb-service) handles uploading data to the W&B servers
Non-blocking returns: The log() call returns immediately without waiting for uploads

# This call returns immediately - doesn't wait for server upload
wandb.log({"loss": 0.5, "accuracy": 0.92})
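The four steps above can be modeled with a small standard-library sketch. This is a conceptual illustration only, not W&B's implementation: `ToyLogger`, `run_demo`, and the file name `events.jsonl` are hypothetical, and a background thread stands in for the separate wandb-service process.

```python
import json
import queue
import tempfile
import threading
import time
from pathlib import Path


class ToyLogger:
    """Conceptual model: a queue-backed logger with a background writer thread."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._queue: queue.Queue = queue.Queue()  # In-memory buffer
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, record: dict) -> None:
        # Enqueue and return immediately; the caller never waits on I/O.
        self._queue.put(record)

    def _drain(self) -> None:
        # Background worker: move records from the buffer to a local file.
        with self._path.open("a", encoding="utf-8") as f:
            while True:
                record = self._queue.get()
                if record is None:  # Sentinel value: stop the worker
                    break
                time.sleep(0.01)  # Stand-in for slow disk or network I/O
                f.write(json.dumps(record) + "\n")

    def finish(self) -> None:
        # Signal the worker to stop, then wait until everything is written.
        self._queue.put(None)
        self._worker.join()


def run_demo() -> tuple[float, int]:
    """Log 20 records; return (seconds spent in log calls, records persisted)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "events.jsonl"
        logger = ToyLogger(path)
        start = time.perf_counter()
        for step in range(20):
            logger.log({"step": step, "loss": 0.5 - step * 0.001})
        elapsed = time.perf_counter() - start
        logger.finish()
        persisted = len(path.read_text(encoding="utf-8").splitlines())
    return elapsed, persisted
```

In this sketch the 20 `log()` calls finish long before the simulated I/O does, and `finish()` plays the role of waiting for the remaining data to be written.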

Architecture diagram

The following diagram illustrates the flow of data through W&B’s event-driven architecture:

flowchart TD
    A[Your Training Script] 
    B[Memory Buffer]
    C[Local Files<br/>wandb directory]
    D[wandb-service<br/>background process]
    E[W&B Servers]
    
    A -->|"wandb.log()"| B
    B -->|periodic flush| C
    C -->|async upload| D
    D -->|network sync| E
    
    classDef scriptNode fill:#ff99ff,stroke:#333,stroke-width:2px
    classDef serviceNode fill:#9999ff,stroke:#333,stroke-width:2px
    classDef serverNode fill:#99ff99,stroke:#333,stroke-width:2px
    classDef storageNode fill:#e8e8e8,stroke:#333,stroke-width:2px
    
    class A scriptNode
    class B,C storageNode
    class D serviceNode
    class E serverNode

Key components

Memory buffer

Stores metrics temporarily in RAM
Minimizes disk I/O operations
Automatically manages size to prevent memory issues
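One common way to keep an in-memory buffer from growing without bound is to flush it once it reaches a fixed size. The sketch below assumes that strategy purely for illustration; the source does not describe how W&B sizes its buffer, and `BoundedBuffer`, `max_items`, and `demo_buffer` are hypothetical names.

```python
from typing import Callable


class BoundedBuffer:
    """Hypothetical buffer that flushes once it holds max_items records."""

    def __init__(self, max_items: int, flush_fn: Callable[[list], None]) -> None:
        if max_items < 1:
            raise ValueError("max_items must be at least 1")
        self._max_items = max_items
        self._flush_fn = flush_fn
        self._items: list = []

    def add(self, record: dict) -> None:
        self._items.append(record)
        # Flush as soon as the size limit is reached.
        if len(self._items) >= self._max_items:
            self.flush()

    def flush(self) -> None:
        if self._items:
            batch, self._items = self._items, []
            self._flush_fn(batch)

    def __len__(self) -> int:
        return len(self._items)


def demo_buffer() -> list:
    """Add 7 records to a buffer of size 3; return the flushed batch sizes."""
    flushed: list = []
    buf = BoundedBuffer(max_items=3, flush_fn=flushed.append)
    for step in range(7):
        buf.add({"step": step})
    buf.flush()  # Flush the final partial batch, as a shutdown hook might
    return [len(batch) for batch in flushed]
```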

Local files

Persistent storage in the wandb directory
Protects data that has already been flushed to disk, even if the process crashes
Allows resuming uploads after network failures
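To see why local files make resumable uploads possible, consider an append-only log plus a record of how much has already been uploaded. This is a minimal sketch under assumed names (`append_records`, `read_pending`, `run.jsonl`) and an assumed JSON Lines layout; the actual file format in the wandb directory is not described here.

```python
import json
import tempfile
from pathlib import Path


def append_records(path: Path, records: list) -> None:
    """Append each record as one JSON line."""
    with path.open("a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_pending(path: Path, uploaded_count: int) -> list:
    """Return the records that come after the first uploaded_count lines."""
    with path.open(encoding="utf-8") as f:
        lines = f.read().splitlines()
    return [json.loads(line) for line in lines[uploaded_count:]]


def demo_resume() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "run.jsonl"
        append_records(path, [{"step": s} for s in range(5)])
        # Pretend the network failed after the first two records were uploaded.
        return read_pending(path, uploaded_count=2)
```

Because the file persists across failures, an uploader only needs the count of confirmed records to pick up where it left off.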

wandb-service process

Runs independently from your training script
Handles all network communication
Implements retry logic with exponential back-off
Manages authentication and API interactions
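Retry with exponential back-off means waiting longer after each failed attempt, up to a cap. The sketch below shows the general technique; the base delay, growth factor, cap, and the helper names `backoff_delays` and `retry` are assumptions for illustration and do not reflect wandb-service's actual settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, factor: float, max_delay: float, attempts: int) -> list:
    """Delays that grow geometrically from base and are capped at max_delay."""
    return [min(base * factor**i, max_delay) for i in range(attempts)]


def retry(op: Callable[[], T], delays: list, sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op; on ConnectionError, wait for the next delay and try again."""
    for delay in delays:
        try:
            return op()
        except ConnectionError:
            sleep(delay)
    # Final attempt: any exception now propagates to the caller.
    return op()
```

For example, `backoff_delays(1.0, 2.0, 10.0, 5)` yields waits of 1, 2, 4, 8, and 10 seconds. Passing a custom `sleep` function makes the retry logic easy to test without real waiting.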

Network layer

Uploads data in batches for efficiency
Compresses data before transmission
Handles connection failures
Supports offline mode with automatic sync when reconnected
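Batching and compression can be illustrated with the standard library. This sketch assumes JSON Lines payloads and gzip, which are illustrative choices; the source does not say which wire format or compression scheme W&B uses, and the function names are hypothetical.

```python
import gzip
import json


def make_batches(records: list, batch_size: int) -> list:
    """Split records into consecutive batches of at most batch_size items."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def compress_batch(batch: list) -> bytes:
    """Serialize a batch as JSON Lines and gzip it."""
    payload = "\n".join(json.dumps(record) for record in batch).encode("utf-8")
    return gzip.compress(payload)


def decompress_batch(blob: bytes) -> list:
    """Inverse of compress_batch."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]
```

Sending one compressed batch instead of many small requests reduces both per-request overhead and bytes on the wire, which is the efficiency the list above refers to.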

Process isolation

W&B achieves true non-blocking behavior through process isolation:

# Main training process
import random

import wandb


def train_step():
    # Placeholder for a real training step; returns a simulated loss
    return random.random()


epochs = 10

# wandb.init() starts a separate wandb-service process;
# your training continues without waiting
run = wandb.init(project="my-project")

for epoch in range(epochs):
    loss = train_step()

    # Returns immediately - data is handed off to wandb-service
    wandb.log({"loss": loss})

wandb.finish()

The wandb-service process handles:

File system operations
Network requests
Data compression
Error handling and retries
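The idea of handing records to a separate operating-system process can be sketched with `subprocess` and a pipe. This is only an analogy: the child program, the JSON Lines protocol, and the name `send_to_child` are hypothetical, and wandb-service uses its own communication mechanism that is not described here.

```python
import json
import subprocess
import sys

# Child program: reads JSON lines from stdin and reports how many it received.
CHILD_CODE = """
import json, sys
count = 0
for line in sys.stdin:
    json.loads(line)  # Validate each record
    count += 1
print(count)
"""


def send_to_child(records: list) -> int:
    """Stream records to a separate Python process; return its record count."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    payload = "".join(json.dumps(record) + "\n" for record in records)
    out, _ = proc.communicate(payload)
    return int(out.strip())
```

The parent only serializes and writes the records; everything else (parsing, and in a real service, disk and network work) happens in the child's own process.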

Data flow example

Here’s a practical example showing the complete data flow:

import wandb
import time

# Initialize W&B - spawns background process
run = wandb.init(project="architecture-demo")

# Simulate training loop
for step in range(100):
    # Your computation (e.g., neural network forward pass)
    loss = 0.5 - (step * 0.001)  # Simulated decreasing loss
    accuracy = 0.6 + (step * 0.002)  # Simulated increasing accuracy
    
    # Log metrics - returns immediately
    start_time = time.perf_counter()  # Monotonic clock suited to short intervals
    wandb.log({
        "loss": loss,
        "accuracy": accuracy,
        "step": step
    })
    log_time = time.perf_counter() - start_time

    # Usually very short, but the exact figure depends on your system
    print(f"Logging took {log_time*1000:.2f}ms")
    
    # Continue with training - no waiting for uploads
    time.sleep(0.1)  # Simulate training time

wandb.finish()

Benefits of the architecture

Performance: Training loops aren’t blocked by I/O operations
Reliability: Local storage protects logged data against crashes and network failures
Scalability: Can handle high-frequency logging through buffering
Flexibility: Works seamlessly in various environments (local, cloud, clusters)
Resilience: Continues logging even during network outages

Related resources

SDK Performance Guidelines - Best practices for optimal logging performance
Configuration Reference - Detailed configuration options
W&B API Reference - Complete API documentation

Last modified August 29, 2025
