pyjuice.optim

EM-based optimizers for training probabilistic circuits (PCs). Unlike a torch.optim.Optimizer, a PC optimizer does not operate on nn.Parameter gradients: the backward pass accumulates parameter flows into the circuit, and the optimizer’s step() consumes those flows to perform an EM update. The forward and backward passes stay in your own training loop:

import pyjuice as juice

# pc is a compiled TensorCircuit; loader yields input batches
opt = juice.optim.MiniBatchEM(pc, step_size = 0.1, pseudocount = 0.01)
for x in loader:
    lls = pc(x)
    lls.mean().backward()   # accumulates parameter flows
    opt.step()              # EM update, then resets the flow accumulator

class pyjuice.optim.CircuitOptimizer(pc: TensorCircuit, pseudocount: float = 0.0, keep_zero_params: bool = False, ddp: bool = False, ddp_dtype: dtype | None = None, ddp_group=None, sync_every: int = 1)

Base class for PC parameter optimizers.

A PC optimizer drives Expectation-Maximization (EM) training of a TensorCircuit. Unlike a torch.optim.Optimizer, it does NOT operate on nn.Parameter gradients: the “gradient” of a PC is its set of parameter flows, accumulated into pc.param_flows (and each input layer’s param_flows) by the backward pass. An optimizer’s step() consumes those flows to perform an EM update of pc.params.

The intended loop keeps forward/backward in your own code and lets the optimizer handle the update:

opt = juice.optim.MiniBatchEM(pc, step_size = 0.1, pseudocount = 0.01)
for x in loader:
    lls = pc(x)
    lls.mean().backward()   # accumulates flows into pc.param_flows
    opt.step()              # EM update, then resets the flow accumulator

step() resets the flow accumulator after each update, so you never need to call zero_flows() manually in the common case. Concrete optimizers (FullBatchEM, MiniBatchEM, Anemone) live in their own modules and define how the accumulated flows are turned into a parameter update.

Parameters:

pc (TensorCircuit) – the PC to optimize
pseudocount (float) – Laplace-smoothing pseudocount added to the parameter flows during the update
keep_zero_params (bool) – if True, parameters that are exactly zero stay zero (the pseudocount is not applied to them)
ddp (bool) – if True, all-reduce the parameter flows across the torch.distributed process group before every update (via TensorCircuit.sync_param_flows()). No-op when torch.distributed is not initialized or the world size is 1.
ddp_dtype (Optional[torch.dtype]) – optional reduce dtype for the DDP all-reduce (e.g. torch.bfloat16 to halve communication on bandwidth-bound interconnects); the stored flows stay float32.
ddp_group – optional torch.distributed process group for the all-reduce
sync_every (int) – DDP synchronization cadence, in EM updates. With 1 (default), the parameter flows are reduced on every update, which is exact synchronous DDP. With a value > 1, each rank runs sync_every local EM updates on its own data shard without any flow reduction, and the parameters are then averaged across ranks (as in Local SGD), so the all-reduce happens sync_every times less often. Only meaningful when ddp = True. NOTE: sync_every > 1 is a different optimizer from the synchronous one (averaging parameters after local updates is not the same as one update on averaged flows), so its convergence should be validated.
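The difference described in the sync_every note can be checked on a toy case. The sketch below is not PyJuice code: it assumes a single sum node whose exact EM update just normalizes its flows, and the helper names (normalize, sync_update, local_then_average) are hypothetical.

```python
# Toy illustration (not PyJuice code): one sum node with two children,
# two ranks, and an exact EM update that simply normalizes the flows.

def normalize(flows):
    """Turn non-negative flows into parameters that sum to 1."""
    total = sum(flows)
    return [f / total for f in flows]

def sync_update(flows_per_rank):
    """sync_every = 1: reduce the flows across ranks, then do one update."""
    summed = [sum(col) for col in zip(*flows_per_rank)]
    return normalize(summed)

def local_then_average(flows_per_rank):
    """sync_every > 1 (one local update shown): update per rank, then average params."""
    local = [normalize(f) for f in flows_per_rank]
    return [sum(col) / len(local) for col in zip(*local)]

flows_per_rank = [[9.0, 1.0], [1.0, 1.0]]   # made-up flows for rank 0 and rank 1
sync_params = sync_update(flows_per_rank)          # about [0.833, 0.167]
local_params = local_then_average(flows_per_rank)  # about [0.7, 0.3]
print(sync_params, local_params)
```

Rank 0 carries most of the flow mass, so it dominates the synchronous update, while parameter averaging weights both ranks equally. In this toy setting the two schemes agree only when every rank accumulates the same total flow.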

zero_flows(): Reset the parameter-flow accumulator (pc.param_flows and every input layer’s param_flows). Called automatically at the end of every step(), so it is rarely needed explicitly.

step(step_size: float | None = None)

Consume the accumulated parameter flows to perform one EM update, then reset the accumulator.

Parameters:

step_size (Optional[float]) – if given, overrides the optimizer’s default step size for this step only; pass a per-step value here to reproduce a learning-rate schedule without a scheduler.
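One way to use the per-step override is to compute the step size from the iteration index. The helper below, cosine_step_size, is a hypothetical name and not part of PyJuice; the start and end values are arbitrary. The commented loop assumes pc, loader, and opt are set up as in the examples above.

```python
import math

def cosine_step_size(it, total_iters, start=0.1, end=0.01):
    """Hypothetical helper: cosine decay from `start` to `end` over `total_iters` steps."""
    progress = min(it, total_iters) / total_iters
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))

# Step sizes for a 4-step schedule, evaluated at iterations 0..4.
schedule = [cosine_step_size(it, total_iters=4) for it in range(5)]

# Intended use inside the training loop:
#   for it, x in enumerate(loader):
#       lls = pc(x)
#       lls.mean().backward()
#       opt.step(step_size = cosine_step_size(it, total_iters))
```

Because the override applies to a single call, the optimizer’s default step size is unchanged for any later step() call that omits the argument.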

class pyjuice.optim.FullBatchEM(pc: TensorCircuit, pseudocount: float = 0.0, keep_zero_params: bool = False, ddp: bool = False, ddp_dtype: dtype | None = None, ddp_group=None, sync_every: int = 1)

Full-batch EM. Accumulate parameter flows over the entire dataset, then perform a single exact EM M-step (step_size = 1.0). Use one step() per epoch:

opt = juice.optim.FullBatchEM(pc, pseudocount = 0.01)
for epoch in range(num_epochs):
    for x in loader:
        lls = pc(x)
        lls.mean().backward()   # flows accumulate over the whole epoch
    opt.step()                  # one exact EM update, then reset

See CircuitOptimizer for the constructor arguments.
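For intuition, the exact M-step can be sketched for a single sum node. This is a toy under stated assumptions, not the PyJuice kernel: em_m_step is a hypothetical helper, and it assumes the pseudocount is added to every child’s flow before normalization, which may differ from how PyJuice distributes it.

```python
def em_m_step(params, flows, pseudocount=0.0, keep_zero_params=False):
    """Toy exact M-step for one sum node: smoothed, normalized flows.

    Assumption: the pseudocount is added to every child's flow. The actual
    PyJuice smoothing may distribute it differently.
    """
    smoothed = []
    for p, f in zip(params, flows):
        if keep_zero_params and p == 0.0:
            smoothed.append(0.0)            # zero parameter stays zero
        else:
            smoothed.append(f + pseudocount)
    total = sum(smoothed)
    return [s / total for s in smoothed]

old_params = [0.5, 0.5, 0.0]
flows = [3.0, 1.0, 0.0]      # made-up flows accumulated over one epoch

plain = em_m_step(old_params, flows, pseudocount=1.0)                         # [4/7, 2/7, 1/7]
kept = em_m_step(old_params, flows, pseudocount=1.0, keep_zero_params=True)   # [2/3, 1/3, 0]
```

With keep_zero_params left at False, smoothing gives the third child a nonzero weight even though it received no flow; with True, that child stays at exactly zero.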

step(step_size: float | None = None)

Consume the accumulated parameter flows to perform one EM update, then reset the accumulator.

Parameters:

step_size (Optional[float]) – if given, overrides the optimizer’s default step size for this step only; pass a per-step value here to reproduce a learning-rate schedule without a scheduler.

class pyjuice.optim.MiniBatchEM(pc: TensorCircuit, step_size: float = 0.1, niters_per_update: int = 1, pseudocount: float = 0.0, keep_zero_params: bool = False, ddp: bool = False, ddp_dtype: dtype | None = None, ddp_group=None, sync_every: int = 1)

Mini-batch EM. Perform an EM update every niters_per_update minibatches, blending the old and newly-estimated parameters by step_size:

opt = juice.optim.MiniBatchEM(pc, step_size = 0.1, pseudocount = 0.01)
for x in loader:
    lls = pc(x)
    lls.mean().backward()
    opt.step()

Parameters:

step_size (float) – EM step size in (0, 1]; params <- (1 - step_size) * params + step_size * new_params
niters_per_update (int) – number of minibatches to accumulate per EM update (default 1)

The remaining arguments are those of CircuitOptimizer.
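The blending rule for step_size can be written out directly. The sketch below works on plain Python lists rather than on a TensorCircuit, and blend is a hypothetical helper name.

```python
def blend(params, new_params, step_size):
    """params <- (1 - step_size) * params + step_size * new_params, elementwise."""
    if not 0.0 < step_size <= 1.0:
        raise ValueError("step_size must lie in (0, 1]")
    return [(1.0 - step_size) * p + step_size * q for p, q in zip(params, new_params)]

params = [0.5, 0.5]
new_params = [0.9, 0.1]     # made-up M-step estimate from one minibatch

blended = blend(params, new_params, step_size=0.1)   # about [0.54, 0.46]
exact = blend(params, new_params, step_size=1.0)     # equals new_params
```

Since the update is a convex combination, blending two normalized parameter vectors yields a normalized vector, and step_size = 1.0 recovers the new estimate unchanged.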

step(step_size: float | None = None)

Consume the accumulated parameter flows to perform one EM update, then reset the accumulator.

Parameters:

step_size (Optional[float]) – if given, overrides the optimizer’s default step size for this step only; pass a per-step value here to reproduce a learning-rate schedule without a scheduler.

class pyjuice.optim.Anemone(pc: TensorCircuit, step_size: float = 0.4, momentum: float = 0.9, niters_per_update: int = 1, pseudocount: float = 1e-06, keep_zero_params: bool = False, ddp: bool = False, ddp_dtype: dtype | None = None, ddp_group=None, sync_every: int = 1)

The Anemone optimizer: scaled mini-batch EM with momentum.

It accumulates parameter flows over niters_per_update minibatches and then performs a flow-rescaled EM update (mini_batch_em(..., step_size_rescaling = True), i.e. the “mini_em_scaled” objective that normalizes by the accumulated flow mass). When momentum > 0, the accumulated flows are first passed through a bias-corrected exponential moving average (the same scheme as Adam-style momentum) before the update:

opt = juice.optim.Anemone(pc, step_size = 0.4, momentum = 0.9,
                          niters_per_update = 8, ddp = True)
for x in loader:
    lls = pc(x)
    lls.mean().backward()
    opt.step()              # fires the update every `niters_per_update` minibatches

The momentum is applied to both pc.param_flows and each input layer’s param_flows as:

f       <- (1 - momentum) * f
buffer  <- momentum * buffer + f
f       <- buffer / (1 - momentum ** (update_count + 1))   # bias correction
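The three assignments above can be traced on a small example. The class below is a toy over a flat Python list, not the PyJuice implementation; FlowMomentum and its attribute names are hypothetical.

```python
class FlowMomentum:
    """Toy bias-corrected EMA over a flat list of flows (not the PyJuice class)."""

    def __init__(self, size, momentum=0.9):
        self.momentum = momentum
        self.buffer = [0.0] * size
        self.update_count = 0

    def apply(self, flows):
        m = self.momentum
        scaled = [(1.0 - m) * f for f in flows]                     # f <- (1 - momentum) * f
        self.buffer = [m * b + s for b, s in zip(self.buffer, scaled)]
        correction = 1.0 - m ** (self.update_count + 1)             # bias correction
        self.update_count += 1
        return [b / correction for b in self.buffer]

ema = FlowMomentum(size=2, momentum=0.9)
# Feed the same made-up flows three times in a row.
outputs = [ema.apply([2.0, 4.0]) for _ in range(3)]
```

With constant input flows, the bias correction makes every output match the input (up to floating-point error) from the first update on, rather than starting near zero as an uncorrected moving average would.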

Parameters:

step_size (float) – EM step size in (0, 1] (used by the rescaled update)
momentum (float) – momentum coefficient in [0, 1); 0 disables momentum
niters_per_update (int) – number of minibatches to accumulate per EM update (default 1)

The remaining arguments are those of CircuitOptimizer (pseudocount defaults to 1e-6 here, matching typical Anemone training).
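The niters_per_update cadence can be pictured with a small stand-in. This is a toy, not PyJuice code: AccumulatingStepper is a hypothetical class, a single float stands in for pc.param_flows, and it assumes that step() calls which do not fire an update leave the accumulated flows untouched.

```python
class AccumulatingStepper:
    """Toy stand-in (not PyJuice code) for the niters_per_update cadence."""

    def __init__(self, niters_per_update=1):
        self.niters_per_update = niters_per_update
        self.calls = 0
        self.flows = 0.0          # stands in for pc.param_flows
        self.updates = []         # flow mass consumed by each update

    def accumulate(self, flow):
        self.flows += flow        # what the backward pass would do

    def step(self):
        self.calls += 1
        if self.calls % self.niters_per_update != 0:
            return False          # keep accumulating, no update yet
        self.updates.append(self.flows)
        self.flows = 0.0          # reset after the update
        return True

stepper = AccumulatingStepper(niters_per_update=4)
fired = []
for _ in range(8):
    stepper.accumulate(1.0)
    fired.append(stepper.step())
```

Over eight minibatches with niters_per_update = 4, the update fires on the fourth and eighth calls, each time consuming the flow mass of four minibatches.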

step(step_size: float | None = None)

Consume the accumulated parameter flows to perform one EM update, then reset the accumulator.

Parameters:

step_size (Optional[float]) – if given, overrides the optimizer’s default step size for this step only; pass a per-step value here to reproduce a learning-rate schedule without a scheduler.