This demo trains a deeper stochastic network (no backprop) on a $K$-armed bandit using the neural update rules described earlier:
Bias update: For each neuron $v$ with firing indicator $v_f \in \{0,1\}$ and EMA firing-rate estimate $\rho$, update the bias as $v_b := v_b - \beta(\rho - v_f)$.
Weight update: For each edge $(u \to v)$ that was active during a trajectory, update $w(u,v) := w(u,v) + \alpha\,\gamma^{d}\,A_u$, where $A_u$ is the advantage assigned at node $u$ and $d$ is the edge's backward distance along the trajectory, so $\gamma^{d}$ discounts edges further from the outcome. Both rules are sketched in code below.
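
As a rough illustration, here is a minimal Python sketch of both rules. The class and function names, the edge/advantage bookkeeping, and the default values of $\beta$, $\alpha$, $\gamma$, and the EMA decay are assumptions made for illustration, not details taken from the demo itself:

```python
class Neuron:
    """Stochastic neuron tracking an EMA estimate rho of its firing rate.

    `beta` (bias step size) and `ema_decay` are assumed hyperparameters;
    neither value is specified in the text above.
    """
    def __init__(self, beta: float = 0.01, ema_decay: float = 0.99):
        self.bias = 0.0
        self.rate = 0.5          # rho: EMA firing-rate estimate
        self.beta = beta
        self.ema_decay = ema_decay

    def update_bias(self, fired: bool) -> None:
        v_f = 1.0 if fired else 0.0
        # Refresh the EMA estimate rho from the latest firing indicator.
        self.rate = self.ema_decay * self.rate + (1.0 - self.ema_decay) * v_f
        # Bias rule: v_b := v_b - beta * (rho - v_f)
        self.bias -= self.beta * (self.rate - v_f)


def update_weights(weights: dict, active_edges: list, advantages: dict,
                   alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Weight rule: w(u,v) := w(u,v) + alpha * gamma**d * A_u.

    `active_edges` lists the (u, v) edges that fired during the trajectory
    in forward order; `d` is each edge's backward distance from the end.
    """
    T = len(active_edges)
    for t, (u, v) in enumerate(active_edges):
        d = T - 1 - t            # backward distance discounts earlier edges more
        weights[(u, v)] = weights.get((u, v), 0.0) + alpha * gamma**d * advantages[u]
```

Note that only edges that were actually active in the trajectory receive the weight update, matching the rule above.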
The network interacts with a $K$-armed bandit environment that pays Bernoulli rewards, and it outputs an action distribution via a softmax over its output nodes.
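
A minimal sketch of the environment side, assuming NumPy: a Bernoulli bandit plus softmax action sampling over the network's output activations. The random `output_activations` vector is a stand-in for a real forward pass through the stochastic network, and the arm probabilities are drawn arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBandit:
    """K-armed bandit: arm i pays reward 1 with probability probs[i], else 0."""
    def __init__(self, probs):
        self.probs = np.asarray(probs)

    def pull(self, arm: int) -> float:
        return float(rng.random() < self.probs[arm])

def softmax(logits):
    z = logits - logits.max()    # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

K = 5
bandit = BernoulliBandit(rng.uniform(0.1, 0.9, size=K))
output_activations = rng.normal(size=K)   # stand-in for the network's output nodes
action = int(rng.choice(K, p=softmax(output_activations)))
reward = bandit.pull(action)
```

Each pull yields a 0/1 reward, from which the per-node advantages $A_u$ can be derived and fed back through the update rules above.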