I think strengthening paths on a graph can work a lot like reinforcement learning, and one reduction of how brains work is strengthening paths on a graph. In real life, brains are wayyyyyy more complex.
Start with a directed acyclic graph $G = (V, E)$ with edge weights $w: E \to \mathbb{R}$. Applying a topological sort, we get an ordered sequence of vertices. The $n$ input nodes are $I = \{v \in V : \text{in-degree}(v) = 0\}$ and the $m$ output nodes are $O = \{v \in V : \text{out-degree}(v) = 0\}$.
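As a concrete sketch of this setup in Python (the example graph, the random weight initialisation, and the variable names are my own choices, not anything specified above):

```python
import random
from collections import deque

# A small example DAG; the edge weights w: E -> R are initialised randomly here.
edges = [("x0", "h0"), ("x1", "h0"), ("x1", "h1"),
         ("h0", "y0"), ("h1", "y0"), ("h1", "y1")]
w = {e: random.gauss(0.0, 1.0) for e in edges}

nodes = {u for e in edges for u in e}
succ = {v: [] for v in nodes}
in_deg = {v: 0 for v in nodes}
for u, v in edges:
    succ[u].append(v)
    in_deg[v] += 1

# Kahn's algorithm gives the topological order the nodes are processed in.
order, queue, deg = [], deque(v for v in nodes if in_deg[v] == 0), dict(in_deg)
while queue:
    u = queue.popleft()
    order.append(u)
    for v in succ[u]:
        deg[v] -= 1
        if deg[v] == 0:
            queue.append(v)

# Input nodes I have in-degree 0, output nodes O have out-degree 0.
inputs = [v for v in order if in_deg[v] == 0]
outputs = [v for v in order if not succ[v]]
```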
The graph operates by taking an input observation $x \in \mathbb{R}^n$ and producing a one-hot vector in $\mathbb{R}^m$, processing the nodes in topological order.
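Here is a minimal sketch of the forward pass, continuing the code above and assuming a non-input node fires when the summed weights of its fired predecessors plus its bias is positive; the threshold-at-zero rule and the handling of input nodes are my assumptions, since the firing rule isn't pinned down here.

```python
def forward(x, order, edges, w, bias, inputs):
    """Process nodes in topological order. Input nodes take their activation
    from the observation x; every other node sums the weights of its fired
    predecessors plus its bias. A node fires when its activation is positive
    (this threshold-at-zero rule is an assumption)."""
    pred = {v: [] for v in order}
    for u, v in edges:
        pred[v].append(u)
    activation, fired = {}, {}
    for v in order:
        if v in inputs:
            activation[v] = x[inputs.index(v)]
        else:
            activation[v] = bias[v] + sum(w[(u, v)] for u in pred[v] if fired[u])
        fired[v] = activation[v] > 0.0
    return activation, fired

bias = {v: 0.0 for v in order}   # node biases, initialised to zero
activation, fired = forward([1.0, -0.5], order, edges, w, bias, inputs)
```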
We can then sample an action $a \in O$ from a $\text{softmax}$ over the output nodes' activations.
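Continuing the sketch, sampling from a softmax over the output activations (subtracting the max is just for numerical stability):

```python
import math
import random

def sample_action(activation, outputs):
    """Softmax over the output nodes' activations, then sample one of them."""
    logits = [activation[v] for v in outputs]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return random.choices(outputs, weights=probs, k=1)[0]

a = sample_action(activation, outputs)
```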
Let’s suppose that at the end we receive a reward $r$. This lets us update the edge weights and the node biases. Take the firing edges $E_f = \{ (u, v) \in E : u \text{ fired and } v \text{ fired} \}$. We can update each firing edge's weight by $w(u, v) := w(u, v) + \alpha \cdot r$.
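A direct sketch of that rule, continuing the code above (the value of the learning rate $\alpha$ here is arbitrary):

```python
def reward_update(w, edges, fired, r, alpha=0.1):
    """Strengthen (or weaken, if r < 0) every edge whose endpoints both fired."""
    for (u, v) in edges:
        if fired[u] and fired[v]:
            w[(u, v)] += alpha * r
    return w
```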
More carefully, we can perform two updates: one on the biases and one on the weights.
We update a neuron $v$’s bias with $v_b := v_b - \beta(\rho - v_f)$, where $v_f$ indicates whether or not the neuron fired and $\rho$ is its time-averaged firing rate / probability of firing. We also update the weight $w(v, u) := \alpha \gamma A_u$, where $A_u$ is the advantage at node $u$, so the advantage gets discounted by $\gamma$ as you propagate backwards.
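A sketch of both updates, continuing the code above. The running-average estimate of $\rho$, the baseline subtracted from $r$ to form the advantage at the output nodes, the max over successors used to push the advantage backwards, and reading the weight rule as an increment are all my own filling-in:

```python
def backward_update(w, bias, rate, order, edges, succ, fired, r, baseline,
                    alpha=0.1, beta=0.01, gamma=0.9):
    # Bias update v_b := v_b - beta * (rho - v_f), where rho is tracked here as
    # a running average of how often the neuron has fired.
    for v in order:
        f = 1.0 if fired[v] else 0.0
        bias[v] -= beta * (rate[v] - f)
        rate[v] = 0.99 * rate[v] + 0.01 * f

    # Advantage: r - baseline at the output nodes, discounted by gamma at each
    # hop as it propagates backwards through the topological order. How an
    # interior node combines its successors' advantages (a max here) is my guess.
    advantage = {}
    for v in reversed(order):
        if not succ[v]:
            advantage[v] = r - baseline
        else:
            advantage[v] = gamma * max(advantage[u] for u in succ[v])

    # Weight update w(v, u) := alpha * gamma * A_u; I read this as an increment,
    # though the text writes it as a plain assignment.
    for (v, u) in edges:
        w[(v, u)] += alpha * gamma * advantage[u]
    return w, bias, rate

rate = {v: 0.0 for v in order}   # running estimate of each neuron's rho
w, bias, rate = backward_update(w, bias, rate, order, edges, succ, fired,
                                r=1.0, baseline=0.0)
```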