I think strengthening paths on a graph can work a lot like reinforcement learning, and one reduction of how brains work is strengthening paths on a graph. In real life, brains are wayyyyyy more complex.
Start with a directed acyclic graph $G = (V, E)$ with edge weights $w: E \to \mathbb{R}$. Applying a topological sort, we get an ordered sequence of vertices. The $n$ input nodes are $I = \{v \in V : \text{in-degree}(v) = 0\}$ and the $m$ output nodes are $O = \{v \in V : \text{out-degree}(v) = 0\}$.
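As a concrete sketch of this setup in Python (the example graph, the random weight initialisation, and the variable names are my own choices, not anything specified above):

```python
import random
from collections import deque

# A small example DAG; the edge weights w: E -> R are initialised randomly here.
edges = [("x0", "h0"), ("x1", "h0"), ("x1", "h1"),
         ("h0", "y0"), ("h1", "y0"), ("h1", "y1")]
w = {e: random.gauss(0.0, 1.0) for e in edges}

nodes = {u for e in edges for u in e}
succ = {v: [] for v in nodes}
in_deg = {v: 0 for v in nodes}
for u, v in edges:
    succ[u].append(v)
    in_deg[v] += 1

# Kahn's algorithm gives the topological order the nodes are processed in.
order, queue, deg = [], deque(v for v in nodes if in_deg[v] == 0), dict(in_deg)
while queue:
    u = queue.popleft()
    order.append(u)
    for v in succ[u]:
        deg[v] -= 1
        if deg[v] == 0:
            queue.append(v)

# Input nodes I have in-degree 0, output nodes O have out-degree 0.
inputs = [v for v in order if in_deg[v] == 0]
outputs = [v for v in order if not succ[v]]
```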
The graph operates by taking an input observation $x \in \mathbb{R}^n$ and producing a one-hot vector in $\mathbb{R}^m$, processing the nodes in topological order.
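Here is a minimal sketch of the forward pass, continuing the code above and assuming a non-input node fires when the summed weights of its fired predecessors plus its bias is positive; the threshold-at-zero rule and the handling of input nodes are my assumptions, since the firing rule isn't pinned down here.

```python
def forward(x, order, edges, w, bias, inputs):
    """Process nodes in topological order. Input nodes take their activation
    from the observation x; every other node sums the weights of its fired
    predecessors plus its bias. A node fires when its activation is positive
    (this threshold-at-zero rule is an assumption)."""
    pred = {v: [] for v in order}
    for u, v in edges:
        pred[v].append(u)
    activation, fired = {}, {}
    for v in order:
        if v in inputs:
            activation[v] = x[inputs.index(v)]
        else:
            activation[v] = bias[v] + sum(w[(u, v)] for u in pred[v] if fired[u])
        fired[v] = activation[v] > 0.0
    return activation, fired

bias = {v: 0.0 for v in order}   # node biases, initialised to zero
activation, fired = forward([1.0, -0.5], order, edges, w, bias, inputs)
```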
We can then sample an action $a \in O$ from a $\text{softmax}$ over the output nodes' activations.
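Continuing the sketch, sampling from a softmax over the output activations (subtracting the max is just for numerical stability):

```python
import math
import random

def sample_action(activation, outputs):
    """Softmax over the output nodes' activations, then sample one of them."""
    logits = [activation[v] for v in outputs]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return random.choices(outputs, weights=probs, k=1)[0]

a = sample_action(activation, outputs)
```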
Let’s suppose that at the end we receive a reward $r$. This lets us update the edge weights and the node biases. Take the firing edges $E_f = \{ (u, v) \in E : u \text{ fired and } v \text{ fired} \}$. We can update each firing edge's weight by $w(u, v) := w(u, v) + \alpha \cdot r$.
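A direct sketch of that rule, continuing the code above (the value of the learning rate $\alpha$ here is arbitrary):

```python
def reward_update(w, edges, fired, r, alpha=0.1):
    """Strengthen (or weaken, if r < 0) every edge whose endpoints both fired."""
    for (u, v) in edges:
        if fired[u] and fired[v]:
            w[(u, v)] += alpha * r
    return w
```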
More carefully, we can perform two updates: one on the biases and one on the weights.
We update a neuron $v$’s bias with $v_b := v_b - \beta(\rho - v_f)$, where $v_f$ indicates whether or not the neuron fired and $\rho$ is its time-averaged firing rate / probability of firing. We also update the weight $w(v, u) := \alpha \gamma A_u$, where $A_u$ is the advantage at node $u$, so the advantage gets discounted by $\gamma$ as you propagate backwards.
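A sketch of both updates, continuing the code above. The running-average estimate of $\rho$, the baseline subtracted from $r$ to form the advantage at the output nodes, the max over successors used to push the advantage backwards, and reading the weight rule as an increment are all my own filling-in:

```python
def backward_update(w, bias, rate, order, edges, succ, fired, r, baseline,
                    alpha=0.1, beta=0.01, gamma=0.9):
    # Bias update v_b := v_b - beta * (rho - v_f), where rho is tracked here as
    # a running average of how often the neuron has fired.
    for v in order:
        f = 1.0 if fired[v] else 0.0
        bias[v] -= beta * (rate[v] - f)
        rate[v] = 0.99 * rate[v] + 0.01 * f

    # Advantage: r - baseline at the output nodes, discounted by gamma at each
    # hop as it propagates backwards through the topological order. How an
    # interior node combines its successors' advantages (a max here) is my guess.
    advantage = {}
    for v in reversed(order):
        if not succ[v]:
            advantage[v] = r - baseline
        else:
            advantage[v] = gamma * max(advantage[u] for u in succ[v])

    # Weight update w(v, u) := alpha * gamma * A_u; I read this as an increment,
    # though the text writes it as a plain assignment.
    for (v, u) in edges:
        w[(v, u)] += alpha * gamma * advantage[u]
    return w, bias, rate

rate = {v: 0.0 for v in order}   # running estimate of each neuron's rho
w, bias, rate = backward_update(w, bias, rate, order, edges, succ, fired,
                                r=1.0, baseline=0.0)
```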