The Setup

I built a little Snake game with a twist: it's also a reinforcement-learning playground. The game a human plays and the game the AI trains against are literally the same code — one Gymnasium environment, SnakeEnv, shared by both. You hit TAB mid-game to hand control to the AI and back. The agent is a Deep Q-Network (DQN) from stable-baselines3, training on a tiny multi-layer perceptron against a 20×20 grid.

The whole thing is three commands:

python main.py play                    # play it yourself; TAB toggles the AI
python main.py train --timesteps N     # train a brain (headless, fast)
python main.py watch                   # watch a trained brain loop

You can check out the project on GitHub here.

That's the stage. Here's the problem.

The Symptom: A Snake That Can't Stop Strangling Itself

I trained the agent hard — 2.5 million timesteps. And it learned... to chase food. That part it nailed. Drop an apple anywhere and it makes a beeline for it.

But watch it for thirty seconds and the flaw is obvious. It barrels toward the apple in a straight line, and when its own body is in the way, it reacts — it turns. But only by one cell. Then it tries to go for the apple again, hits the same wall of its own tail, backs off one cell, tries again. It wedges itself into corners and dead-ends constantly. For all those millions of training steps, it played like it had no idea its own tail existed.

My first instinct was the obvious one: train it longer. Make the network bigger. That instinct was wrong, and understanding why it was wrong is the whole point of this post.

The Diagnosis: It Wasn't Dumb, It Was Blind

A DQN learns a function from what it observes to which action is best. The ceiling on how smart it can possibly get is set by what's in that observation. If the information needed to make a good decision isn't in the input, no amount of training will conjure it. You can't learn to avoid a trap you can't perceive.

So I looked at what the agent could actually see. The observation was 11 numbers:

  • Danger — is there an obstacle in the single cell immediately ahead, to the right, or to the left? (Three yes/no flags.)
  • Heading — which of the four directions am I going? (One-hot.)
  • Food direction — is the apple up/down/left/right of my head? (Four flags.)

Read that list again with the bug in mind. The danger sensor looks exactly one cell in each direction. The agent has no information about where its tail is, how long it is, or — crucially — whether the direction it's about to commit to opens into free space or leads straight into an enclosed pocket.

It behaves exactly as you'd predict from that input: walk toward the food, and only flinch when a body segment becomes the immediately adjacent cell. By then it's already painted into the corner. It literally cannot see the trap forming — it can only feel the wall once its nose is against it.
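To make the blindness concrete, here is a minimal sketch of what an 11-flag observation like the one described above might look like, applied to two very different boards. The function name, the ordering of the flags, and both example snakes are assumptions made for illustration, not the project's actual code.

```python
GRID_SIZE = 20
# Clockwise order, as (row delta, col delta): up, right, down, left.
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def blocked(cell, body):
    """True if the cell is off the grid or occupied by the snake's body."""
    r, c = cell
    return not (0 <= r < GRID_SIZE and 0 <= c < GRID_SIZE) or cell in body


def old_observation(snake, heading, food):
    """11 flags: 3 danger bits, 4 heading one-hot, 4 food-direction flags."""
    head = snake[0]
    body = set(snake[:-1])  # tail cell excluded; it vacates on the next tick
    i = DIRS.index(heading)
    obs = []
    # Danger bits for: straight ahead, right turn, left turn.
    for d in (DIRS[i], DIRS[(i + 1) % 4], DIRS[(i - 1) % 4]):
        obs.append(float(blocked((head[0] + d[0], head[1] + d[1]), body)))
    obs += [float(heading == d) for d in DIRS]
    obs += [
        float(food[0] < head[0]),  # food is up
        float(food[0] > head[0]),  # food is down
        float(food[1] < head[1]),  # food is left
        float(food[1] > head[1]),  # food is right
    ]
    return obs


food = (10, 15)
heading = (0, 1)  # moving right

# Board A: a short snake in the middle of an open grid.
snake_a = [(10, 10), (10, 9), (10, 8)]

# Board B: the snake has spiraled inward. Its body forms a closed ring
# around the head, leaving a 7-cell pocket that is currently sealed.
snake_b = (
    [(10, 10), (10, 9), (10, 8), (9, 8), (8, 8)]
    + [(8, c) for c in range(9, 13)]
    + [(r, 12) for r in range(9, 13)]
    + [(12, c) for c in range(11, 7, -1)]
    + [(11, c) for c in range(8, -1, -1)]
)

obs_a = old_observation(snake_a, heading, food)
obs_b = old_observation(snake_b, heading, food)
print(obs_a == obs_b)
```

Both boards produce the same 11 numbers, so any policy built on this input has to pick the same action in the open field and inside the ring.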

This reframes everything. The agent wasn't under-trained. It was blind, and I'd spent 2.5 million steps perfecting blind reflexes.

The Fix: Give It Eyes, Not More Reps

If the problem is missing information, the fix is feature engineering — design a richer observation. I added three things, in rough order of impact.

1. Reachable Free Space (The Big One)

For each of the three moves the snake can make (straight, turn right, turn left), flood-fill from the cell it would land on and count how many cells are reachable. Normalize by the total number of cells on the grid (400 on the 20×20 board). A move into a tiny 4-cell pocket scores near zero (4/400 = 0.01); a move into open territory scores high.

This is the feature that grants tail-awareness. The agent can now see a dead end before it walks into one, because the dead end shows up as a collapse in free space.

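def _free_space(self, head, d, body):
    """Fraction of the grid reachable after stepping from head in direction d."""
    start = (head[0] + d[0], head[1] + d[1])
    sr, sc = start
    # Stepping off the grid or into the body is fatal: no free space at all.
    if not (0 <= sr < GRID_SIZE and 0 <= sc < GRID_SIZE) or start in body:
        return 0.0

    # Iterative flood-fill over the four orthogonal neighbors.
    seen = {start}
    stack = [start]
    while stack:
        r, c = stack.pop()
        for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1)):
            if (0 <= nr < GRID_SIZE and 0 <= nc < GRID_SIZE
                    and (nr, nc) not in body and (nr, nc) not in seen):
                seen.add((nr, nc))
                stack.append((nr, nc))
    # Normalize by the total cell count so the feature stays in [0, 1].
    return len(seen) / (GRID_SIZE * GRID_SIZE)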
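As a rough check of how this feature separates a pocket from open territory, here is a standalone sketch of the same flood-fill on a tiny 5×5 grid. The function signature, the grid size, and the wall layout are assumptions chosen so the numbers are easy to verify by hand.

```python
def free_space(start, body, grid_size):
    """Fraction of grid cells reachable from start without crossing body."""
    r0, c0 = start
    if not (0 <= r0 < grid_size and 0 <= c0 < grid_size) or start in body:
        return 0.0
    seen = {start}
    stack = [start]
    while stack:
        r, c = stack.pop()
        for cell in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = cell
            if (0 <= nr < grid_size and 0 <= nc < grid_size
                    and cell not in body and cell not in seen):
                seen.add(cell)
                stack.append(cell)
    return len(seen) / (grid_size * grid_size)


# An L-shaped run of body cells walls off the top-left 2x2 corner.
body = {(0, 2), (1, 2), (2, 2), (2, 1), (2, 0)}

pocket = free_space((1, 1), body, grid_size=5)      # inside the corner
open_area = free_space((3, 3), body, grid_size=5)   # everywhere else
print(pocket, open_area)
```

The pocket holds 4 of the 25 cells and the open side holds 16, so the two candidate moves differ by a factor of four in this single feature.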

2. Tail Direction

Where is the tail relative to the head (up/down/left/right)? This nudges the agent toward the classic Snake survival strategy: chase your own tail, because the cell your tail is leaving is always about to become safe.
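A minimal sketch of how such a feature could be computed, assuming it uses the same four-flag layout as the food direction and that the snake is stored head-first as (row, col) tuples. Both the function name and the flag order are illustrative assumptions.

```python
def tail_direction(snake):
    """Four flags: is the tail up, down, left, or right of the head?"""
    head, tail = snake[0], snake[-1]
    return [
        float(tail[0] < head[0]),  # tail is above the head
        float(tail[0] > head[0]),  # tail is below the head
        float(tail[1] < head[1]),  # tail is left of the head
        float(tail[1] > head[1]),  # tail is right of the head
    ]


# Head at (5, 5); the tail sits one row down and one column to the left.
flags = tail_direction([(5, 5), (5, 4), (6, 4)])
print(flags)
```

Two flags can be set at once when the tail is diagonal to the head, which is the case in this example.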

3. Danger Distance Instead of a Danger Bit

The old sensor was binary: obstacle in the next cell, yes or no. I replaced it with a ray-cast: how many clear cells are there before the first obstacle (wall or body), divided by the grid width. 0 means "the obstacle is right there," higher means "clear runway ahead." Same three directions, but now with depth.

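def _ray_danger(self, head, d, body):
    """Clear cells in direction d before the first obstacle, scaled by GRID_SIZE."""
    steps = 0
    # Start at the cell next to the head and walk outward.
    r, c = head[0] + d[0], head[1] + d[1]
    while 0 <= r < GRID_SIZE and 0 <= c < GRID_SIZE and (r, c) not in body:
        steps += 1
        r += d[0]
        c += d[1]
    # 0.0 means the adjacent cell is already blocked.
    return steps / GRID_SIZE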
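A few sample values help show what the ray-cast reports. This is a standalone sketch of the same loop; the signature and the example positions are assumptions for illustration.

```python
def ray_danger(head, d, body, grid_size=20):
    """Count clear cells from head along d, divided by grid_size."""
    steps = 0
    r, c = head[0] + d[0], head[1] + d[1]
    while 0 <= r < grid_size and 0 <= c < grid_size and (r, c) not in body:
        steps += 1
        r += d[0]
        c += d[1]
    return steps / grid_size


RIGHT, UP = (0, 1), (-1, 0)
body = {(10, 14)}

near_body = ray_danger((10, 10), RIGHT, body)  # 3 clear cells, then body
to_wall = ray_danger((10, 10), UP, body)       # 10 clear cells, then wall
at_wall = ray_danger((0, 5), UP, body)         # wall is the next cell
longest = ray_danger((10, 0), RIGHT, set())    # 19 clear cells across a row
print(near_body, to_wall, at_wall, longest)
```

Because the head itself occupies one cell of any row or column, the largest value this version can return on a 20-wide grid is 19/20, not 1.0.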

I also tossed in a normalized snake length so the policy can learn to get more cautious as it grows. All told, the observation went from 11 floats to 19.
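The jump from 11 to 19 can be tallied feature by feature. The grouping names below are labels invented for this sketch, and the order in which the real environment concatenates them is not stated here.

```python
# Old observation: one bit per danger direction.
OLD_LAYOUT = {
    "danger_bits": 3,
    "heading_one_hot": 4,
    "food_direction": 4,
}

# New observation: danger bits become distances, plus three new groups.
NEW_LAYOUT = {
    "danger_distance": 3,   # ray-cast, replaces the danger bits
    "heading_one_hot": 4,
    "food_direction": 4,
    "free_space": 3,        # flood-fill per candidate move
    "tail_direction": 4,
    "length": 1,            # normalized snake length
}

old_size = sum(OLD_LAYOUT.values())
new_size = sum(NEW_LAYOUT.values())
print(old_size, new_size)
```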

A couple of correctness details that matter: both the flood-fill and the ray-cast treat the body as snake[:-1] — the tail cell is excluded, because it vacates on the next tick and isn't a real obstacle. This matches the actual self-collision rule in the game's step function, so the agent's perception lines up with reality. The flood-fill is also an iterative stack rather than recursion, since the pure-Python environment is already the training bottleneck and the hot path needs to stay tight.
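The tail exclusion can be illustrated with a small sketch of a self-collision check. This is an assumed form of the rule, written to match the description above, not the project's actual step function.

```python
def hits_self(snake, new_head):
    """True if new_head lands on the body, ignoring the tail cell.

    The tail is skipped because it moves away on the same tick that the
    head arrives, so the cell is free by the time the move completes.
    """
    return new_head in snake[:-1]


# A snake curled into a 2x2 loop, stored head-first: the head at (5, 5)
# is adjacent to its own tail at (6, 5).
snake = [(5, 5), (5, 6), (6, 6), (6, 5)]

into_tail = hits_self(snake, (6, 5))  # moving onto the tail cell
into_body = hits_self(snake, (5, 6))  # moving onto a real body segment
print(into_tail, into_body)
```

If the features counted the tail as an obstacle, the agent would see this loop as a dead end even though following the tail is a legal move.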

The Payoff

Changing the observation shape means the old brain is incompatible — different input size — so I retrained from scratch. Worth a quick note: the trainer defaults to resuming from the last checkpoint, so the first run blew up with a shape-mismatch error until I passed --fresh. Obvious in hindsight.

Then the result: 100,000 timesteps with the new observation beat the old 2.5-million-step agent. Roughly twice as good — at 1/25th the training.

The new snake threads through tight gaps, follows its tail, and fills the board instead of strangling itself in a corner. Same algorithm. Same tiny network. Same reward function. The only thing that changed was what the agent was allowed to see.

The Lesson

It's tempting — especially right now — to treat every "the model isn't smart enough" problem as a compute problem. Train longer. Scale the network. Throw more at it.

But compute can only optimize over the information you hand the model. 2.5 million steps couldn't teach the snake something its eyes couldn't report. The 25× cut in training steps didn't come from a better optimizer or a bigger net. It came from a flood-fill: from deciding the agent should be able to see open space.

Representation is the lever. Before you scale the compute, ask whether the information the model needs is even in the input. Often it isn't, and the cheapest, highest-leverage fix in all of machine learning is just: let the model see the thing it needs to see.

The project is a Snake game + DQN agent that share a single Gymnasium environment, so the AI trains against exactly the game a human plays. Built in Python with stable-baselines3 and pygame-ce.