Three Iterations on a 2048 AI: What Representation and Reward Design Actually Do

The Setup

I built a 2048 AI using Deep Q-Learning (DQN) in PyTorch. The game itself is standard 2048 — a 4×4 grid, tiles that double when they collide, game over when no moves remain. You can play it yourself via a tkinter GUI, or let the AI run. The AI trains against a pure-Python environment with no display overhead, using 8 parallel environments to maximize throughput on a single CPU.

python -m game                          # play it yourself
python train.py                         # train the enhanced agent
python watch.py                         # watch a trained brain play
python benchmark.py                     # compare all checkpoints vs a rule-bot

The code contains three distinct training modes: standard, corner, and enhanced. Each one represents a deliberate upgrade to either the reward function, the state representation, or both. That progression is the point of this post.

You can check out the project on GitHub here.

Iteration 1: The Baseline — Raw Score, Flat State

The simplest possible framing: encode each tile as a single float and reward the agent for increasing the score.

The state is 16 floats — one per cell — where each value is log₂(tile) / 11. A blank cell is 0. A 2 is 0.09. A 2048 is 1.0. The network is a 3-layer MLP: 16 → 256 → 256 → 4 outputs (one Q-value per direction).

def _get_state(self):
    return [
        math.log2(v) / 11.0 if v > 0 else 0.0
        for row in self.board.grid
        for v in row
    ]
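
To make the numbers concrete, here is a minimal standalone sketch of the same encoding applied to a sample board. The free function `encode_flat` and the sample grid are illustrative assumptions, not code from the project.

```python
import math

def encode_flat(grid):
    """Flatten a 4x4 grid into 16 floats: log2(tile) / 11, with blanks as 0.0."""
    return [math.log2(v) / 11.0 if v > 0 else 0.0 for row in grid for v in row]

# Hypothetical board: a 2 and a 4 on top, a 512 bottom-left, a 2048 bottom-right.
grid = [
    [2, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [512, 0, 0, 2048],
]
state = encode_flat(grid)
print(len(state))            # 16
print(round(state[0], 2))    # 0.09  (tile 2)
print(round(state[12], 2))   # 0.82  (tile 512, row 3, column 0)
print(state[15])             # 1.0   (tile 2048)
```

Position 12 in the flat vector is row 3, column 0, because cells are laid out row by row.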

The reward is equally minimal: the raw score delta on each step. If the move produces no change, the agent takes a -10 penalty.
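
The baseline reward is not shown in the post, so the following is only a sketch of what the description implies. The function name `baseline_reward` and its signature are assumptions.

```python
def baseline_reward(score_delta, changed):
    """Raw score delta per step, with a flat penalty for a move that changes nothing."""
    if not changed:
        return -10.0
    return float(score_delta)

print(baseline_reward(4, True))     # 4.0     merging two 2s
print(baseline_reward(2048, True))  # 2048.0  merging two 1024s
print(baseline_reward(0, False))    # -10.0   move that did not change the board
```

Nothing in this signal depends on where tiles sit or how crowded the board is, which is the gap the later iterations try to close.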

This agent learns something. It merges tiles. It doesn't make obviously stupid moves forever. But it gets stuck in a recognizable failure mode: it chases small merges greedily and lets the board fill up. It'll happily merge a pair of 2s and collect +4 while a 512 tile sits stranded in a corner with no room to grow. The raw score reward doesn't distinguish between useful merges (building toward 2048) and trivial ones (burning off low tiles), so the agent optimizes for the path of least resistance.

Iteration 2: Encoding a Human Heuristic — Corner Bias

Any human who's played 2048 seriously knows the corner strategy: keep your highest tile locked in one corner and build a monotonically decreasing chain away from it. This way, merges cascade efficiently and you never strand a large tile in the middle.

The corner mode encodes this directly into the reward. The state representation is unchanged (log2 floats, flat MLP), but the reward function gains a new term: a bonus equal to log₂(max_tile) whenever the max tile is in the target corner, and a penalty of the same magnitude if it leaves.

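def _reward(self, prev_grid, _action, score_delta, changed):
    if not changed:
        return -10.0  # the move did nothing: flat penalty
    grid = self.board.grid
    max_val = max(v for row in grid for v in row)
    prev_max = max(v for row in prev_grid for v in row)
    reward = float(score_delta)
    # Bonus while the max tile sits in the target corner
    if grid[CORNER_ROW][CORNER_COL] == max_val:
        reward += math.log2(max_val)
    # Penalty when the max tile was in the corner and this move dislodged it
    if prev_grid[CORNER_ROW][CORNER_COL] == prev_max and grid[CORNER_ROW][CORNER_COL] != max_val:
        reward -= math.log2(prev_max)
    return reward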

This matters. The corner bonus scales with the tile value — keeping a 1024 in the corner earns 10 points worth of bonus per step, making it worth protecting. The agent does learn to anchor its largest tile.

But the state representation is still 16 compressed floats. The network can sense something about the board, but it can't easily reason about spatial patterns — which cells are adjacent, whether a row is monotonically ordered, whether there's a pocket of high tiles near the corner. We gave the agent a better target, but didn't give it better eyes.

Iteration 3: Richer Representation + Richer Reward

The enhanced mode changes both sides of the equation at once: how the board is encoded, and what the agent is rewarded for.

State: One-Hot Encoding Across 17 Channels

Instead of one float per cell, the enhanced state is a flat vector of 17 × 16 = 272 values: 17 channels, one per possible cell value (channel 0 for blank, channel 1 for tile=2, channel 2 for tile=4, up to channel 16 for tile=65536), each covering the 16 cells of the grid. Each cell gets exactly one 1.0, in the channel corresponding to its tile value.

def _state(self):
    state = [0.0] * (17 * 16)
    for r in range(4):
        for c in range(4):
            v = self.board.grid[r][c]
            idx = int(math.log2(v)) if v > 0 else 0
            state[idx * 16 + r * 4 + c] = 1.0
    return state

This gives the network a spatially explicit signal. Channel 9 says "there's a 512 tile, and it's in position (r, c)." The log2 encoding compresses that into a single ambiguous float — 0.82 — which the network has to decode back before it can reason about the value. One-hot removes that indirection.
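
The index arithmetic is easy to check by hand. The sketch below restates the encoder as a free function and adds a decoder to show that the mapping is reversible. Both function names and the sample grid are illustrative assumptions.

```python
import math

def encode_one_hot(grid):
    """Channel-major one-hot encoding: index = channel * 16 + row * 4 + col."""
    state = [0.0] * (17 * 16)
    for r in range(4):
        for c in range(4):
            v = grid[r][c]
            channel = int(math.log2(v)) if v > 0 else 0
            state[channel * 16 + r * 4 + c] = 1.0
    return state

def decode_one_hot(state):
    """Invert the encoding: channel 0 means blank, channel k means tile 2**k."""
    grid = [[0] * 4 for _ in range(4)]
    for channel in range(17):
        for cell in range(16):
            if state[channel * 16 + cell] == 1.0:
                grid[cell // 4][cell % 4] = 0 if channel == 0 else 2 ** channel
    return grid

grid = [
    [2, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [512, 256, 0, 0],
]
state = encode_one_hot(grid)
print(state[9 * 16 + 12])            # 1.0: channel 9 (512) at row 3, column 0
print(state[8 * 16 + 13])            # 1.0: channel 8 (256) at row 3, column 1
print(sum(state))                    # 16.0: exactly one hot entry per cell
print(decode_one_hot(state) == grid) # True
```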

The network architecture also changes to match. The enhanced model uses two convolutional layers before the MLP, so it can learn local spatial patterns directly from the 4×4 grid structure.

class DQNEnhanced(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(17, 128, kernel_size=2),   # (17,4,4) → (128,3,3)
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=2),  # (128,3,3) → (128,2,2)
            nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(128 * 2 * 2, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 4),
        )
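
The excerpt stops before the forward pass. Presumably it has to reshape the flat 272-value state into 17 channels of 4×4 before the convolutions, then flatten the result to 512 features for the linear layers. The sketch below checks that shape arithmetic in plain Python. The helper names `conv_out` and `to_channels` are hypothetical, and the reshape assumes the channel-major layout produced by the encoder above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def to_channels(flat):
    """Reshape a flat channel-major 272-vector into [17][4][4] nested lists."""
    return [
        [[flat[ch * 16 + r * 4 + c] for c in range(4)] for r in range(4)]
        for ch in range(17)
    ]

side = 4
side = conv_out(side, kernel=2)   # first 2x2 conv: 4 -> 3
side = conv_out(side, kernel=2)   # second 2x2 conv: 3 -> 2
flat_features = 128 * side * side
print(side, flat_features)        # 2 512

planes = to_channels([0.0] * 272)
print(len(planes), len(planes[0]), len(planes[0][0]))  # 17 4 4
```

The 512 matches the input size of the first linear layer, nn.Linear(128 * 2 * 2, 512).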

Reward: Four Components Instead of One

The enhanced reward function combines four signals, each targeting a different failure mode of the baseline:

Log-scaled merge reward. log₂(score_delta + 1) instead of raw score. This compresses the reward range: merging a pair of 1024s (score delta: 2048) no longer pays 512× what merging a pair of 2s pays (score delta: 4). After the log, the two rewards are roughly 11 and 2.3. No single merge can swamp the other three terms.
Corner bonus. Carried over from the corner mode — keep the max tile anchored.
Monotonicity. For each row and column, count how many adjacent pairs are in non-decreasing or non-increasing order, and take the max. A perfectly ordered row scores 3; a chaotic one scores less. This nudges the agent toward the chain structure that makes merges cascade.
Empty cell bonus. +0.5 for each empty cell on the board. This is the one that prevents the most common early death: the board fills up because the agent never learned to leave room for new tiles.
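
The post does not show `_monotonicity`, so here is one plausible reading of the description above, written as free functions. The names `line_monotonicity` and `monotonicity` and the sample grid are assumptions.

```python
def line_monotonicity(line):
    """Max of (non-decreasing pairs, non-increasing pairs) among adjacent cells."""
    pairs = list(zip(line, line[1:]))
    non_decreasing = sum(1 for a, b in pairs if a <= b)
    non_increasing = sum(1 for a, b in pairs if a >= b)
    return max(non_decreasing, non_increasing)

def monotonicity(grid):
    """Sum the per-line score over all 4 rows and all 4 columns."""
    rows = [list(row) for row in grid]
    cols = [list(col) for col in zip(*grid)]
    return sum(line_monotonicity(line) for line in rows + cols)

ordered = [
    [1024, 512, 256, 128],
    [64, 32, 16, 8],
    [4, 2, 0, 0],
    [0, 0, 0, 0],
]
print(line_monotonicity([1024, 512, 256, 128]))  # 3: perfectly ordered
print(line_monotonicity([2, 8, 4, 16]))          # 2: zig-zag
print(monotonicity(ordered))                     # 24: every row and column ordered
```

Under this reading, a line of four cells can never score below 2, because each of its three adjacent pairs counts toward at least one of the two directions. The spread between an ordered and a chaotic board is therefore fairly narrow.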

def _reward(self, prev_grid, score_delta, changed):
    if not changed:
        return -10.0  # the move did nothing: flat penalty
    grid = self.board.grid
    max_val = max(v for row in grid for v in row)
    # 1. Log-scaled merge reward
    reward = math.log2(score_delta + 1) if score_delta > 0 else 0.0
    # 2. Corner bonus
    if grid[CORNER_ROW][CORNER_COL] == max_val:
        reward += math.log2(max_val)
    # 3. Monotonicity, down-weighted
    reward += self._monotonicity(grid) * 0.1
    # 4. Empty cell bonus
    empty = sum(1 for row in grid for v in row if v == 0)
    reward += empty * 0.5
    return reward

The empty cell bonus deserves a moment. In 2048, every valid move spawns a new tile. Once the board is full and no adjacent tiles match, the game is over. The baseline agent, chasing score, has zero direct incentive to keep cells free: any merge looks good from a score perspective, even if the move that produces it leaves the board a minefield. Adding empty * 0.5 gives the agent an explicit incentive to stay alive.
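
To get a feel for how the four terms compare, the sketch below splits the reward into its components for one sample position. The function `reward_terms`, the corner location (bottom-left), and the sample board are all assumptions. The post does not state the values of CORNER_ROW and CORNER_COL, and the monotonicity score is passed in as a number rather than computed.

```python
import math

# Assumption: the target corner is bottom-left; the post does not show these values.
CORNER_ROW, CORNER_COL = 3, 0

def reward_terms(grid, score_delta, mono_score):
    """Return each reward component separately, using the weights from the post."""
    max_val = max(v for row in grid for v in row)
    empty = sum(1 for row in grid for v in row if v == 0)
    return {
        "merge": math.log2(score_delta + 1) if score_delta > 0 else 0.0,
        "corner": math.log2(max_val) if grid[CORNER_ROW][CORNER_COL] == max_val else 0.0,
        "monotonicity": mono_score * 0.1,
        "empty": empty * 0.5,
    }

# Hypothetical position right after merging two 2s (score delta 4).
grid = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [4, 2, 0, 0],
    [512, 64, 8, 2],
]
terms = reward_terms(grid, score_delta=4, mono_score=24)
for name, value in terms.items():
    print(f"{name:13s}{value:.2f}")
# merge        2.32
# corner       9.00
# monotonicity 2.40
# empty        5.00
```

In this position, the corner and empty-cell terms outweigh the merge itself, which suggests the per-step reward is driven more by board shape than by score.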

The Benchmark

The project includes a benchmark runner that pits all saved checkpoints against a simple rule-based bot (always try: up → left → right → down) over N games. The rule-bot is a useful floor — it's dumb but consistent, and any trained agent that can't beat it reliably isn't worth deploying.

# All brains in brains/ vs rule-bot, 100 games each
python benchmark.py

# Specific checkpoints, more games
python benchmark.py --games 500 brains/trained_brain_enhanced.pth

Having a rule-based baseline matters. It forces an honest comparison and reveals the cost of training instability: a partially converged DQN can easily do worse than the rule-bot on max tile distribution even when its average score looks acceptable. If your AI can't beat "always go up first," something is wrong with either the training or the reward.
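
The rule-bot's fixed priority order can be sketched in a few lines, along with the slide-and-merge mechanics needed to tell whether a move changes the board. This is an illustrative reimplementation under the standard 2048 rules, not the project's code. The names `slide_row_left`, `apply_move`, and `rule_bot_move` are assumptions, and tile spawning is left out.

```python
def slide_row_left(row):
    """Slide one row to the left, merging equal neighbours at most once per move."""
    tiles = [v for v in row if v != 0]
    merged = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))

def apply_move(grid, direction):
    """Return a new grid after moving 'up', 'down', 'left' or 'right' (no spawn)."""
    if direction in ("up", "down"):
        grid = [list(col) for col in zip(*grid)]   # transpose: columns become rows
    if direction in ("right", "down"):
        grid = [list(row)[::-1] for row in grid]   # reverse so the target edge is on the left
    grid = [slide_row_left(list(row)) for row in grid]
    if direction in ("right", "down"):
        grid = [row[::-1] for row in grid]
    if direction in ("up", "down"):
        grid = [list(col) for col in zip(*grid)]
    return grid

PRIORITY = ("up", "left", "right", "down")

def rule_bot_move(grid):
    """Pick the first direction in PRIORITY that changes the board, else None."""
    current = [list(row) for row in grid]
    for direction in PRIORITY:
        if apply_move(current, direction) != current:
            return direction
    return None  # no move changes the board: game over

print(rule_bot_move([[2, 2, 0, 0], [0] * 4, [0] * 4, [0] * 4]))    # left
print(rule_bot_move([[2, 4, 8, 16], [0] * 4, [0] * 4, [0] * 4]))   # down
```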

The Lessons

Reward design is the hardest part

The raw score seems like the natural reward for 2048, since it's literally the game's objective. But it's deceptive. It pays for every merge by tile value alone, with no regard for where the merge happens or what it does to the rest of the board. It never penalizes a crowded board, and it says nothing about whether the board state is survivable.

Each component of the enhanced reward fixes a specific failure mode that the baseline agent exhibits. Designing those components required understanding why the baseline was failing, not just that it was failing. The empty cell bonus, in particular, is the kind of thing that's obvious in hindsight and invisible until you watch an agent die because it optimized itself into a corner.

Representation sets the ceiling

This is the same lesson from the Snake AI post: you can't learn what you can't see. The log2 flat encoding keeps every tile value but makes it hard to read. From channels 9 and 8 (the 512 and 256 channels), a CNN can immediately identify that there's a 512 in the bottom-left corner and a 256 adjacent to it: a potential merge target, right there, spatially explicit. From a single float 0.82 at position 12 in a vector of 16, extracting that same spatial relationship requires the MLP to first invert the encoding, then reconstruct which positions are adjacent, then reason about what the values mean.

One-hot encoding isn't a trick. It's making the information legible to the architecture you're using.

Human heuristics encode well into rewards — with caveats

The corner strategy is real and the corner bonus works. But it also creates a rigid target: the agent learns to protect a specific corner, not to generalize the principle of "keep large tiles adjacent and ordered." A sufficiently trained agent could learn the right behavior without the corner reward, if given the monotonicity signal and a good representation. The corner bonus is training wheels — useful for getting early signal, potentially limiting for long-run generalization.

Parallel environments are basically free throughput

Running 8 parallel environments means 8× the experience per gradient step at essentially no extra cost in a CPU-bound, pure-Python game. The experience replay buffer fills faster, episodes that get stuck in degenerate states don't block training, and epsilon-greedy exploration is more diverse. This is a practical engineering detail, not a research insight — but in personal projects where training time is the bottleneck, it matters.
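
A lockstep collection loop of this kind might look roughly like the sketch below. It is not the project's training loop: `ToyEnv` is a stand-in environment, `batched_policy` is a placeholder for a single batched network call, and `collect` is a hypothetical name.

```python
import random

class ToyEnv:
    """Stand-in environment (hypothetical): episodes end after 1 to 5 steps."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.steps_left = self.rng.randint(1, 5)
        self.state = [0.0] * 16
        return self.state

    def step(self, action):
        self.steps_left -= 1
        done = self.steps_left == 0
        return self.state, 1.0, done

def batched_policy(states, rng):
    """Placeholder for one batched forward pass: returns one action per state."""
    return [rng.randrange(4) for _ in states]

def collect(envs, n_steps, seed=0):
    """Step all environments in lockstep and gather transitions for a replay buffer."""
    rng = random.Random(seed)
    states = [env.reset() for env in envs]
    replay = []
    for _ in range(n_steps):
        actions = batched_policy(states, rng)  # one call covers every environment
        for i, (env, action) in enumerate(zip(envs, actions)):
            next_state, reward, done = env.step(action)
            replay.append((states[i], action, reward, next_state, done))
            # A finished episode restarts immediately, so it never stalls the others.
            states[i] = env.reset() if done else next_state
    return replay

envs = [ToyEnv(seed=i) for i in range(8)]
replay = collect(envs, n_steps=10)
print(len(replay))  # 80 transitions from 10 lockstep steps
```

In this sketch, any throughput gain would come from replacing eight separate policy calls with one batched call. The environments themselves are still stepped one after another.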

What's Next

The architectures here are small — the CNN is two 2×2 conv layers. Deeper networks, larger kernels, or attention mechanisms that can explicitly model tile interactions across the full board are obvious next directions. Curriculum learning — starting the agent on boards that already have high tiles and gradually increasing difficulty — might also address the sparse reward problem that emerges when the agent rarely gets to see a 1024, let alone a 2048.

But that's future work. The point of this project was to build intuition for the levers that actually move the needle in RL: what the agent can observe, and what it's incentivized to do.

The project is a 2048 game with a DQN AI trainer, built in Python with PyTorch. It supports GPU (CUDA and Apple Silicon MPS), 8 parallel training environments, visual and text watch modes, and a benchmark runner for comparing checkpoints against a rule-based bot.