PWNagotchi · Volume 6
PWNagotchi Volume 6 — The A2C Reinforcement-Learning Agent
What is and isn't AI here, what the actor and critic networks actually look at, and what the hyperparameters in config.toml mean
The Pwnagotchi is marketed as an “AI device that learns to capture Wi-Fi handshakes.” Vol 1 §6 established that the framing is roughly half-true. This volume makes precise what the agent does, what it doesn’t do, and gives the reader enough hold on the mechanics to (a) read the agent code, (b) tune hyperparameters intelligently, and (c) decide whether to run AI mode at all.
The agent implements Advantage Actor-Critic (A2C), a 2016-vintage on-policy reinforcement-learning algorithm in the actor-critic family. Specifically, the original implementation uses Keras-on-TensorFlow with a small MLP for both actor and critic, on top of the stable-baselines library (jayofelony’s fork has switched to stable-baselines3, which is the PyTorch successor).
The agent’s role is parameter tuning, not strategy invention. The set of attacks (PMKID solicitation; targeted deauth → re-association handshake capture; opportunistic full-channel scan) is fixed in code. The agent learns when and how often to issue each attack against each AP, optimizing a reward signal driven by handshake yield.
This is a legitimate use of reinforcement learning. It is also a relatively gentle problem — the action space is small, the reward is sparse-but-not-pathologically-so, and the state space is well-engineered.
2. RL primer in 90 seconds
Reinforcement learning is the framework where an agent interacts with an environment by issuing actions that change the environment’s state, occasionally receiving a reward signal. The agent’s goal is to learn a policy π(a|s) — a probability distribution over actions conditional on the current state — that maximizes cumulative discounted reward.
The four canonical algorithm families:
| Family | Idea | When to use | Pwnagotchi? |
|---|---|---|---|
| Value-based (Q-learning, DQN) | Learn the value of every state-action pair; act greedily | Discrete action spaces; offline-OK | Not used |
| Policy-gradient (REINFORCE) | Learn the policy directly via gradient ascent on expected reward | Continuous actions; online | Foundation of A2C |
| Actor-critic (A2C, A3C, PPO) | Hybrid — actor outputs action, critic estimates state value, advantage = reward - critic | Anything; stable; standard go-to since ~2016 | Yes — A2C |
| Model-based (Dyna, MuZero) | Learn a model of the environment + plan over it | Sample-efficient; harder to implement | Not used |
A2C, specifically:
- Actor network: outputs a probability distribution over discrete actions (or a parameterized distribution over continuous actions). For Pwnagotchi: discrete — which channel to hop to, how aggressive to be in deauth, etc.
- Critic network: outputs a scalar estimate of how good it is to be in the current state — V(s).
- Advantage: A(s,a) = R + γ·V(s’) − V(s) — “did this action turn out better than the critic expected the average action would?”
- Loss for the actor: −log π(a|s) · A(s,a) (push up the probability of actions that turned out better than expected)
- Loss for the critic: (R + γ·V(s’) − V(s))² (just regress V against bootstrapped returns)
Both networks are typically small MLPs (multilayer perceptrons) with shared early layers in the most efficient implementations. The Pwnagotchi uses Keras’ default MlpPolicy from stable-baselines — two hidden layers, 64 units each, tanh activations.

Figure 2.1 — RL agent-environment loop. Via Wikimedia Commons.
3. The state representation
What the Pwnagotchi agent observes — its state vector — is engineered (not raw sensor data). The state is a fixed-length vector built from the current bettercap snapshot. The fields:
| Feature group | Length | Computation |
|---|---|---|
| AP density per channel | 14 (one per 2.4 GHz channel 1-14) | Count of unique BSSIDs seen on each channel in the last N seconds |
| Client density per channel | 14 | Count of unique client MACs seen on each channel |
| Handshake yield per channel | 14 | Captures per minute per channel, smoothed |
| Current mode / state machine | ~8 (one-hot) | What the gotchi is “feeling” — bored, sad, excited, lonely, etc. |
| Time-of-day | ~4 (sin/cos of clock + day-of-week one-hots) | Coarse temporal context — Wi-Fi traffic patterns vary by time |
| Battery state | 2 | (Voltage normalized, time-since-last-charge normalized) — only if PiSugar plugin |
| Peer count | 1 | How many pwngrid peers seen in last hour |
| Total | ~57 dims | Concatenated; fed into the actor and critic MLPs |
The state is sampled every ~1 second (the daemon main-loop tick). The agent updates its policy every N samples (N = params.batch_size = 50 by default), accumulating gradient over a small rollout.
4. The action space
What the agent chooses — the action — is a tuple of bettercap parameters to apply for the next interval:
| Action component | Type | Range / values |
|---|---|---|
| Channel | Discrete | One of (1, 2, 6, 11, …, 14) — common 2.4 GHz channels, or “hop all” |
| Deauth aggression | Discrete | (off, gentle, medium, aggressive) — controls deauth burst size + cadence |
| PMKID solicitation rate | Discrete | (off, occasional, frequent) — how often to fire wifi.assoc per AP |
| Channel dwell time | Continuous (binned) | 1 / 5 / 15 / 60 seconds — how long to stay before reconsidering |
| Quiet listen mode | Boolean | If on, no transmissions for this interval — pure passive listening |
That’s ~5 discrete dimensions, each with 2-15 options. Total action space size ≈ a few hundred unique actions. Small enough for a tabular Q-learner; the A2C is overkill for the size — but the continuous-time tradeoffs (when to switch, how to weight a long dwell vs a quick hop) are what make actor-critic worthwhile here.
5. The reward signal
The reward function:
r_t = ( handshakes captured in last second )
+ ( new APs seen in last second × 0.2 )
+ ( new clients seen in last second × 0.1 )
+ ( new pwngrid peers in last second × 0.5 )
- ( channels with zero activity dwelled-on × 0.05 )
Sparse. Most timesteps the reward is 0. A captured handshake — reward of 1. A new AP seen — 0.2. A new peer — 0.5. The negative term gently discourages dwelling on dead channels.
This sparseness is the practical challenge of training the agent. The Pwnagotchi takes weeks to converge in a heterogeneous environment. In a static environment (the same office, every day) the agent converges in days.
6. The hyperparameters in config.toml
The [personality.ai] block in /etc/pwnagotchi/config.toml:
[personality.ai]
enabled = true
path = "/root/brain.nn"
params.policy = "MlpPolicy" # the only one wired
params.learning_rate = 0.0001
params.batch_size = 50
params.workers = 2 # parallel rollout workers
params.n_steps = 50 # steps per rollout before update
params.gamma = 0.99 # discount factor
params.ent_coef = 0.01 # entropy bonus (encourages exploration)
params.vf_coef = 0.5 # critic loss weight
params.max_grad_norm = 0.5 # gradient clipping
Tuning intuition:
| Knob | Higher value → | Lower value → | When to change |
|---|---|---|---|
learning_rate | Faster but less stable updates | Slower, more stable | Default fine. If you see large oscillations in capture rate, halve. |
batch_size | Smoother updates | Faster updates, noisier | Default fine. Larger if you’ve got a lot of training time. |
gamma | Cares about far-future reward | Cares about immediate reward | Default 0.99 is right for capture; lower if you want it to chase short-term yield aggressively |
ent_coef | More exploration | Greedier exploitation | Increase if the agent gets stuck in a parameter rut |
n_steps | More data per update | More frequent updates | Default fine |
Honestly, for most users the defaults work. The hyperparameters were tuned by the community over thousands of deployments. Don’t fiddle unless you know what behavior change you’re trying to engineer.
7. The “is this really AI” debate, settled
It’s RL, and RL is AI under any reasonable definition. The agent learns from interaction with the environment, it generalizes (within the limits of its small network), it converges to a policy that is measurably better than random within ~weeks of deployment, and it transfers somewhat across environments (a trained brain.nn that has seen a busy office does better in a busy cafe than a freshly-initialized agent).
It is not “AI” in the sense the term is used in 2026 marketing — there is no large pre-trained model, no language understanding, no transformer, no foundation model. It is a small, classical RL agent with a hand-engineered state space, doing a narrow task.
The question that actually matters is: does it beat a well-tuned static config? The community evidence is:
- AUTO mode (no RL — fixed-tuning autonomous): captures ~70-80% as many handshakes per day as AI mode in a busy environment.
- MANU mode (operator-driven static config tuned for the local environment): captures roughly the same as AI mode if the operator tunes once.
- The AI mode wins when the environment changes — the device moves between locations, between busy and quiet, day to night.
Net: AI mode is worth it if your gotchi moves around and the environment varies. AUTO mode is fine for a stationary gotchi.
8. Reading the agent code
The agent code lives at /usr/local/share/pwnagotchi/pwnagotchi/ai/ in the jayofelony install:
ai/
├── __init__.py # AI / non-AI mode dispatch
├── agent.py # the core agent class — wraps stable-baselines
├── featurizer.py # the state-vector construction (§3 above)
├── reward.py # the reward function (§5 above)
├── parameter.py # the action-space definition (§4 above)
└── train.py # the train() loop called from the daemon
The whole agent is ~600 lines of Python spread across these files. Most is glue — wrapping stable-baselines3’s PPO/A2C class, marshalling the bettercap RPC response into the state vector, mapping actions back to bettercap commands. The actual RL math is a single call into stable_baselines3.
Patterns you’ll want to know if you read or modify:
- The agent runs in the main daemon process — there is no separate AI worker. Inference + training happen on the Pi Zero’s CPU (which on the Zero 2 W is a 4-core A53; fast enough).
- Model saves happen every 10 minutes by default (configurable). The model file is
/root/brain.nn— a pickled stable-baselines3 model, ~few MB. - The agent does not continue training during a software upgrade — when the daemon restarts it loads the saved model and resumes inference, but doesn’t auto-train across a model-version boundary. If you upgrade jayofelony from an older to newer release that changes the network architecture, you must delete
/root/brain.nnand start fresh.
9. Failure modes
The agent can pathology in a few ways:
| Pathology | Symptom | Cause | Fix |
|---|---|---|---|
| Stuck on one channel | Captures stop growing despite a busy environment | Local maximum in the policy — agent thinks channel 6 is “always the answer” | Increase ent_coef, or delete /root/brain.nn and re-train fresh |
| Constantly hopping | Captures low; the gotchi looks frenetic | Critic underestimating value of dwell — agent never sees enough reward to commit | Increase n_steps, lower learning_rate |
| Trains to “do nothing” | Capture rate ~0 despite a busy environment | Reward function imbalance — quiet-channel penalty is being underestimated and agent chooses quiet-listen mode constantly | Tune the reward weights in /usr/local/share/pwnagotchi/pwnagotchi/ai/reward.py (you have to fork; this isn’t surfaced in config.toml) |
| brain.nn corruption | Daemon crashes at load with a pickle error | SD card corruption + the pickled model is unfixable | Delete /root/brain.nn |
| Training crashes on Pi Zero W | Out-of-memory at agent.learn() call | The original Zero W has 512 MB and the stable-baselines3 training step doesn’t fit | Upgrade to Pi Zero 2 W, or use MANU/AUTO mode |
10. Should you train, or load a community model?
The community has historically shared trained brain.nn files — “here’s the brain of a Pwnagotchi that ran in a Berlin coffee shop for six months.” These are interesting as artifacts but not very useful as starting points, because the policy is environment-specific. A coffee-shop-trained brain in a suburban garage will mostly act lost for the first week.
The recommended path:
- Start fresh — let the agent initialize randomly.
- Run AI mode for ~2 weeks in your typical environment.
- If after 2 weeks the capture rate is below what AUTO mode achieves in the same spot, your environment isn’t varied enough to benefit from RL — switch to AUTO.
11. The PPO option (advanced)
Stable-baselines3 also ships PPO (Proximal Policy Optimization) — a more modern, more sample-efficient algorithm in the same actor-critic family. Some community forks have swapped A2C for PPO. The change is small (a one-line constructor change). Anecdotal reports: PPO converges faster and to a marginally better policy in noisy environments, at the cost of slightly higher per-step CPU.
If you’re modifying the agent, PPO is the obvious modern alternative. For 95% of users, the default A2C is fine.
12. Cheatsheet updates from this volume
Items to roll into Vol 12 (laminate-ready cheatsheet):
- “AI mode = A2C in stable-baselines3, tuning bettercap params.” (§1, §2)
- “State vector ≈ 57 dims (channel densities + handshake yields + temporal context).” (§3)
- “Action ≈ (channel, deauth aggression, PMKID rate, dwell time, quiet mode).” (§4)
- “Train fresh — community brains are environment-specific.” (§10)
- “AI > AUTO if the gotchi moves around. AUTO is fine if it stays put.” (§7)
- “If stuck: increase ent_coef in [personality.ai].” (§9)