PWNagotchi · Volume 6

PWNagotchi Volume 6 — The A2C Reinforcement-Learning Agent

What is and isn't AI here, what the actor and critic networks actually look at, and what the hyperparameters in config.toml mean

The Pwnagotchi is marketed as an “AI device that learns to capture Wi-Fi handshakes.” Vol 1 §6 established that the framing is roughly half-true. This volume makes precise what the agent does, what it doesn’t do, and gives the reader enough hold on the mechanics to (a) read the agent code, (b) tune hyperparameters intelligently, and (c) decide whether to run AI mode at all.

The agent implements Advantage Actor-Critic (A2C), a 2016-vintage on-policy reinforcement-learning algorithm in the actor-critic family. Specifically, the original implementation uses Keras-on-TensorFlow with a small MLP for both actor and critic, on top of the stable-baselines library (jayofelony’s fork has switched to stable-baselines3, which is the PyTorch successor).

The agent’s role is parameter tuning, not strategy invention. The set of attacks (PMKID solicitation; targeted deauth → re-association handshake capture; opportunistic full-channel scan) is fixed in code. The agent learns when and how often to issue each attack against each AP, optimizing a reward signal driven by handshake yield.

This is a legitimate use of reinforcement learning. It is also a relatively gentle problem — the action space is small, the reward is sparse-but-not-pathologically-so, and the state space is well-engineered.

2. RL primer in 90 seconds

Reinforcement learning is the framework where an agent interacts with an environment by issuing actions that change the environment’s state, occasionally receiving a reward signal. The agent’s goal is to learn a policy π(a|s) — a probability distribution over actions conditional on the current state — that maximizes cumulative discounted reward.
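To make "cumulative discounted reward" concrete, here is a minimal sketch that computes the discounted return G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ... for every step of a short episode. The function name and the example rewards are illustrative and are not taken from the Pwnagotchi code.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t for every timestep, computed backwards in a single pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Illustrative episode: three empty steps, then one captured handshake (reward 1).
example = discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5)
print(example)  # [0.125, 0.25, 0.5, 1.0]
```

The earlier steps receive a share of the later reward, shrunk by one factor of γ per step of delay. That is how a policy can learn that hopping to a channel now pays off a few seconds later.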

The four canonical algorithm families:

| Family | Idea | When to use | Pwnagotchi? |
|---|---|---|---|
| Value-based (Q-learning, DQN) | Learn the value of every state-action pair; act greedily | Discrete action spaces; offline-OK | Not used |
| Policy-gradient (REINFORCE) | Learn the policy directly via gradient ascent on expected reward | Continuous actions; online | Foundation of A2C |
| Actor-critic (A2C, A3C, PPO) | Hybrid: the actor outputs the action, the critic estimates the state value, and the advantage is the observed return minus the critic's estimate | Anything; stable; standard go-to since ~2016 | Yes (A2C) |
| Model-based (Dyna, MuZero) | Learn a model of the environment and plan over it | Sample-efficient; harder to implement | Not used |

A2C, specifically:

  • Actor network: outputs a probability distribution over discrete actions (or a parameterized distribution over continuous actions). For Pwnagotchi: discrete — which channel to hop to, how aggressive to be in deauth, etc.
  • Critic network: outputs a scalar estimate of how good it is to be in the current state — V(s).
  • Advantage: A(s,a) = R + γ·V(s′) − V(s), which asks “did this action turn out better than the critic expected the average action would?”
  • Loss for the actor: −log π(a|s) · A(s,a) (push up the probability of actions that turned out better than expected)
  • Loss for the critic: (R + γ·V(s′) − V(s))² (just regress V against bootstrapped returns)
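The three formulas above can be checked by hand for a single transition. The sketch below is plain Python with made-up numbers. It is not the stable-baselines implementation, which works on batches and treats the advantage as a constant when differentiating the actor loss.

```python
import math


def a2c_step_losses(reward, value_s, value_next, prob_action, gamma=0.99):
    """One-step advantage plus actor and critic losses for one transition."""
    td_target = reward + gamma * value_next   # R + gamma * V(s')
    advantage = td_target - value_s           # A(s, a)
    actor_loss = -math.log(prob_action) * advantage
    critic_loss = advantage ** 2
    return advantage, actor_loss, critic_loss


# Hypothetical transition: a handshake was captured (R = 1), the critic
# expected little (V(s) = 0.2), and the chosen action had probability 0.25.
adv, actor, critic = a2c_step_losses(1.0, 0.2, 0.3, 0.25)
print(f"advantage={adv:.3f} actor_loss={actor:.3f} critic_loss={critic:.3f}")
```

A positive advantage gives a positive actor loss, so minimising it raises log π(a|s) for that action. A negative advantage flips the sign and pushes the probability down.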

Both networks are typically small MLPs (multilayer perceptrons); the most efficient implementations share the early layers between them. The Pwnagotchi uses the default MlpPolicy from stable-baselines: two hidden layers of 64 units each, with tanh activations.
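For a sense of scale, the following is a dependency-free sketch of an actor with that shape: two tanh hidden layers of 64 units feeding a softmax over actions. The 57-dimensional input matches the state vector described in §3. The four-way output and the weight initialisation are placeholders chosen for illustration, not the real network.

```python
import math
import random


def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: one weight row per output unit."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def init_layer(n_in, n_out, rng):
    """Small random weights, zero biases (illustrative initialisation)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_forward(state, layers):
    """tanh on every hidden layer, softmax on the output layer."""
    hidden = state
    for weights, biases in layers[:-1]:
        hidden = dense(hidden, weights, biases, math.tanh)
    weights, biases = layers[-1]
    return softmax(dense(hidden, weights, biases))


rng = random.Random(0)
STATE_DIM, HIDDEN, N_ACTIONS = 57, 64, 4  # N_ACTIONS is a placeholder
layers = [init_layer(STATE_DIM, HIDDEN, rng),
          init_layer(HIDDEN, HIDDEN, rng),
          init_layer(HIDDEN, N_ACTIONS, rng)]
probs = actor_forward([0.0] * STATE_DIM, layers)
print(probs)
```

A network of this size has roughly 8,000 weights, which is why inference is cheap even on a small board.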

A conceptual diagram of reinforcement learning — agent emits action, environment returns next state + reward, agent updates its policy. Pwnagotchi's specific A2C implementation is a particular shape of this skeleton with the environment = the local Wi-Fi airspace, actions = bettercap parameters, reward = captured handshakes.

Figure 2.1 — RL agent-environment loop. Via Wikimedia Commons.

3. The state representation

What the Pwnagotchi agent observes — its state vector — is engineered (not raw sensor data). The state is a fixed-length vector built from the current bettercap snapshot. The fields:

| Feature group | Length | Computation |
|---|---|---|
| AP density per channel | 14 (one per 2.4 GHz channel 1-14) | Count of unique BSSIDs seen on each channel in the last N seconds |
| Client density per channel | 14 | Count of unique client MACs seen on each channel |
| Handshake yield per channel | 14 | Captures per minute per channel, smoothed |
| Current mode / state machine | ~8 (one-hot) | What the gotchi is “feeling”: bored, sad, excited, lonely, etc. |
| Time-of-day | ~4 (sin/cos of clock + day-of-week one-hots) | Coarse temporal context; Wi-Fi traffic patterns vary by time |
| Battery state | 2 | Normalized voltage and normalized time since last charge; only with the PiSugar plugin |
| Peer count | 1 | How many pwngrid peers seen in last hour |
| Total | ~57 dims | Concatenated; fed into the actor and critic MLPs |

The state is sampled every ~1 second (the daemon main-loop tick). The agent updates its policy every N samples (N = params.batch_size = 50 by default), accumulating gradient over a small rollout.
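As a rough illustration of how such a vector is assembled, the sketch below concatenates the feature groups from the table and checks their lengths. The function and argument names are hypothetical, and the approximate group sizes (~8 and ~4) are treated as exact so that the total comes out at 57.

```python
def build_state(ap_density, client_density, handshake_yield,
                mood_one_hot, time_features, battery, peer_count):
    """Concatenate the feature groups into one flat observation vector."""
    groups = [
        ("ap_density", ap_density, 14),
        ("client_density", client_density, 14),
        ("handshake_yield", handshake_yield, 14),
        ("mood_one_hot", mood_one_hot, 8),    # assumed exact; the table says ~8
        ("time_features", time_features, 4),  # assumed exact; the table says ~4
        ("battery", battery, 2),
        ("peer_count", [peer_count], 1),
    ]
    state = []
    for name, values, expected in groups:
        if len(values) != expected:
            raise ValueError(f"{name}: expected {expected} values, got {len(values)}")
        state.extend(float(v) for v in values)
    return state


# Example with dummy values: an idle airspace and a single active mood slot.
state = build_state([0] * 14, [0] * 14, [0.0] * 14,
                    [1, 0, 0, 0, 0, 0, 0, 0], [0.0, 1.0, 0.0, 0.0],
                    [0.8, 0.1], 3)
print(len(state))  # 57
```

Checking the lengths up front matters because a fixed-size MLP input silently misaligns every later feature if one group changes size.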

4. The action space

What the agent chooses — the action — is a tuple of bettercap parameters to apply for the next interval:

| Action component | Type | Range / values |
|---|---|---|
| Channel | Discrete | One of (1, 2, 6, 11, …, 14): common 2.4 GHz channels, or “hop all” |
| Deauth aggression | Discrete | (off, gentle, medium, aggressive); controls deauth burst size + cadence |
| PMKID solicitation rate | Discrete | (off, occasional, frequent); how often to fire wifi.assoc per AP |
| Channel dwell time | Continuous (binned) | 1 / 5 / 15 / 60 seconds; how long to stay before reconsidering |
| Quiet listen mode | Boolean | If on, no transmissions for this interval; pure passive listening |

That’s five discrete dimensions, each with 2-15 options. Multiplying them out gives at most 15 × 4 × 3 × 4 × 2 = 1,440 combinations when all 14 channels plus “hop all” are selectable, and a few hundred when the channel list is restricted to the common ones. That is small enough for a tabular Q-learner, so A2C is overkill for the size; what makes actor-critic worthwhile here is the continuous-time tradeoffs (when to switch, how to weight a long dwell against a quick hop).
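The size of the joint action space is easy to verify by enumeration. The sketch below uses the component values from the table. The two channel lists are assumptions, since the table elides the exact set: one takes all 14 channels plus "hop all" (the upper end of the 2-15 range), the other a restricted list of common channels.

```python
from itertools import product

DEAUTH = ("off", "gentle", "medium", "aggressive")
PMKID = ("off", "occasional", "frequent")
DWELL_SECONDS = (1, 5, 15, 60)
QUIET = (False, True)


def enumerate_actions(channels):
    """Every (channel, deauth, pmkid, dwell, quiet) combination."""
    return list(product(channels, DEAUTH, PMKID, DWELL_SECONDS, QUIET))


# Assumed channel sets; the real list lives in the agent code.
all_channels = tuple(range(1, 15)) + ("hop_all",)
common_channels = (1, 6, 11, "hop_all")

full = enumerate_actions(all_channels)
small = enumerate_actions(common_channels)
print(len(full), len(small))  # 1440 384
```

Either way the table of actions fits comfortably in memory, which is the sense in which the problem is small.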

5. The reward signal

The reward function:

   r_t = ( handshakes captured in last second )
       + ( new APs seen in last second × 0.2 )
       + ( new clients seen in last second × 0.1 )
       + ( new pwngrid peers in last second × 0.5 )
       - ( channels with zero activity dwelled-on × 0.05 )

The signal is sparse: on most timesteps the reward is 0. A captured handshake is worth 1, a new AP 0.2, a new client 0.1, and a new peer 0.5. The negative term gently discourages dwelling on dead channels.
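Written as code, the reward formula above is a weighted sum of per-second counters. This is a direct transcription of the formula with hypothetical argument names. It is not a copy of reward.py.

```python
def step_reward(handshakes, new_aps, new_clients, new_peers, dead_channel_dwells):
    """Per-second reward using the weights from the formula above."""
    return (handshakes * 1.0
            + new_aps * 0.2
            + new_clients * 0.1
            + new_peers * 0.5
            - dead_channel_dwells * 0.05)


# A quiet second on a dead channel versus a second with one capture.
print(step_reward(0, 0, 0, 0, 1))
print(step_reward(1, 2, 3, 0, 0))
```

The first call shows the small negative nudge for idling on an empty channel. The second shows how a single handshake dominates the other terms.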

This sparseness is the practical challenge of training the agent. The Pwnagotchi takes weeks to converge in a heterogeneous environment. In a static environment (the same office, every day) the agent converges in days.

6. The hyperparameters in config.toml

The [personality.ai] block in /etc/pwnagotchi/config.toml:

[personality.ai]
enabled = true
path = "/root/brain.nn"
params.policy = "MlpPolicy"            # the only one wired
params.learning_rate = 0.0001
params.batch_size = 50
params.workers = 2                     # parallel rollout workers
params.n_steps = 50                    # steps per rollout before update
params.gamma = 0.99                    # discount factor
params.ent_coef = 0.01                 # entropy bonus (encourages exploration)
params.vf_coef = 0.5                   # critic loss weight
params.max_grad_norm = 0.5             # gradient clipping
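The last three parameters only make sense once you see where they enter the update. The sketch below shows the usual A2C objective (actor loss plus weighted critic loss, minus an entropy bonus) and clipping by global gradient norm. It is a plain-Python illustration with hypothetical function names, not the library's internals.

```python
import math


def entropy(probs):
    """Shannon entropy of an action distribution (higher = more exploratory)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def a2c_total_loss(actor_loss, critic_loss, action_probs,
                   vf_coef=0.5, ent_coef=0.01):
    """Combined loss: vf_coef scales the critic term, ent_coef rewards entropy."""
    return actor_loss + vf_coef * critic_loss - ent_coef * entropy(action_probs)


def clip_by_global_norm(grads, max_grad_norm=0.5):
    """Rescale the whole gradient vector if its L2 norm exceeds the limit."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_grad_norm:
        return list(grads)
    scale = max_grad_norm / norm
    return [g * scale for g in grads]


# Same actor and critic losses, different policies: the uniform policy has
# more entropy and therefore a slightly lower total loss.
uniform = a2c_total_loss(1.0, 0.4, [0.25, 0.25, 0.25, 0.25])
peaked = a2c_total_loss(1.0, 0.4, [0.97, 0.01, 0.01, 0.01])
print(uniform < peaked)  # True
```

Raising ent_coef widens the gap between those two numbers, which is why it counteracts a policy that has collapsed onto one choice.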

Tuning intuition:

| Knob | Higher value → | Lower value → | When to change |
|---|---|---|---|
| learning_rate | Faster but less stable updates | Slower, more stable | Default fine. If you see large oscillations in capture rate, halve. |
| batch_size | Smoother updates | Faster updates, noisier | Default fine. Larger if you’ve got a lot of training time. |
| gamma | Cares about far-future reward | Cares about immediate reward | Default 0.99 is right for capture; lower if you want it to chase short-term yield aggressively |
| ent_coef | More exploration | Greedier exploitation | Increase if the agent gets stuck in a parameter rut |
| n_steps | More data per update | More frequent updates | Default fine |
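The gamma row is easier to reason about as a time horizon. A common rule of thumb is that rewards matter over roughly 1 / (1 − γ) steps. The sketch below computes that, along with the number of steps after which a reward's weight falls below 1%. The helper names are illustrative.

```python
import math


def effective_horizon(gamma):
    """Rule-of-thumb horizon in steps: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def steps_until_weight_below(gamma, threshold=0.01):
    """Smallest k such that gamma**k < threshold."""
    return math.floor(math.log(threshold) / math.log(gamma)) + 1


for gamma in (0.9, 0.99):
    print(gamma,
          round(effective_horizon(gamma)),
          steps_until_weight_below(gamma))
```

With the ~1-second tick described in §3, the default of 0.99 corresponds to a horizon of about 100 seconds, while 0.9 shortens it to about 10 seconds, which is the "chase short-term yield" behaviour in the table.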

Honestly, for most users the defaults work. The hyperparameters were tuned by the community over thousands of deployments. Don’t fiddle unless you know what behavior change you’re trying to engineer.

7. The “is this really AI” debate, settled

It’s RL, and RL is AI under any reasonable definition. The agent learns from interaction with the environment, it generalizes (within the limits of its small network), it converges to a policy that is measurably better than random within ~weeks of deployment, and it transfers somewhat across environments (a trained brain.nn that has seen a busy office does better in a busy cafe than a freshly-initialized agent).

It is not “AI” in the sense the term is used in 2026 marketing — there is no large pre-trained model, no language understanding, no transformer, no foundation model. It is a small, classical RL agent with a hand-engineered state space, doing a narrow task.

The question that actually matters is: does it beat a well-tuned static config? The community evidence is:

  • AUTO mode (no RL — fixed-tuning autonomous): captures ~70-80% as many handshakes per day as AI mode in a busy environment.
  • MANU mode (operator-driven static config tuned for the local environment): captures roughly the same as AI mode if the operator tunes once.
  • The AI mode wins when the environment changes — the device moves between locations, between busy and quiet, day to night.

Net: AI mode is worth it if your gotchi moves around and the environment varies. AUTO mode is fine for a stationary gotchi.

8. Reading the agent code

The agent code lives at /usr/local/share/pwnagotchi/pwnagotchi/ai/ in the jayofelony install:

ai/
├── __init__.py             # AI / non-AI mode dispatch
├── agent.py                # the core agent class — wraps stable-baselines
├── featurizer.py           # the state-vector construction (§3 above)
├── reward.py               # the reward function (§5 above)
├── parameter.py            # the action-space definition (§4 above)
└── train.py                # the train() loop called from the daemon

The whole agent is ~600 lines of Python spread across these files. Most of it is glue: wrapping stable-baselines3’s A2C class, marshalling the bettercap RPC response into the state vector, and mapping actions back to bettercap commands. The actual RL math is a single call into stable_baselines3.

Patterns you’ll want to know if you read or modify:

  • The agent runs in the main daemon process — there is no separate AI worker. Inference + training happen on the Pi Zero’s CPU (which on the Zero 2 W is a 4-core A53; fast enough).
  • Model saves happen every 10 minutes by default (configurable). The model file is /root/brain.nn, a pickled stable-baselines3 model a few MB in size.
  • The agent does not continue training during a software upgrade — when the daemon restarts it loads the saved model and resumes inference, but doesn’t auto-train across a model-version boundary. If you upgrade jayofelony from an older to newer release that changes the network architecture, you must delete /root/brain.nn and start fresh.

9. Failure modes

The agent can go wrong in a few ways:

| Pathology | Symptom | Cause | Fix |
|---|---|---|---|
| Stuck on one channel | Captures stop growing despite a busy environment | Local maximum in the policy: the agent thinks channel 6 is “always the answer” | Increase ent_coef, or delete /root/brain.nn and re-train fresh |
| Constantly hopping | Captures low; the gotchi looks frenetic | The critic underestimates the value of dwelling, so the agent never sees enough reward to commit | Increase n_steps, lower learning_rate |
| Trains to “do nothing” | Capture rate ~0 despite a busy environment | Reward function imbalance: the quiet-channel penalty is being underestimated and the agent chooses quiet-listen mode constantly | Tune the reward weights in /usr/local/share/pwnagotchi/pwnagotchi/ai/reward.py (you have to fork; this isn’t surfaced in config.toml) |
| brain.nn corruption | Daemon crashes at load with a pickle error | SD card corruption; a damaged pickled model cannot be repaired | Delete /root/brain.nn |
| Training crashes on Pi Zero W | Out-of-memory at agent.learn() call | The original Zero W has 512 MB of RAM and the stable-baselines3 training step doesn’t fit | Use MANU/AUTO mode, or try a Pi Zero 2 W (it has the same 512 MB of RAM, so the gain is mostly CPU) |
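For the brain.nn corruption row, a defensive loader avoids the crash loop. The sketch below is a hypothetical helper, not part of the Pwnagotchi code. It takes whatever load function the agent normally uses as the `loader` argument, and on failure it moves the bad file aside instead of deleting it, so it can still be inspected later.

```python
import os
import time


def load_brain_or_reset(path, loader):
    """Try loader(path); if it fails, rename the file and return None.

    Returning None signals the caller to start from a fresh model.
    """
    if not os.path.exists(path):
        return None
    try:
        return loader(path)
    except Exception as exc:  # a corrupt file can raise many error types
        backup = f"{path}.corrupt-{int(time.time())}"
        os.replace(path, backup)
        print(f"could not load {path} ({exc!r}); moved it to {backup}")
        return None
```

A caller would pass the saved-model path and its usual load function, then fall back to building a new agent whenever the result is None.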

10. Should you train, or load a community model?

The community has historically shared trained brain.nn files — “here’s the brain of a Pwnagotchi that ran in a Berlin coffee shop for six months.” These are interesting as artifacts but not very useful as starting points, because the policy is environment-specific. A coffee-shop-trained brain in a suburban garage will mostly act lost for the first week.

The recommended path:

  • Start fresh — let the agent initialize randomly.
  • Run AI mode for ~2 weeks in your typical environment.
  • If after 2 weeks the capture rate is below what AUTO mode achieves in the same spot, your environment isn’t varied enough to benefit from RL — switch to AUTO.
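The two-week comparison in the last step can be reduced to a simple rule. This sketch assumes you have logged daily handshake counts for both modes in the same location. The function name and the sample numbers are invented for illustration.

```python
from statistics import mean


def recommend_mode(ai_daily_captures, auto_daily_captures):
    """Compare average daily captures and suggest which mode to keep."""
    if not ai_daily_captures or not auto_daily_captures:
        raise ValueError("need at least one day of data for each mode")
    if mean(ai_daily_captures) >= mean(auto_daily_captures):
        return "AI"
    return "AUTO"


# Invented example: AI mode underperforms AUTO in a static environment.
print(recommend_mode([4, 5, 3, 6], [6, 7, 5, 6]))  # AUTO
```

Daily counts are noisy, so a comparison over only a few days can easily flip. The two-week window in the recommendation is there to average that out.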

11. The PPO option (advanced)

Stable-baselines3 also ships PPO (Proximal Policy Optimization) — a more modern, more sample-efficient algorithm in the same actor-critic family. Some community forks have swapped A2C for PPO. The change is small (a one-line constructor change). Anecdotal reports: PPO converges faster and to a marginally better policy in noisy environments, at the cost of slightly higher per-step CPU.

If you’re modifying the agent, PPO is the obvious modern alternative. For 95% of users, the default A2C is fine.
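A sketch of what that swap might look like: both `A2C` and `PPO` in stable-baselines3 take a policy name and an environment as their first two arguments and share the hyperparameters listed below, so the algorithm can be chosen by name. The `build_agent` wrapper and the idea of passing the config values as a dict are assumptions for illustration. The real agent.py may be organised differently.

```python
import importlib

# Keyword arguments accepted by both the A2C and PPO constructors.
SHARED_PARAMS = ("learning_rate", "n_steps", "gamma",
                 "ent_coef", "vf_coef", "max_grad_norm")


def shared_kwargs(config_params):
    """Keep only the hyperparameters that both algorithms understand."""
    return {k: config_params[k] for k in SHARED_PARAMS if k in config_params}


def build_agent(algo_name, env, config_params):
    """Construct stable_baselines3.A2C or stable_baselines3.PPO by name."""
    if algo_name not in ("A2C", "PPO"):
        raise ValueError(f"unsupported algorithm: {algo_name}")
    # Imported lazily so this module loads even where SB3 is not installed.
    sb3 = importlib.import_module("stable_baselines3")
    algo_cls = getattr(sb3, algo_name)
    return algo_cls("MlpPolicy", env, **shared_kwargs(config_params))
```

Filtering the parameters matters because the two constructors are not identical. For example, PPO accepts a `batch_size` argument and A2C does not, so passing the whole config block through unfiltered would fail for one of them.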

12. Cheatsheet updates from this volume

Items to roll into Vol 12 (laminate-ready cheatsheet):

  • “AI mode = A2C in stable-baselines3, tuning bettercap params.” (§1, §2)
  • “State vector ≈ 57 dims (channel densities + handshake yields + temporal context).” (§3)
  • “Action ≈ (channel, deauth aggression, PMKID rate, dwell time, quiet mode).” (§4)
  • “Train fresh — community brains are environment-specific.” (§10)
  • “AI > AUTO if the gotchi moves around. AUTO is fine if it stays put.” (§7)
  • “If stuck: increase ent_coef in [personality.ai].” (§9)