Architecture & Build

AI Capability Toolkit

Spring 2026

BIG Tools
BIG Tools is a toolkit for developing ideas under AI-augmented load — one Known-Knowns gate plus seven tools that govern how AI verifies and enhances user-generated work, never replaces it.

I designed and built it as a direct answer to the Evaluation Crisis Spiral: the failure mode where AI generation outpaces human evaluation capacity, and confident-but-wrong outputs slip through review. The architecture inverts the usual arrangement — humans generate, AI verifies — and the seven tools each govern one cognitive move on top of that inversion.

The Challenge

How this got built

The Six Laws didn't come from theory first. They came from six months of building DevLex, a personal knowledge system I fed articles, peer journals, podcasts, anything I might naturally read about my own work, and asked one question: given all of this, what does the data suggest I do?

DevLex became a colleague that read meeting transcripts, a friend that gave honest support, and most importantly a mentor that knew my prior project work, understood my present context, and helped me plan the next move. The Six Laws were what fell out of that practice, articulated only after enough sessions to see the patterns. BIG Tools is the architecture that operationalizes those laws on a specific class of work: helping someone develop an idea without flattening it.

The Evaluation Crisis Spiral

The dominant pattern in "AI productivity" tools is the same shape every time: AI generates, human reviews. Under load, that arrangement collapses. Generation accelerates. Review capacity does not. The gap fills with confident-but-wrong outputs that pass scrutiny because nobody has time to actually check them.

I documented this failure mode across nine domains — education, peer review, code review, content moderation, clinical decision support, legal discovery, and others — while synthesizing the Six Laws of AI-Era Software Engineering. The pattern repeats because the architecture is wrong, not because the model is wrong.

I built the architecture against this spiral because I myself keep falling into it. The laws are easy to articulate and hard to internalize.

What Actually Breaks

  • Generation outpaces evaluation. The reviewer becomes a button-presser, not a judge.
  • Context degrades under speed. Tacit knowledge — the situation-specific understanding that makes review meaningful — never enters the system.
  • Authorship erodes. The output is the model's; the human signs off. Provenance evaporates.

“What if we inverted the arrangement? Humans generate. AI verifies. The architecture is built to make that the only thing the system can do.”

— The design hypothesis behind BIG Tools

The Architecture

BIG Tools is structured around three architectural decisions that, together, force the inversion. None of them are about which model to use. All of them are about where attention is allowed to go.

Gate as Priority

Every session is shaped around the Known-Knowns gate — three questions on what the user already understands in head, gut, and heart. The gate is the architectural priority: tools surface its prompt before any other affordance, the launcher highlights it as the entry move, and tool prompts inject filled responses as user-stated context. But it is a guiding principle expressed in the UI, not a wall enforced in the API. The first version did enforce it as a hard wall, and in practice that rendered the tools useless: second-order flows that needed to stand up an idea before they could ask the user about it had nothing to grip. The gate was lifted from the code level and renamed from Tacit-Context to Known-Knowns. The architectural priority survived; the coercion didn't. The walk-back made the principle stronger by making it structural rather than enforced.
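The gate-as-priority pattern reduces to a small data shape: the three Known-Knowns answers are optional, and when present they are injected into every tool prompt as user-stated context rather than checked as a precondition. This is a minimal sketch; the class and function names are illustrative, not the toolkit's actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KnownKnowns:
    # Three questions: what the user already understands in head, gut, and heart.
    head: Optional[str] = None
    gut: Optional[str] = None
    heart: Optional[str] = None

    def filled(self) -> bool:
        return any([self.head, self.gut, self.heart])

def build_tool_prompt(tool_instruction: str, gate: KnownKnowns) -> str:
    """Inject filled gate responses as user-stated context.

    The gate is a priority, not a wall: an unfilled gate never blocks the
    tool, it just leaves the context section out of the prompt.
    """
    if not gate.filled():
        return tool_instruction
    context = "\n".join(
        f"- {label}: {value}"
        for label, value in [("head", gate.head), ("gut", gate.gut), ("heart", gate.heart)]
        if value
    )
    return f"User-stated context (Known-Knowns):\n{context}\n\n{tool_instruction}"
```

Because the gate is expressed as prompt shaping rather than an API guard, deleting or skipping it degrades output quality without breaking any flow, which is exactly the failure mode the field test surfaced.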

Tools, Any Order

After the gate, the seven tools operate any-order on the same shared context. There is no fixed sequence and no hidden state machine. The user picks the next move based on what the work needs.
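"No fixed sequence and no hidden state machine" can be sketched as a flat registry of named functions over one shared context, where the only sequencing logic in the system is the list the user supplies. All names here are assumptions for illustration.

```python
from typing import Callable, Dict, List

# Shared session context every tool reads from and writes into.
SessionContext = Dict[str, object]

# A tool is just a named function over the shared context; there is no
# state machine deciding what comes next.
Tool = Callable[[SessionContext], str]

TOOLS: Dict[str, Tool] = {}

def register(name: str):
    def wrap(fn: Tool) -> Tool:
        TOOLS[name] = fn
        return fn
    return wrap

@register("reframe")
def reframe(ctx: SessionContext) -> str:
    return f"reframed: {ctx.get('idea', '')}"

@register("falsification")
def falsification(ctx: SessionContext) -> str:
    return f"tested: {ctx.get('idea', '')}"

def run_session(ctx: SessionContext, order: List[str]) -> List[str]:
    """The user supplies the order; the system only dispatches."""
    return [TOOLS[name](ctx) for name in order]
```

Two users with the same context and different orderings produce different traces, which is the point: orchestration stays with the user.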

Tracker as Spine

Every tool invocation lands in the Tracker. The trail through the work — what was tried, what was kept, why — is itself a deliverable. Provenance is a first-class output, not an afterthought.
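A minimal sketch of the Tracker as append-only spine, assuming nothing about the real storage layer: entries are never edited or reordered, so the order of operations is recoverable as a deliverable in its own right.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrackerEntry:
    tool: str
    note: str   # what was tried, what was kept, why

@dataclass
class Tracker:
    # Append-only: record() is the only mutation, so the trail through
    # the work is recoverable later in order of operations.
    entries: List[TrackerEntry] = field(default_factory=list)

    def record(self, tool: str, note: str) -> None:
        self.entries.append(TrackerEntry(tool, note))

    def trail(self) -> List[str]:
        """Render the provenance trail as a numbered deliverable."""
        return [f"{i + 1}. {e.tool}: {e.note}" for i, e in enumerate(self.entries)]
```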

Architecture diagram: Known-Knowns gate → seven tools (any-order) → Tracker spine

[Diagram export pending from talk deck]

Why this shape

The conventional "AI assistant" pattern is a conversational front end over a chain. The user types. The model generates. The user evaluates. The chain is hidden; the context is implicit; the provenance is lost the moment the session ends.

BIG Tools makes every step visible. The gate forces tacit context into the open. The seven tools each govern one cognitive move and produce a named artifact. The Tracker accumulates those artifacts in order of operations, so the shape of the work is recoverable later — by the user, by a collaborator, or by a future reviewer auditing how a decision actually got made.

The Seven Tools

Each tool governs exactly one cognitive move. They are composable, replaceable, and ordered by the user — not by the system. The list is small on purpose; the architecture would break with twenty.

The seven tools sort cleanly by cognitive function

  • Reflect: Known-Knowns
  • Generate: Reframe, Mixtape
  • Constrain: Alignment
  • Explore: Collaborate
  • Critique: Falsification, Friction-on-Demand

The Tracker spans all of them as the persistence spine. Composition is by cognitive move, not by sequence — which is what "any-order" actually means. Two users with the same problem will land on different orderings because they need different moves at different times.

Reframe

Five d.school moves on the user's idea: Focus (narrow to the smallest, sharpest version), Feel (find the emotional core), Challenge (name the load-bearing assumption to test), Borrow (steal a move from an adjacent field), Flip (invert the assumption you didn't know you were making). The user writes their own per-move response. AI offers alternative angles only when asked, and only after the user's own pass. AI enhances; it does not generate.

Cognitive move: shift the frame on something the user already understands.
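The user-first discipline Reframe enforces can be sketched as a guard: the move is invalid without the user's own pass, and the AI alternative is only computed on explicit request. Names and the move list's encoding are illustrative, not the tool's real interface.

```python
from typing import Callable, Dict, Optional

# The five d.school moves Reframe offers.
MOVES = ["Focus", "Feel", "Challenge", "Borrow", "Flip"]

def reframe_move(
    move: str,
    user_response: Optional[str],
    ai_suggest: Callable[[str], str],
    ask_ai: bool = False,
) -> Dict[str, str]:
    """AI enhances only after the user's own pass, and only on request."""
    if move not in MOVES:
        raise ValueError(f"unknown move: {move}")
    if not user_response:
        # The guard that makes "AI enhances; it does not generate" structural.
        raise ValueError("the user's own pass comes first; AI never generates it")
    result = {"move": move, "user": user_response}
    if ask_ai:
        result["ai_alternative"] = ai_suggest(user_response)
    return result
```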

Field Test: Brown Practicum

BIG Tools went into real use during the Spring 2026 cross-institutional Embodied Brain Technology Practicum at Brown's Carney Institute. Twenty-four students from five institutions used the toolkit to develop ambitious project ideas under live faculty review.

  • 24 students using BIG Tools end-to-end
  • 5 institutions: Brown, Ben-Gurion, MIT, Rochester, CMU
  • 2026-04-29 class session: BIG Tools and the Six Laws introduced to the cohort

Practicum cohort photograph at Carney Institute, Spring 2026

[Image to be added]

What the toolkit had to handle

  • Student-facing mode for individual idea development.
  • Facilitator mode for faculty reviewing the trail through a student's work.
  • Cross-institutional sessions with faculty and neuroscientists from five universities.
  • Provenance under faculty review, not just under the student's own gaze.

What field use revealed

  • The gate is the load-bearing decision. Sessions that bypassed it produced flat work; sessions that engaged it produced authored work.
  • Tool order varied wildly. No two students used the same sequence. Forcing a fixed order would have broken the toolkit.
  • The Tracker became the deliverable faculty actually wanted to read — more than the polished output.
  • Three tools from the v0 set had to be removed. See Honest Engineering, below.

Tracker view from a student session

[Screenshot to be added]

Mixtape tool: cross-domain borrowing in action

[Screenshot to be added]

The protocol parallel

The 4/29 class session closed Law 2 by reading a question I'd posed to Grace Zheng and Xin Jin — the founders of PerturbAI, neuroscientists at Scripps Research — on an OpenAI Forum panel. The question was about how to handle LLMs that "get over their skis," pushing into areas beyond human understanding where outputs aren't necessarily hallucinations but might be novel. Their answer: don't suppress novelty; put it inside a rigorous validation loop. Surprising outputs are hypotheses, not conclusions.

That protocol — separation of generation from evaluation, validation as a structured loop — is the same shape BIG Tools is built around at idea-scale. Per-idea provenance (Known-Knowns plus Tracker), separated tool runs (the any-order architecture), independent verification (Falsification, Friction-on-Demand). Different domains; same protocol shape. Three practitioners arriving at the same architecture from neuroscience, group facilitation, and AI-assisted idea work is the non-trivial signal that the shape is general, not domain-specific.

Six Laws as Theoretical Spine

The Six Laws of AI-Era Software Engineering are not marketing copy bolted onto the toolkit. Each law is observable in the architecture; each shaped a specific design decision. The toolkit is the laws made operational.

Read the full Six Laws framework →

Law 1 — Context Is the Universal Bottleneck

v1.7

How it shaped the toolkit: The Known-Knowns gate is the architectural priority because context is the load-bearing input, not the prompt. The gate surfaces first in every tool and the launcher highlights it as the entry move, so even when it is unfilled, every downstream surface is shaped around the question of what the user already understands.

Law 2 — Human Judgment Remains the Integration Layer

v1.7

How it shaped the toolkit: The inversion (humans generate, AI verifies) is this law made structural. The user authors; the tools test, reframe, or pressure. The model never has the final word.

Law 3 — Architecture Matters More Than Model Selection

v1.6

How it shaped the toolkit: Each tool is model-agnostic. Swap the LLM behind any tool and the architecture holds. The gate-first / any-order / Tracker-spine shape is what survives.

A concrete example from this week: Collaborate originally asked the LLM to recall arXiv papers from training data, with a prompt-level guard against hallucination ("don't invent papers; say so explicitly if you can't find real ones"). The guard worked when the model was honest about its gaps and failed when it wasn't. The fix was structural, not prompt-level: replace the LLM recall step with a direct call to arXiv's Atom API. Now every returned paper has a real arXiv ID and DOI by construction. There is no recall step to fabricate from. The architecture eliminated a class of failure that the prompt could only pattern-match against.
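The structural fix can be sketched in two small steps: build the query against arXiv's public export endpoint, then parse the returned Atom feed, so every result carries a feed-supplied identifier rather than a model-recalled one. The endpoint URL and Atom namespace are real; the function names and the fields kept are illustrative, and error handling is omitted.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# arXiv's public API endpoint; responses are Atom 1.0 feeds.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM_NS = "{http://www.w3.org/2005/Atom}"

def build_query_url(terms: str, max_results: int = 5) -> str:
    """Construct the search URL; no LLM recall step anywhere."""
    params = {"search_query": f"all:{terms}", "max_results": str(max_results)}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"

def parse_entries(atom_xml: str):
    """Every entry comes from the feed, so its id exists by construction."""
    root = ET.fromstring(atom_xml)
    return [
        {
            "id": entry.findtext(f"{ATOM_NS}id"),
            "title": (entry.findtext(f"{ATOM_NS}title") or "").strip(),
        }
        for entry in root.iter(f"{ATOM_NS}entry")
    ]
```

There is nothing for a prompt to guard here: a fabricated paper simply has no row in the parsed feed.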

Law 4 — Build Infrastructure to Delete

v1.3

How it shaped the toolkit: Three tools (NGT, Paper House, Pitch) shipped in v0 and were removed when field use proved them load-bearing in the wrong way. Deletion was made cheap on purpose.

Law 5 — Orchestration Is the New Core Skill

v1.9

How it shaped the toolkit: Tool order is the user’s job, not the system’s. The toolkit teaches orchestration by refusing to do it for you. Two students with the same problem will produce different Tracker traces.

Law 6 — Speed and Knowledge Are Orthogonal

v1.4

How it shaped the toolkit: Friction-on-Demand is this law operationalized. Speed is available everywhere; structured friction is available where speed costs judgment. The user picks.

Honest Engineering

The first version of BIG Tools shipped with ten tools. Field use at the practicum revealed that three of them did not belong in the gate-first / any-order architecture — they were load-bearing the wrong way, or duplicated cognitive moves already covered, or pulled the user away from authorship instead of into it. They were removed in pre-demo remediation.

Internal architecture decision records (ADR-038 and ADR-039) document the deletions and the principles that forced them. The point is not that I made mistakes; the point is that the architecture made the mistakes correctable.

NGT (Nominal Group Technique)

Five AI runs are not five humans externalizing. The conceit was that AI could substitute for the multiplicity of voices NGT depends on; field use revealed it couldn’t. Removed for violating P14 (Human-AI Explanatory Division): the human-only cognitive move that NGT formalizes can’t be done by an LLM, no matter how the runs are framed.

Paper House

LLM-as-judge of readiness. The tool asked the model to score whether an idea was ready for the next stage — the kind of categorical judgment that has to stay with the user. Removed for violating P18 (Tool Metaphor Over Agent): tools verify and enhance; they don’t render verdicts.

Pitch

Internally I called this the microwave chicken problem: AI generating a conviction-bearing artifact extinguishes the conviction. The user’s pitch is the artifact where authorship has to be visible; outsourcing it to a model produces something that performs the form without carrying the substance. Removed for violating P14 (foundational version). The replacement is the Tracker plus Friction-on-Demand: the user’s actual reasoning trail, ready when needed, never pre-baked.

Bonus — the cross-run viz wasn't actually shipping

The first version of Collaborate's cross-run visualization wrote each saved run to a Neo4j AuraDB instance ($65/month, separate cloud) and was supposed to read it back from a graph viz panel on the memory page. End-to-end verification on production showed: the writes were firing into a void (Vercel's serverless lifecycle terminated the function before the async Neo4j write completed), and the read panel was wired to a different Neo4j instance entirely (the one that powers DevLex synthesis). The viz had no path that worked. The fix was to drop Aura and aggregate cross-runs from a JSONB column in the Postgres table where the runs already lived. Same visualization, $0/month, one source of truth, no fire-and-forget writers. The code that survived was the smaller, simpler version that could only have been written after the broken version revealed what the viz actually needed.
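The replacement shape can be sketched: pull the saved runs straight from the JSONB column in the table where they already live, then build the cross-run graph in one synchronous pass, with no second store and no fire-and-forget writer. The SQL string, table and field names, and the graph rule (an edge between any two runs sharing a concept) are assumptions for illustration, not the production schema.

```python
import json
from itertools import combinations
from typing import Dict, List

# Illustrative query against the single source of truth: the Postgres
# table the runs were already being saved to (names are assumptions).
FETCH_RUNS_SQL = "SELECT run_data FROM collaborate_runs WHERE user_id = %s"

def aggregate_cross_runs(run_rows: List[str]) -> Dict[str, list]:
    """Aggregate already-fetched JSONB rows into viz-ready nodes and edges:
    one node per saved run, one edge per pair of runs sharing a concept."""
    runs = [json.loads(r) for r in run_rows]
    nodes = [r["id"] for r in runs]
    edges = [
        (a["id"], b["id"])
        for a, b in combinations(runs, 2)
        if set(a.get("concepts", [])) & set(b.get("concepts", []))
    ]
    return {"nodes": nodes, "edges": edges}
```

Because the aggregation runs inside the request that reads the rows, there is no async write for a serverless lifecycle to terminate mid-flight.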

Key Takeaways

Architecture beats prompt engineering

The Evaluation Crisis Spiral is not a prompt problem. It is an architecture problem. The fix is structural: where attention is allowed to land, what artifacts are preserved, what the system refuses to do for the user.

Make the gate the priority, not the wall

Tacit context — the situation-specific understanding that makes review meaningful — never enters the system on its own. Surface the gate everywhere, in every tool, as the obvious entry move. Don't enforce it as an API wall; the first version did and broke the second-order flows that needed to stand up an idea before they could ask the user about it. The walk-back made the principle stronger by making it structural rather than coercive.

Refuse to orchestrate for the user

Tool order is the work. A toolkit that picks the order steals the cognitive move that matters most. Any-order is a teaching choice, not an engineering shortcut.

Treat provenance as deliverable

The Tracker turned out to be the artifact reviewers actually wanted to read. The trail through the work is more valuable than the polished output, because the trail is what survives review under load.

A note on attribution

BIG Tools is the architecture I built for an idea-development class I co-teach at Brown University with Chris Moore and Carl Moore. The pedagogy half is theirs — Chris brings the cognitive-neuroscience frame, Carl the BIG Ideas methodology refined over five decades of facilitation. The software-architecture layer, the Six Laws synthesis, and the toolkit implementation are mine. The deletions described above name where Carl's critique drove specific architectural decisions; the seven-tool toolkit shipped at seven, not ten, because of that audit.

Canonical content • bigtools.dev

Toolkit, recipes, and the Six Laws blog post

Try the toolkit. Read the session notes. Watch the framing land.

bigtools.dev hosts the canonical content, the recipes, and the Six Laws blog post. The 4/29 class session walks the full architecture in 30 minutes.