Brave New Word

Preset (settings macro): picking one sets the controls below to a known-good recipe, and you can see exactly what changed. Touch anything afterwards and you're back to custom.

Vocabulary (--corpus): the wordlist whose letter patterns are learned. Language lists are the 10,000 most frequent words from film subtitles; the full dictionary skews scientific; paste your own for themed words.

Blend with (corpus blending, a training-weight mix): learn from two vocabularies at once. For example, 70% Spanish + 30% Japanese makes words from a language that doesn't exist.

How many words (-n): how many novel words to generate per run.

Building blocks (--mode char|syllable): what the generator chains together, either single letters (finer, stranger) or whole syllable chunks (chunkier, very pronounceable).

Length, min to max (--min / --max): keep only words within this character range. A narrower range gives fewer surprises; very long words tend to ramble.

Capitalization (--caps random|lower|title): Random capitalizes each word's first letter on a coin flip, so some read as names and some as common words.

advanced settings

Pattern memory (--order, shown at 3): how many previous letters (or syllables) the model looks at to pick the next one. 2 = drunk but inventive, 3 = plausible, 5+ = mostly recreates real words that the novelty filter then rejects.

Weirdness (--prior, shown at 0): a small chance that any letter can appear even where the vocabulary never used it. At 0, every letter n-gram in the output exists somewhere in the vocabulary's real words; above 0.05 it turns to alphabet soup.

Favor common words (--weighted): weight training by word frequency (the list is ordered most-common-first), so everyday words shape the patterns more than obscure ones. Only meaningful for frequency-sorted lists; disabled for the alphabetized dictionary.

Reject all real words (--exclude): also block every word in the full 370,105-word English dictionary from appearing as output, not just the training vocabulary. Downloads the list (~1 MB) on first use.

No repeats, per session (session novelty memory): a word shown once since this page loaded (or already in your kept list) won't appear again while this is on. Note: this makes results history-dependent, so the same seed can give different words and shared links may differ slightly.

Block hidden real words (--max-overlap): sometimes two real words get glued together ("comment" + "tabs" → "commentabs"). This rejects any output that contains a known word of the chosen length or longer as a substring.

Sort results (--sort none|likely|alpha): "most word-like" ranks by how probable each word is under the learned patterns; the top entries are the most convincing and also the most conservative.

Starts with (--starts): keep only words beginning with this (lowercase). Strict filters may yield fewer words than you asked for.

Ends with (--ends): keep only words ending with this.

Contains (--contains): keep only words containing this somewhere.

Seed (--seed): a number that fixes the randomness, so the same seed with the same settings always produces the same words. Leave blank to get fresh words each press.

how it works

Brave New Word generates novel words — strings absent from its training vocabulary that follow its letter patterns. The engine is a character (or syllable) n-gram Markov chain with simplified Katz back-off and an optional Dirichlet prior, compiled from Rust to WebAssembly. Everything runs in your browser: no server, no tracking, nothing you type leaves this page.

Training: each vocabulary word is padded with start sentinels and an end sentinel, then every windowed transition is counted for all context lengths 0..=N (N = pattern memory). Generation samples the next token at the longest context with data, backing off one token at a time when a context was never seen; the empty context always has data, so sampling never fails. With weirdness 0, every n-gram in the output exists somewhere in the training vocabulary — that is what makes the words feel uncannily real.
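The training and back-off steps above can be sketched in a few dozen lines of dependency-free Rust. This is a minimal illustration, not the engine's actual code: the sentinel characters `^` and `$`, the `Model` and `XorShift` types, and every function name are assumptions made for the example, and it covers character mode at weirdness 0 only.

```rust
use std::collections::HashMap;

const START: char = '^'; // hypothetical start sentinel
const END: char = '$'; // hypothetical end sentinel

/// Tiny xorshift64 generator so the sketch needs no crates; the seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Transition counts keyed by context: the preceding 0..=order characters.
struct Model {
    order: usize,
    counts: HashMap<String, HashMap<char, u64>>,
}

impl Model {
    /// Assumes `words` is non-empty, so the empty context always has data.
    fn train(words: &[&str], order: usize) -> Self {
        let mut counts: HashMap<String, HashMap<char, u64>> = HashMap::new();
        for word in words {
            // Pad with `order` start sentinels and one end sentinel.
            let mut padded: Vec<char> = vec![START; order];
            padded.extend(word.chars());
            padded.push(END);
            for i in order..padded.len() {
                // Count this transition under every context length 0..=order.
                for k in 0..=order {
                    let ctx: String = padded[i - k..i].iter().collect();
                    *counts.entry(ctx).or_default().entry(padded[i]).or_insert(0) += 1;
                }
            }
        }
        Model { order, counts }
    }

    /// Back off one character at a time until a context seen in training is found.
    fn longest_context(&self, history: &[char]) -> &HashMap<char, u64> {
        let max = self.order.min(history.len());
        for k in (0..=max).rev() {
            let ctx: String = history[history.len() - k..].iter().collect();
            if let Some(next) = self.counts.get(&ctx) {
                return next;
            }
        }
        unreachable!("the empty context always has data after training")
    }

    /// Sample one word, stopping at the end sentinel or at `max_len` characters.
    fn generate(&self, rng: &mut XorShift, max_len: usize) -> String {
        let mut history: Vec<char> = vec![START; self.order];
        let mut out = String::new();
        while out.chars().count() < max_len {
            let next = self.longest_context(&history);
            // Sort so the result depends only on the seed, not on HashMap order.
            let mut choices: Vec<(char, u64)> = next.iter().map(|(&c, &n)| (c, n)).collect();
            choices.sort();
            let total: u64 = choices.iter().map(|&(_, n)| n).sum();
            let mut roll = rng.next_u64() % total; // slight modulo bias, fine for a sketch
            let mut picked = END;
            for (c, n) in choices {
                if roll < n {
                    picked = c;
                    break;
                }
                roll -= n;
            }
            if picked == END {
                break;
            }
            out.push(picked);
            history.push(picked);
        }
        out
    }
}

fn main() {
    let vocab = ["banana", "bandana", "cabana", "savanna", "nirvana"];
    let model = Model::train(&vocab, 2);
    let mut rng = XorShift(2024); // any non-zero seed
    for _ in 0..5 {
        println!("{}", model.generate(&mut rng, 12));
    }
}
```

The sketch applies no novelty filter, so it can print words that are already in the vocabulary. Sorting the candidate letters before sampling is what keeps a given seed reproducible, since `HashMap` iteration order is not stable between runs.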

Novelty: output words are checked against the training vocabulary (and, optionally, the full 370k-word English list) and rejected if they exist, so every word shown is absent from those lists. "Block hidden real words" additionally rejects outputs that contain a known word of the chosen length or longer, which catches two real words glued together. "Most word-like" sorting ranks candidates by their mean log-probability under the model, accumulated for free during sampling.
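As a rough illustration of these three checks, the sketch below shows an exact-match novelty test, a substring test for hidden real words, and a mean log-probability score. The function names, the `min_len` parameter (standing in for the chosen overlap length), and the idea of passing per-step probabilities as a slice are all assumptions for the example rather than the engine's real interface.

```rust
use std::collections::HashSet;

/// True if `candidate` is absent from the known word list.
fn is_novel(candidate: &str, known: &HashSet<&str>) -> bool {
    !known.contains(candidate)
}

/// True if `candidate` contains a known word of at least `min_len`
/// characters as a substring (hypothetical stand-in for --max-overlap).
fn hides_real_word(candidate: &str, known: &HashSet<&str>, min_len: usize) -> bool {
    let chars: Vec<char> = candidate.chars().collect();
    for start in 0..chars.len() {
        for end in (start + min_len)..=chars.len() {
            let piece: String = chars[start..end].iter().collect();
            if known.contains(piece.as_str()) {
                return true;
            }
        }
    }
    false
}

/// Mean natural-log probability of the sampled transitions; higher means
/// more word-like. Assumes `step_probs` is non-empty.
fn mean_log_prob(step_probs: &[f64]) -> f64 {
    step_probs.iter().map(|p| p.ln()).sum::<f64>() / step_probs.len() as f64
}

fn main() {
    let known: HashSet<&str> = ["comment", "tabs", "banana"].into_iter().collect();
    for word in ["commentabs", "banana", "vorlimex"] {
        println!(
            "{word}: novel={}, hides_real_word={}",
            is_novel(word, &known),
            hides_real_word(word, &known, 4)
        );
    }
    println!("score: {:.4}", mean_log_prob(&[0.5, 0.25, 0.5]));
}
```

Using the mean rather than the sum of log-probabilities keeps long words from being ranked low simply because they have more transitions.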

Syllable mode tokenizes words into consonant+vowel chunks and chains those instead of letters — chunkier and very pronounceable. Frequency weighting scales each training word's influence by its rank (Zipf), so "the" shapes the model ~10,000× more than the 10,000th word.
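Both ideas in this paragraph are small enough to sketch. The chunking rule below (a run of consonants followed by a run of vowels, with any trailing consonants forming their own chunk, and only a/e/i/o/u treated as vowels) is an assumption about what "consonant+vowel chunks" means, and the 1/rank weight is inferred from the 10,000× figure; neither is confirmed as the engine's exact rule.

```rust
/// Treats only ASCII a/e/i/o/u as vowels (a simplifying assumption).
fn is_vowel(c: char) -> bool {
    matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Split a word into chunks of consonants followed by vowels.
/// A consonant that follows a vowel starts the next chunk.
fn syllable_chunks(word: &str) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut seen_vowel = false;
    for c in word.chars() {
        if !is_vowel(c) && seen_vowel {
            chunks.push(std::mem::take(&mut current));
            seen_vowel = false;
        }
        seen_vowel |= is_vowel(c);
        current.push(c);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

/// Zipf-style weights for a most-common-first list: rank r gets weight 1/r.
fn zipf_weights(n: usize) -> Vec<f64> {
    (1..=n).map(|rank| 1.0 / rank as f64).collect()
}

fn main() {
    for word in ["banana", "strand", "audio"] {
        println!("{word} -> {:?}", syllable_chunks(word));
    }
    let w = zipf_weights(10_000);
    println!("rank 1 vs rank 10,000: {:.0}x", w[0] / w[9_999]);
}
```

Under the 1/rank assumption, the first word in a 10,000-word list carries 10,000 times the weight of the last, which matches the ratio quoted above.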

Blending trains on two vocabularies at once, scaling each word's training weight so the two lists split the influence by the mix slider (corpus sizes are mass-equalized first). The result honors both phonotactics — a 50/50 Spanish×Japanese blend is a language that has never existed.
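One way to read "mass-equalized, then split by the mix slider" is sketched below: each list's weights are normalized to a total of 1, then scaled by that list's share of the mix. The function name `blend_weights` and the `mix_a` parameter are hypothetical, and the engine may equalize mass differently.

```rust
/// Scale per-word weights so list A carries `mix_a` of the total training
/// mass and list B carries the rest, regardless of how long each list is.
/// Assumes both lists are non-empty with positive total weight.
fn blend_weights(a: &[f64], b: &[f64], mix_a: f64) -> (Vec<f64>, Vec<f64>) {
    let mass_a: f64 = a.iter().sum();
    let mass_b: f64 = b.iter().sum();
    // Normalize each list to mass 1.0 first, then apply its share of the mix.
    let scaled_a: Vec<f64> = a.iter().map(|w| w / mass_a * mix_a).collect();
    let scaled_b: Vec<f64> = b.iter().map(|w| w / mass_b * (1.0 - mix_a)).collect();
    (scaled_a, scaled_b)
}

fn main() {
    // Four words in list A and two in list B, all with equal base weight.
    let (a, b) = blend_weights(&[1.0; 4], &[1.0; 2], 0.7);
    println!("A per word: {:?}, total {:.2}", a, a.iter().sum::<f64>());
    println!("B per word: {:?}, total {:.2}", b, b.iter().sum::<f64>());
}
```

Normalizing first means a 10,000-word list and a 500-word list blended at 50/50 each contribute half of the training mass, instead of the longer list dominating.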

Vocabularies: google-10000-english (Josh Kaufman; derived from the Google Web Trillion Word Corpus, Brants & Franz / LDC, via Peter Norvig), dwyl/english-words (Unlicense), language lists from FrequencyWords (Hermit Dave, CC-BY-SA-4.0; OpenSubtitles-derived, top 10,000 per language; Japanese transliterated to Hepburn romaji, kana-only entries), US census first names, world surnames (CC0), Latin proper names (CLTK, MIT), world cities, mythological figures (via godchecker.com), and Sanskrit lemmas from the Digital Corpus of Sanskrit (Oliver Hellwig, CC-BY-4.0, IAST).

Coinage lists: YC company names, Drugs@FDA brand names (US public domain), and LDNOOBW profanity (slur entries removed from training). Since every output is novel, no actual profanity ever appears; it only sounds that way.

Algorithm: after markov-namegen and the RogueBasin article "Names from a high order Markov Process and a simplified Katz back-off scheme" (Lund). Built with the markov-words CLI's engine: same Rust code, zero dependencies.

BRAVE NEW WORD b0y.eu/bnw

words you never knew you needed