Greedy Farthest-Point Sampling: How to pick 100 founders that aren't all the same
The Cambrian sim starts with 100 founder eggs. Each egg has 8 traits, each trait has 4-5 alleles. That’s about 160,000 possible genomes.
If I pick 100 of those at random, I get clustering. Some regions of trait space have 5 nearby founders; others have none. The starting population is biased — and any biases at the start get amplified over 500 generations.
What I wanted: 100 founders that are as spread out as possible. Maximum coverage of the trait space.
The algorithm that does this is called greedy farthest-point sampling. It’s old. It’s simple. It’s exactly what you want.
The algorithm
def make_founders(n: int, seed: int) -> list[dict]:
rng = random.Random(seed)
# Generate a big candidate pool — way more than we need
pool = [random_genome(rng) for _ in range(n * 8)]
# Pick the first founder at random
chosen = [pool.pop(0)]
# For each subsequent founder, pick the candidate
# that maximizes minimum distance to all existing chosen
while len(chosen) < n:
best_candidate = None
best_min_dist = -1
for candidate in pool:
min_dist = min(genome_distance(candidate, c) for c in chosen)
if min_dist > best_min_dist:
best_min_dist = min_dist
best_candidate = candidate
chosen.append(best_candidate)
pool.remove(best_candidate)
return chosen
Twelve lines. That’s it.
Why it works
The “farthest” in “farthest-point” is the key. We don’t pick the farthest point from the centroid (that would just push everyone to the edges). We pick the point that’s farthest from its nearest already-chosen point.
This means each new founder fills the biggest gap in the current set. After 100 picks, no two founders are too close together, and the entire trait space has roughly uniform coverage.
It’s like Poisson disk sampling, but cheaper and you don’t need to know the right radius in advance.
What it cost
- Pool size: 8x the number of founders (so 800 candidates for 100 founders). More candidates = better coverage. 8x is a sweet spot.
- Compute: O(n² × pool_size) — not great asymptotically, but n=100 is tiny so it runs in 0.3 seconds.
- Memory: trivial. 800 genomes = a few KB.
For larger n, you can use approximation algorithms (k-means++ initialization, Sobol sequences, etc.). For our scale, the greedy algorithm is plenty.
The result
The 100 founders span the trait space evenly. Every combination of palette × pattern × size × metabolism × lifespan has at least one nearby founder. The starting cladogram has 100 distinct species (not 30 with overlapping niches).
Without farthest-point sampling, the Cambrian sim collapsed to 1-2 species within 50 generations because all 100 founders were too similar to maintain reproductive isolation. With it, the sim produces 101 species over 500 generations and a beautiful cladogram.
Where else this applies
Anywhere you need diverse samples from a large discrete space:
- A/B test cohorts: pick N users that span demographic dimensions
- Synthetic training data: pick N prompt seeds that span topic/length/tone space
- Curriculum design: pick N exercises that span difficulty/concept space
- Survey sampling: pick N respondents that span geographic/age/income space
- Initialization for genetic algorithms: pick N starting solutions that span the search space
If you’re sampling and you care about coverage (not just statistical representativeness), farthest-point is your friend.
The implementation is in scripts/cambrian.py, function make_founders. ~20 lines including the candidate pool generation. Steal it. Use it. Spread your samples out.