Seed Set Selection in Predictive Coding: How Initial Training Data Shapes Review Quality

Naomi Ashford · Founder & CEO, Discovarc · 10 min read

When practitioners ask why two TAR reviews of similarly sized corpora produce such different quality outcomes, seed set design is almost always part of the answer. The seed set — or initial training corpus in a continuous active learning context — is the collection of attorney-reviewed documents that initializes the classifier. Everything the model "knows" at the start of the review comes from these documents. Poor seed set design creates a model that confidently misclassifies large portions of the corpus, often in ways that are not visible until elusion testing surfaces the problem late in the review cycle.

This piece addresses what well-designed seed sets look like in practice, the common construction errors that create downstream problems, and how the approach differs between TAR 1.0 simple passive learning (SPL) and TAR 2.0 continuous active learning (CAL) protocols.

What the Classifier Needs to Learn

A document classifier operating on ESI typically represents each document as a feature vector — a mathematical object encoding word frequency patterns, phrase presence, or (in more sophisticated implementations) semantic embeddings. Logistic regression and support vector machine (SVM) classifiers, which remain common in commercial TAR platforms, learn a decision boundary between the "responsive" class and the "non-responsive" class based on the feature vectors of the labeled training examples.
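To make the feature-vector idea concrete, here is a minimal sketch in plain Python, assuming a simple term-count representation and a logistic regression fitted by stochastic gradient descent. The four seed documents, the function names, and the training settings are all hypothetical; commercial TAR platforms use far richer features and tuned implementations.

```python
import math
from collections import Counter

def featurize(text, vocabulary):
    """Encode a document as term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def predict(weights, bias, vector):
    """Logistic model: estimated probability that the document is responsive."""
    z = bias + sum(w * x for w, x in zip(weights, vector))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(vectors, labels, epochs=500, learning_rate=0.1):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for vector, label in zip(vectors, labels):
            error = predict(weights, bias, vector) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, vector)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical attorney-coded seed documents: 1 = responsive, 0 = non-responsive.
seed_set = [
    ("pricing agreement with competitor", 1),
    ("competitor pricing call notes", 1),
    ("lunch menu for friday", 0),
    ("parking garage closed friday", 0),
]
vocabulary = sorted({term for text, _ in seed_set for term in text.split()})
vectors = [featurize(text, vocabulary) for text, _ in seed_set]
labels = [label for _, label in seed_set]
weights, bias = train_logistic(vectors, labels)

# Score two unseen documents against the learned decision boundary.
score_similar = predict(weights, bias, featurize("call about competitor pricing", vocabulary))
score_unrelated = predict(weights, bias, featurize("friday lunch order", vocabulary))
print(score_similar > 0.5, score_unrelated < 0.5)
```

Terms that never appear in the seed set (here "about" and "order") contribute nothing to the score, which is the mechanical reason a narrow seed set generalizes poorly.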

For the classifier to generalize — to correctly classify documents it has never seen — the training set must represent the full range of document types, communication styles, and subject-matter contexts that appear in the responsive portion of the corpus. A seed set that contains only clearly responsive documents of a single type (say, executive emails discussing the core issue in litigation) will produce a classifier that performs well on similar documents but poorly on responsive documents in different custodian communication styles or different document formats.

The seed set must also include negative examples: documents the reviewing attorneys have identified as non-responsive. The quality and variety of these examples shape how the classifier distinguishes signal from noise at the margin. This is where many seed set designs fail: the non-responsive training documents are treated as a filler category rather than as a curated set that represents the actual non-responsive landscape.

Seed Set Size: What's Enough?

The academic literature on active learning suggests that classifier performance typically plateaus as training set size increases beyond a few hundred to a few thousand examples, assuming the examples are well-selected. Practical e-discovery implementations tend to use seed sets in the range of 500 to 2,000 documents for TAR 1.0 SPL protocols, with the specific target depending on corpus size, responsive prevalence, and matter complexity.

Smaller seed sets require higher-quality document selection to achieve acceptable classifier performance. Larger seed sets provide more representative coverage but require more attorney review time to code. The tradeoff is not linear: doubling the seed set size does not double classifier accuracy. The diminishing-returns curve means that the marginal value of seed set expansion, after the first few hundred well-selected documents, is typically lower than the cost of attorney review time to code additional examples.

For CAL protocols, the initial training round size is a somewhat different question. CAL systems typically start with a smaller initial batch and immediately use the model's output to prioritize the next review batch. Initial training set quality therefore matters most for the model's initial orientation, not for its final accuracy, which is determined by cumulative training over many iterations. A CAL implementation can recover from an imperfect initial training set more readily than an SPL implementation, because the continuous feedback loop corrects early errors as the review proceeds.
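The feedback loop can be sketched as follows. This is a toy illustration, not any platform's algorithm: the term-weight scorer, the eight-document corpus, the fixed round count, and the `attorney_codes` stand-in for human review are all assumptions made for the example.

```python
def train(labeled_texts):
    """Toy model: each term scores +1 per responsive doc and -1 per non-responsive doc."""
    weights = {}
    for text, responsive in labeled_texts:
        for term in set(text.split()):
            weights[term] = weights.get(term, 0) + (1 if responsive else -1)
    return weights

def score(weights, text):
    """Sum the learned weights of the distinct terms in a document."""
    return sum(weights.get(term, 0) for term in set(text.split()))

def cal_review(corpus, seed_ids, attorney_codes, batch_size=2, max_rounds=10):
    """Retrain on everything coded so far, then review the top-ranked batch."""
    labeled = {doc_id: attorney_codes(doc_id) for doc_id in seed_ids}
    for _ in range(max_rounds):
        unreviewed = [doc_id for doc_id in corpus if doc_id not in labeled]
        if not unreviewed:
            break
        weights = train([(corpus[d], labeled[d]) for d in labeled])
        ranked = sorted(unreviewed, key=lambda d: score(weights, corpus[d]), reverse=True)
        for doc_id in ranked[:batch_size]:
            labeled[doc_id] = attorney_codes(doc_id)
    return labeled

# Hypothetical corpus; documents 1, 3, 4 and 7 are the responsive ones.
corpus = {
    1: "pricing call with competitor",
    2: "cafeteria menu update",
    3: "competitor pricing spreadsheet",
    4: "follow up on spreadsheet margins",
    5: "parking garage update",
    6: "holiday party menu",
    7: "margins discussion with sales",
    8: "garage access badge",
}
responsive_ids = {1, 3, 4, 7}
attorney_codes = lambda doc_id: doc_id in responsive_ids  # stand-in for human coding

coded = cal_review(corpus, seed_ids=[1, 2], attorney_codes=attorney_codes, max_rounds=2)
found = {doc_id for doc_id, responsive in coded.items() if responsive}
print(sorted(found), len(coded))
```

In this toy run, document 4 shares no terms with the responsive seed document and only rises to the top after documents 3 and 7 are coded in the first round. That is the recovery behavior described above. A real protocol would replace the fixed round count with a validated stopping rule.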

Targeted Seed Set Construction: Core Techniques

Three techniques are most commonly used for seed set construction in well-designed TAR 1.0 protocols:

Known-relevant document seeding. The most commonly recommended approach for TAR 1.0 is to front-load the seed set with documents already known to be relevant — hot documents from prior productions, communications identified in document requests, documents produced in related litigation. This gives the classifier its first signal about what "responsive" looks like from the specific matter's document universe. The limitation is that known-relevant documents tend to be the most clearly relevant — core communications about the central issues — and the classifier may generalize poorly to peripheral responsive documents (those that are relevant but in less obvious ways).

Issue-stratified random sampling. A stratified random sample pulled from different custodian populations and time periods, then coded by attorneys with knowledge of the review issues, provides more representative positive and negative examples than a purely purposive seed set. The attorneys coding the stratified sample must understand all review issues — not just the most salient issue in the litigation — or the seed set will underrepresent multi-issue responsive documents.

Concept-cluster seeding (Brainspace-style). Platforms that implement concept clustering group documents by semantic similarity before the TAR classifier is initialized, which allows seed set selection that is explicitly diverse across concept clusters. The reviewing attorney selects a small number of examples from each identified cluster, ensuring the seed set contains representatives from all major topic areas in the corpus. The approach is available only on platforms with a clustering feature (Brainspace, Reveal, and some Relativity configurations), and the cluster quality depends on corpus size and document type diversity.
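The stratified and cluster-based techniques share one mechanical step: group the corpus by some key, then draw a fixed number of documents from each group. A minimal sketch, assuming documents are dictionaries with hypothetical `custodian` and `year` fields and that equal allocation per stratum is acceptable (proportional allocation is a reasonable alternative):

```python
import random
from collections import defaultdict

def stratified_sample(documents, stratum_key, per_stratum, seed=0):
    """Draw up to `per_stratum` documents at random from every stratum."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced and documented
    strata = defaultdict(list)
    for doc in documents:
        strata[stratum_key(doc)].append(doc)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Hypothetical corpus metadata: 3 custodians x 2 years x 10 documents each.
documents = [
    {"id": f"{custodian}-{year}-{n}", "custodian": custodian, "year": year}
    for custodian in ("custodian_a", "custodian_b", "custodian_c")
    for year in (2021, 2022)
    for n in range(10)
]

seed_candidates = stratified_sample(
    documents,
    stratum_key=lambda doc: (doc["custodian"], doc["year"]),
    per_stratum=2,
)
print(len(seed_candidates))
```

Swapping the key function for a cluster identifier exported by the platform would give the concept-cluster variant. The sampled documents still have to be coded by attorneys before they become training examples.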

The Over-Seeding Error

One of the less-discussed seed set construction errors is over-seeding with privileged documents. Privileged documents are often among the first documents attorneys pull for seed set construction, because they are already known: documents from prior privilege logs, or communications with litigation counsel during the relevant period. Suppose the seed set contains a high proportion of privileged documents coded as non-responsive. That coding is technically correct, but for a different reason than ordinary non-responsive documents. The classifier may learn to associate attorney-communication features with non-responsive status, which can systematically under-score responsive documents that also involve attorney communication.

The cleaner approach is to handle privilege review and responsiveness determination as separate classifiers or separate review tracks, with the seed set for the responsiveness classifier built from non-privileged documents only. This is particularly important in matters where a significant portion of the responsive population involves communications with in-house counsel about business matters: communications that are responsive but privilege-complex.
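As a rough illustration of keeping the two tracks apart, the sketch below filters privileged documents out of the responsiveness training set and reports how large a share of the coded seed set they were. The dictionary fields and function names are hypothetical, and the privilege flag is assumed to come from counsel's own review, not from a model.

```python
def split_for_training(coded_documents):
    """Separate the responsiveness training set from privilege-flagged documents."""
    training = [doc for doc in coded_documents if not doc["privileged"]]
    privilege_track = [doc for doc in coded_documents if doc["privileged"]]
    return training, privilege_track

def privileged_share(coded_documents):
    """Fraction of the coded seed set that counsel flagged as privileged."""
    if not coded_documents:
        return 0.0
    return sum(1 for doc in coded_documents if doc["privileged"]) / len(coded_documents)

# Hypothetical coded seed documents.
coded_seed = [
    {"id": "D1", "responsive": True, "privileged": False},
    {"id": "D2", "responsive": False, "privileged": True},
    {"id": "D3", "responsive": False, "privileged": False},
    {"id": "D4", "responsive": True, "privileged": False},
]

training, privilege_track = split_for_training(coded_seed)
print(len(training), len(privilege_track), privileged_share(coded_seed))
```

A high privileged share in the raw seed set is a signal to check how those documents were coded before any of them reach the responsiveness model.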

Scenario: Multi-Custodian Antitrust Matter

Consider a fictional scenario representative of the type of challenge that arises in practice: an internal investigation involving nine custodians across three business units, spanning a four-year period, with potential antitrust exposure. The ESI population after processing contains 650,000 documents. Responsive prevalence is estimated at 8-12% based on a preliminary statistical sample.

The litigation support director considers three seed set approaches. A simple random seed set of 1,000 documents would yield approximately 80-120 responsive examples given the estimated prevalence — a training corpus that may be insufficient for a classifier to learn the variety of responsive document types across three different business units with different communication cultures. A known-relevant seeding approach, using documents already identified in the investigation's preliminary interviews, would provide 300-400 well-coded responsive examples but would be skewed toward executive-level communications and underrepresent the operational-level documents that may also be responsive.
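The arithmetic behind the random-sample estimate is simple enough to check directly. The sketch below uses the scenario's figures. The even split across the three business units is an assumption added purely for illustration, since the scenario does not say how responsive documents are distributed.

```python
def expected_count_range(population, prevalence_low, prevalence_high):
    """Expected number of responsive documents at each end of the prevalence range."""
    return round(population * prevalence_low), round(population * prevalence_high)

# Figures from the scenario: 1,000-document random seed set, 8-12% prevalence.
sample_low, sample_high = expected_count_range(1_000, 0.08, 0.12)

# Same range applied to the full 650,000-document population.
corpus_low, corpus_high = expected_count_range(650_000, 0.08, 0.12)

# Assumption for illustration only: responsive documents split evenly across 3 units.
per_unit_low, per_unit_high = sample_low // 3, sample_high // 3

print(sample_low, sample_high)      # 80 120
print(corpus_low, corpus_high)      # 52000 78000
print(per_unit_low, per_unit_high)  # 26 40
```

Under that even-split assumption, each business unit would contribute only a few dozen responsive training examples, which illustrates why the simple random seed set looks thin for this matter.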

The chosen approach: stratified sampling across custodians and time periods (ensuring representation from each business unit), with supplemental seeding from already-identified hot documents and an attorney coding session focused specifically on borderline relevance judgments — the hardest cases, not just the clear calls. The borderline examples train the classifier more effectively at the decision margin than clear positives or clear negatives.

We're not saying any single seed set construction approach is universally correct — matter-specific factors (corpus composition, responsive prevalence, available attorney time for training, platform capabilities) all shape the right design for a given review. The point is that seed set design deserves the same deliberate attention that the ESI protocol and the validation framework receive.

Documentation and Protocol Integration

Your seed set construction methodology should be documented in the TAR protocol with enough specificity to support the certifying attorney's Rule 26(g) obligations. At minimum, document: the number of documents in the seed set, the coding methodology (who coded them, under what instructions, applying what issue criteria), and the quality-check process applied to the seed set coding before model training began.
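One way to keep that minimum list from slipping is to hold it in a structured record that can be checked for gaps before training starts. This is a hypothetical sketch: the class, its field names, and the sample values are illustrative and are not drawn from any platform or court-approved template.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class SeedSetRecord:
    """Minimum seed set documentation items for a TAR protocol."""
    document_count: int = 0
    coders: list = field(default_factory=list)          # who coded the seed set
    coding_instructions: str = ""                       # instructions given to coders
    issue_criteria: list = field(default_factory=list)  # issues applied during coding
    quality_check: str = ""                             # QC applied before training

    def missing_items(self):
        """Names of any documentation items that are still empty."""
        return [name for name, value in asdict(self).items() if not value]

# Hypothetical partially completed record.
record = SeedSetRecord(
    document_count=1200,
    coders=["Senior associate A", "Senior associate B"],
    issue_criteria=["Pricing communications", "Market allocation"],
)
print(record.missing_items())  # ['coding_instructions', 'quality_check']
```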

In matters where the opposing party has requested transparency into the TAR process, the seed set documentation is often one of the first things they ask to inspect. Courts asked to rule on seed set disclosure have not treated it as automatic: in In re Biomet (N.D. Ind. 2013), for example, the court declined to compel production of the full seed set while urging the producing party to cooperate. Where the training process is examined, the question is whether the seed set construction was designed to produce a representative training corpus, not whether it produced the best possible classifier in the abstract.

Privilege determinations made during seed set review remain counsel's responsibility — no classification tool replaces that judgment. If your matter requires a TAR protocol review with specific attention to seed set design, the walkthrough request is the appropriate starting point.