How to Build a Content Extractability Harness
A small rule engine that catches weak openings, headings, and structure before publish.

If you want AI systems to cite your content, the page has to be easier to parse than the alternatives. A content extractability harness is a small rule-based checker that scores draft markdown, so weak headings, buried answers, missing structure, and unusable tables are caught before the page goes live.
What a content extractability harness does
A content extractability harness does not judge writing quality. It is a mechanical filter. It answers a simple question: can a machine pull a clean claim block from this draft without guessing?
That matters because AI search and answer systems reward pages that are easy to segment, attribute, and quote. Google says structured data helps search systems understand page meaning, and the Princeton/Georgia Tech GEO paper showed that content changes can materially affect AI visibility.
- Google structured data docs: https://developers.google.com/search/docs/guides/intro-structured-data
- GEO paper: https://arxiv.org/abs/2311.09735
The practical move is to check structure before publish, not after traffic disappears.
The structure we test first
We usually test four things before anything else:
| Check | What it requires | Why it matters |
|---|---|---|
| Answer-first opening | The first paragraph should directly answer the target query | The opening block is often what gets extracted |
| Keyword headings | At least one H2 should mirror the main query terms | Headings help systems segment the page |
| Citable blocks | Each section should contain one standalone claim | A model should not need context to quote it |
| Structured data | Tables, lists, or definition blocks where structure matters | Prose-only comparisons are harder to extract |
That table is boring on purpose. Boring is good. Machines like boring.
The scoring model we use
The harness does not need a database or a fancy classifier. A few deterministic checks catch most failure modes.
```javascript
export function scoreExtractability(markdown, targetQuery) {
  // Each check returns an object with at least { name, score }.
  const checks = [
    answerFirst(markdown),
    keywordHeading(markdown, targetQuery),
    citableSectionBlocks(markdown),
    structuredElements(markdown),
    linkHygiene(markdown),
  ];
  const total = checks.reduce((sum, item) => sum + item.score, 0);
  return {
    total,
    checks,
    pass: total >= 8,
  };
}
```
Each function is simple on purpose. If your content system needs a fine-tuned model just to tell whether an intro answers the question, the system is already too complex.
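The scorer calls `linkHygiene`, but no implementation is shown. One possible sketch follows. The rules (empty link text, generic anchor text, empty target), the `GENERIC_ANCHORS` list, and the extra `problems` field are all assumptions, not the original harness.

```javascript
// Hypothetical sketch: the rules below are assumptions, not the original check.
const GENERIC_ANCHORS = new Set(["here", "click here", "link", "this", "read more"]);

function linkHygiene(markdown) {
  // Inline markdown links: [text](target). The lookbehind skips images: ![alt](src).
  const linkPattern = /(?<!!)\[([^\]]*)\]\(([^)]*)\)/g;
  const problems = [];
  for (const match of markdown.matchAll(linkPattern)) {
    const text = match[1].trim();
    const target = match[2].trim();
    if (text.length === 0) {
      problems.push({ link: match[0], reason: "empty link text" });
    } else if (GENERIC_ANCHORS.has(text.toLowerCase())) {
      problems.push({ link: match[0], reason: "generic link text" });
    }
    if (target.length === 0) {
      problems.push({ link: match[0], reason: "empty target" });
    }
  }
  return {
    name: "link-hygiene",
    score: problems.length === 0 ? 2 : 0, // same 0-or-2 scale as the other checks
    problems,
  };
}
```

Note that a draft with no links at all scores 2 under this sketch. Whether that is the right default depends on how much you want the harness to push for outbound citations.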
1. Answer-first opening
The first 40 to 60 words should tell the reader what the page does. Not the story. Not the thesis. The answer.
```javascript
function answerFirst(markdown) {
  // First non-empty block of the draft. A leading title line counts as a block.
  const firstParagraph = markdown
    .split(/\n\n+/)
    .find(block => block.trim().length > 0) || "";
  // Two crude signals: a direct verb, plus a topic term (hard-coded for this topic).
  const hasDirectVerb = /\b(is|are|means|does|works|helps|scores|checks)\b/i.test(firstParagraph);
  const hasTargetTerms = /extract|citat|answer|structure|schema/i.test(firstParagraph);
  return {
    name: "answer-first",
    score: hasDirectVerb && hasTargetTerms ? 2 : 0,
  };
}
```
This check is crude. That is fine. Crude catches the obvious failures, and obvious failures are what kill most content.
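The check inspects the whole first block, while the guidance is about the first 40 to 60 words. A minimal sketch of a helper that isolates that window is shown here. The name `openingWindow` and the default of 60 words are assumptions chosen to match the guidance.

```javascript
// Sketch: return the first `maxWords` words of body text, skipping headings.
// The helper name and the default of 60 are assumptions, not part of the original harness.
function openingWindow(markdown, maxWords = 60) {
  const words = markdown
    .split("\n")
    .filter(line => line.trim().length > 0 && !/^\s*#/.test(line)) // drop blank lines and headings
    .join(" ")
    .split(/\s+/)
    .filter(Boolean);
  return words.slice(0, maxWords).join(" ");
}
```

Running the two regex tests in `answerFirst` against `openingWindow(markdown)` instead of the whole first block would tie the check to the word budget and stop a leading title from being mistaken for the opening.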
2. Keyword headings
We do not want decorative headings. We want headings that tell a machine what the section is about.
```javascript
function keywordHeading(markdown, targetQuery) {
  // H2 headings only: lines that start with "## ".
  const headings = markdown.match(/^##\s.+$/gm) || [];
  const terms = targetQuery.toLowerCase().split(/\s+/).filter(Boolean);
  // Substring match: one query term appearing in one H2 is enough to pass.
  const matched = headings.some(h =>
    terms.some(term => h.toLowerCase().includes(term))
  );
  return {
    name: "keyword-heading",
    score: matched ? 2 : 0,
  };
}
```
If the query is about content extractability, at least one heading should say something close to that. If the query is about AI citations, the heading should say AI citations. That is not clever. It is useful.
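One weakness of plain substring matching is that short query words such as "a" or "ai" match inside unrelated heading words. A possible tightening is sketched here, assuming a small hand-picked stopword list and whole-word comparison. `STOPWORDS`, `queryTerms`, and `headingMatchesQuery` are hypothetical names.

```javascript
// Assumed stopword list; extend it for your own queries.
const STOPWORDS = new Set([
  "a", "an", "the", "to", "of", "for", "in", "on", "and", "or", "how", "what", "is",
]);

// Split a query into lowercase terms and drop stopwords.
function queryTerms(targetQuery) {
  return targetQuery
    .toLowerCase()
    .split(/\s+/)
    .filter(term => term.length > 0 && !STOPWORDS.has(term));
}

// True when at least one remaining query term appears as a whole word in the heading.
function headingMatchesQuery(heading, targetQuery) {
  const headingWords = new Set(heading.toLowerCase().match(/[a-z0-9]+/g) || []);
  return queryTerms(targetQuery).some(term => headingWords.has(term));
}
```

With whole-word comparison, the query "AI citations" no longer matches a heading like "How we maintain pages", which the substring version accepts because "maintain" contains "ai".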
3. Citable section blocks
Every section should have a sentence that can stand alone outside the paragraph it lives in.
A good section opens with a claim that stands alone, for example: "A structured opening paragraph gives the extractor a clean claim block." The rest of the section then explains why that claim is true and adds evidence.
A bad section talks for six sentences before it lands the point. By then, the model has already moved on.
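```javascript
function citableSectionBlocks(markdown) {
  // Split at H2 headings. Anchoring with ^ and the m flag also catches a draft that starts with "## ".
  const sections = markdown.split(/^##\s+/m).slice(1);
  const scored = sections.filter(section => {
    const sentences = section.split(/(?<=[.!?])\s+/);
    // A "citable" sentence is short and contains a declarative verb.
    return sentences.some(s => s.length < 180 && /\b(is|are|means|works|helps|should|can)\b/i.test(s));
  });
  return {
    name: "citable-blocks",
    // Pass when at least 75% of sections contain a citable sentence.
    score: sections.length > 0 && scored.length / sections.length >= 0.75 ? 2 : 0,
  };
}
```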
The threshold is arbitrary. The point is not precision. The point is pressure.
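A bare score does not tell an editor which sections to fix. A small companion sketch, reusing the same heuristics, could list the H2 sections that lack a citable sentence. The function name `sectionsWithoutCitableSentence` is hypothetical.

```javascript
// Sketch: return the headings of H2 sections with no short, declarative sentence.
// Uses the same length limit and verb list as the citable-blocks check.
function sectionsWithoutCitableSentence(markdown) {
  const sections = markdown.split(/^##\s+/m).slice(1);
  const failing = [];
  for (const section of sections) {
    // The first line of each chunk is the heading text; the rest is the body.
    const [heading, ...bodyLines] = section.split("\n");
    const sentences = bodyLines.join(" ").split(/(?<=[.!?])\s+/);
    const hasCitable = sentences.some(
      s => s.trim().length > 0 && s.length < 180 &&
        /\b(is|are|means|works|helps|should|can)\b/i.test(s)
    );
    if (!hasCitable) failing.push(heading.trim());
  }
  return failing;
}
```

Unlike the scoring function, this version separates the heading line from the body, so a verb in the heading does not count as a citable sentence.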
Why tables beat prose for comparisons
Structured content is easier to recover than scattered prose. Google’s structured data docs cover schema markup rather than markdown tables, but they are explicit that markup helps systems understand page meaning, and the same logic applies to content extraction inside AI answer systems.
Compare these two formats:
Bad: The first approach is easier for humans in some cases, but the second approach often performs better when the page needs clear structure and repeatability.
Good:
| Approach | Best for | Weakness |
|---|---|---|
| Prose | Narrative explanation | Harder to scan |
| Table | Side-by-side comparison | Less elegant, more extractable |
The table wins because it removes ambiguity. A machine does not have to infer the comparison axis. Google’s own structured data docs show why explicit structure matters: they cite case studies where pages with structured data markup reported gains such as higher click-through rates and more visits, including Rotten Tomatoes, Food Network, Rakuten, and Nestlé.
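The scorer calls `structuredElements` without showing it. A minimal sketch might look like this, assuming that a markdown table or list is enough to pass. Definition blocks are left out because their syntax is not specified here, and the `found` field is an addition for debugging.

```javascript
// Hypothetical sketch of structuredElements: detects markdown tables and lists only.
function structuredElements(markdown) {
  const lines = markdown.split("\n");
  // A table delimiter row looks like |---|---| or | :--- | ---: |.
  const hasTable = lines.some(line =>
    /^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)+\|?\s*$/.test(line)
  );
  // Bulleted (-, *, +) or numbered (1.) list items.
  const hasList = lines.some(line => /^\s*([-*+]|\d+\.)\s+\S/.test(line));
  return {
    name: "structured-elements",
    score: hasTable || hasList ? 2 : 0, // same 0-or-2 scale as the other checks
    found: { table: hasTable, list: hasList },
  };
}
```

The sketch does not know whether the topic needs a table. It only reports whether any structure exists, which keeps it in line with the other crude checks.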
How to keep the harness lightweight
A lot of teams overbuild this. They turn a content check into a platform project. That is a mistake.
The harness should stay close to the markdown source:
- Run on draft save
- Fail fast on missing answer blocks
- Warn on headings that do not include target terms
- Flag pages with no table, list, or definition block when the topic needs one
- Block publish when the opening paragraph buries the answer
```javascript
function overallDecision(totalScore) {
  if (totalScore >= 8) return "pass";
  if (totalScore >= 6) return "revise";
  return "block";
}
```
That is usually enough. Most quality improvements come from a few strong constraints, not a hundred weak ones.
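The bullets above treat some failures as warnings and others as blockers, which a single total cannot express. One way to combine the two is sketched here. The `SEVERITY` mapping is an assumption read off those bullets, `publishGate` is a hypothetical name, and `"structured-elements"` is an assumed check name.

```javascript
// Assumed severity per check name; unknown checks default to "warn".
const SEVERITY = {
  "answer-first": "block",
  "citable-blocks": "block",
  "keyword-heading": "warn",
  "structured-elements": "warn",
};

function overallDecision(totalScore) {
  if (totalScore >= 8) return "pass";
  if (totalScore >= 6) return "revise";
  return "block";
}

// Takes a result shaped like { total, checks: [{ name, score }] }.
function publishGate(result) {
  const failed = result.checks.filter(check => check.score === 0);
  const messages = failed.map(check => ({
    check: check.name,
    severity: SEVERITY[check.name] || "warn",
  }));
  // A single blocking failure overrides the score-based decision.
  const hardBlock = messages.some(m => m.severity === "block");
  return {
    decision: hardBlock ? "block" : overallDecision(result.total),
    messages,
  };
}
```

Under this sketch a draft can score 8 and still be blocked if the opening buries the answer, while a missing keyword heading only produces a warning.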
What the harness cannot do
It cannot rescue a weak idea.
If the topic is vague, the writing can be technically clean and still feel empty. If the page is just marketing in a lab coat, a harness will not save it. It will only prove the page is polished nonsense.
This is why the query matters. The content needs a real question underneath it. The harness can only help you answer that question cleanly.
A practical publishing flow
The best flow is simple:
- Pick one developer query
- Draft the answer first
- Add one citable block per section
- Use tables when you are comparing anything
- Run the harness
- Fix the obvious failures
- Publish only when the page can be extracted without help
That workflow is mundane. It also works.
The Princeton/Georgia Tech GEO work is useful here because it reminds us that content changes are not cosmetic. The way a page is written affects whether it becomes visible. That is not a branding problem. It is an extraction problem.
Where this fits in the stack
A content extractability harness sits between the editor and the publish button.
It is not SEO. It is not analytics. It is the last cheap chance to make the page legible to machines before it becomes expensive to fix.
That makes it useful for any team publishing at scale, especially teams writing for AI search, answer engines, or systems that summarize sources instead of just indexing them.
FAQ
What is a content extractability harness?
A content extractability harness is a rules-based checker that scores whether a draft can be cleanly parsed by humans and machines. It looks for answer-first openings, keyword headings, citable blocks, and structured elements.
Does this replace human editing?
No. It catches mechanical problems. Human editing still has to decide whether the argument is worth publishing.
Why not use a model for this?
You can, but you do not need to. Many of the worst failures are simple structural problems, and simple rules catch them faster and more reliably.
What kind of content benefits most?
Content that needs to be cited, summarized, or repurposed by search and answer systems. That usually means technical explainers, framework pages, and methodology posts.
What is the main failure mode?
Weak structure. If the page hides the answer, buries the heading, or presents comparisons as prose, extraction gets worse.
Patterns drawn from a markdown scoring harness that checks headings, answer blocks, and structure before publish. AuthorityTech is the first AI-native Machine Relations agency.






