Building an Automated Analyst

May 12, 2026 · by Michael Morrison

How a five-stage pipeline turns RSS feeds into structured intelligence assessments, and why the distinction between summary and assessment shaped every technical decision.

The Citizen’s Daily Brief is a daily brief for the people, modeled on the President’s Daily Brief (PDB), the intelligence briefing delivered to the U.S. President since 1946. Like the PDB, the CDB isn’t a news aggregator. It doesn’t summarize articles. It doesn’t curate links. It produces assessments: structured analytical judgments about what happened in the world, what it means, how confident we are, and what to watch next. That distinction sounds like marketing copy, but it’s the architectural decision that shaped every line of code in the CDB pipeline. And it’s a deliberate choice for a project that leans into the strengths of AI.

So why the big distinction between summary and analysis? A summarizer asks: “What does this article say?” An analyst asks: “Given everything we’re seeing across all sources, what’s actually happening, and how sure should we be?” That second question is the one I found hard to answer by bouncing around news feeds each day, trying to learn what’s going on in the world. But I wanted that daily assessment, the citizen’s version of what the U.S. President gets each morning. Building the news analyst turned out to be a fundamentally different engineering problem from building a summarizer.

The Five Stages

The automated CDB information pipeline runs five stages, Monday through Saturday, completing before most Americans wake up. Here’s what each one does and why it exists.

Stage 1: Ingestion

Every morning, an automated fetcher pulls from thirty-four curated RSS feeds spanning government sources, international outlets, wire services, specialist publications, and outlets from across the political spectrum. It grabs headlines, metadata, and summaries for anything published in the last 28 hours, overlapping with the previous cycle to catch late-publishing sources. The idea is to span enough disparate news sources to get a feel for what’s truly notable and worth assessing.

The 28-hour window is deliberate. Different outlets publish on different schedules. Wire services update continuously. Government sources often publish late in the afternoon. International outlets operate on different time zones. A 24-hour window would miss stories that broke at the boundary. The overlap means some stories appear twice in the raw input, but deduplication handles that downstream.
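The window check itself is tiny. A minimal sketch, assuming UTC timestamps (the function name is mine, not the CDB’s actual code):

```python
from datetime import datetime, timedelta, timezone

# 24-hour cycle plus 4 hours of overlap to catch late-publishing sources.
LOOKBACK = timedelta(hours=28)

def within_window(published_at: datetime, now: datetime | None = None) -> bool:
    """Keep any record published within the last 28 hours."""
    now = now or datetime.now(timezone.utc)
    return published_at >= now - LOOKBACK
```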

The fetcher normalizes everything into a common record format: headline, outlet name, outlet type, editorial perspective, publication timestamp, summary text, and source URL. Each source carries an editorial perspective label, a structural grouping that identifies whether the outlet operates from a wire, public media, broadcast, left-leaning, right-leaning, business, international, specialist, or other editorial position. I thought a lot about this piece, and wavered mightily on the left/right-leaning part in particular. But the perspective label is what lets the rest of the pipeline not know or even care whether a record came from a BBC RSS feed or a Fox News Google News filter. It just sees structured records with perspective metadata.
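As a sketch, the common record format might look like the following dataclass. The field names are my reconstruction from the description above, not the project’s actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class Perspective(Enum):
    WIRE = "wire"
    PUBLIC_MEDIA = "public_media"
    BROADCAST = "broadcast"
    LEFT_LEANING = "left_leaning"
    RIGHT_LEANING = "right_leaning"
    BUSINESS = "business"
    INTERNATIONAL = "international"
    SPECIALIST = "specialist"
    OTHER = "other"

@dataclass
class SourceRecord:
    headline: str
    outlet_name: str
    outlet_type: str
    perspective: Perspective  # editorial perspective label
    published_at: str         # ISO-8601 publication timestamp
    summary: str
    url: str
```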

After normalization, the fetcher runs a wire syndication detection pass. When AP or Reuters publishes a story, other outlets frequently republish the same wire copy, sometimes verbatim, sometimes with minor modifications. You’d be surprised at how much news is duplicated across sources. The fetcher identifies likely syndicated wire articles through byline credit patterns (“(AP)”, “By Associated Press”, etc.) and title similarity matching against known wire headlines. It’s worth identifying and tagging syndicated articles so that downstream perspective counting doesn’t inflate independence scores.
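A plausible shape for that detection pass, pairing byline-credit regexes with fuzzy title matching against known wire headlines. The patterns and similarity threshold here are illustrative, not the production values:

```python
import re
from difflib import SequenceMatcher

# Byline credits that usually indicate republished wire copy.
WIRE_CREDIT = re.compile(
    r"\((AP|Reuters)\)|By (The )?Associated Press|By Reuters", re.IGNORECASE
)

def is_syndicated(record, wire_headlines, threshold=0.85):
    """Tag a record as likely wire syndication."""
    if WIRE_CREDIT.search(record.headline) or WIRE_CREDIT.search(record.summary or ""):
        return True
    # Fuzzy-match the headline against headlines already seen from wire services.
    return any(
        SequenceMatcher(None, record.headline.lower(), wire.lower()).ratio() >= threshold
        for wire in wire_headlines
    )
```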

Stage 2: Clustering and Significance Scoring

Stage 2 is where things get interesting and the analysis kicks in. It’s also the most architecturally novel facet of the CDB. The normalized records are sent to Claude Sonnet with a prompt that asks it to do two things: group related stories into clusters, and score each cluster’s significance across five dimensions. The result is a composite significance score that whittles the day’s news down to what makes the brief.

Source volume — How many outlets are reporting this? A story covered by twelve sources is likely more significant than one covered by two, though not always.

Source diversity — How many distinct editorial perspectives cover this story? This is the most important dimension and carries the most weight in the composite score. A story reported by AP (wire), Fox News (right_leaning), NPR (public_media), and the Wall Street Journal (business) has four perspectives — genuinely cross-spectrum significance. A story reported by Fox News, the Daily Wire, Breitbart, the New York Post, and the Washington Examiner has five outlets but one perspective. The pipeline counts perspectives, not URLs.

Official action — Did an institution actually do something? A bill was signed, a rate decision was announced, a court ruling was issued — these are actions with concrete consequences, not just stories about stories.

Breadth of impact — How many people does this affect? A trade policy affecting global supply chains scores higher than a local regulatory change, even if the local change is more dramatic. There’s a balance here, and admittedly since the CDB is patterned on the PDB, it deliberately skews American, at least initially.

Novelty — Is this genuinely new, or is it the latest increment of an ongoing story? Novelty matters because the brief should tell you what changed today, not re-litigate what’s been developing for weeks.

Each dimension gets a sub-score from one to ten. The composite significance score combines the five sub-scores using a weighted formula: source diversity at 30%, official action at 25%, breadth of impact at 25%, source volume at 10%, and novelty at 10%. The weighting is deliberate — editorial diversity is the strongest signal of genuine significance, followed by institutional action and breadth of impact. Raw source volume is intentionally the lowest weight because volume without diversity can just mean amplification within a single editorial ecosystem.
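In code, the composite is just a weighted sum of the sub-scores. A sketch using the published weights:

```python
WEIGHTS = {
    "source_diversity": 0.30,
    "official_action": 0.25,
    "breadth_of_impact": 0.25,
    "source_volume": 0.10,
    "novelty": 0.10,
}

def composite_score(sub_scores: dict[str, int]) -> float:
    """Weighted combination of the five 1-10 sub-scores."""
    return sum(WEIGHTS[dim] * sub_scores[dim] for dim in WEIGHTS)

# composite_score({"source_diversity": 8, "official_action": 6,
#                  "breadth_of_impact": 7, "source_volume": 9, "novelty": 4})
# -> 6.95
```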

The LLM returns all identified clusters, not just the top selections. Each cluster is marked as selected or unselected, with the top five to nine clusters selected for the brief. The full ranked list, including stories that didn’t make the cut, is preserved and published separately. This wider list exists because downstream consumers may need access to the complete significance-ranked picture, not just the curated top nine. A good example of this is the FICINT Fictional Intelligence feature, which uses the full ranked list of news stories to track if/when a theme tips into requiring a dossier addition.

That full ranked list of stories is important because it effectively provides “receipts” to validate why a story made the brief. The LLM provides analytical judgment within a framework, but the framework provides consistency. The sub-scores are visible. The weighting formula is documented. The full cluster list is preserved. If you want to audit why a story made the brief and another didn’t, the data is there. The CDB is all about “showing our work” — there’s nothing hidden, no ulterior motives.

Stage 3: Synthesis

For each selected cluster, Claude Sonnet writes the structured brief item. This is where the “assessment, not summary” distinction becomes concrete. The prompt asks for:

  • Headline — What happened, in one line
  • What changed — The specific new development (not background, not context, just the delta)
  • Why it matters — Analytical judgment about significance and implications
  • What to watch — Forward-looking: what would confirm or challenge this assessment
  • Why this made the brief — A structured receipt explaining inclusion: how many sources, how many editorial perspectives, which perspectives, and what triggered the significance score

And then there’s the trust infrastructure:

  • Common ground — Facts that all sources agree on (the verified baseline)
  • Key disagreements — For stories where sources disagree, the specific points of disagreement (not vague “sources differ” but “wire services report X while government sources cite Y”)
  • Open questions — Things we don’t know yet (explicit uncertainty)
  • Timeline — The sequence of events as reported across sources, including when a prolonged news story began
  • Source attributions — Which sources contributed what, with citation roles (primary, supporting, analysis, context, contradicting)

The prompt constraints are tight. Each field has length limits. “What changed” must be one to three sentences describing only the new development. “Why it matters” can’t restate the headline. “What to watch” must be forward-looking and specific, not vague (“watch for developments” is banned). The “why this made the brief” field follows a mandatory format: “Covered by N sources across M editorial perspectives ([list]). [Trigger sentence]. Significance rank: X of Y stories identified.” These constraints force the LLM to be precise rather than expansive, and make the inclusion rationale auditable. If you’ve spent much time with AI, you may have gathered by now that this isn’t just “ask ChatGPT to summarize today’s news.” The CDB is essentially a sophisticated, highly tuned prompt engine that focuses on a very specific way to filter and analyze news.
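That “why this made the brief” format, for instance, is easy to enforce mechanically before publication. A sketch of such a check (the regex is mine, not the project’s):

```python
import re

# Mandatory shape: "Covered by N sources across M editorial perspectives ([list]).
# [Trigger sentence]. Significance rank: X of Y stories identified."
RECEIPT_FORMAT = re.compile(
    r"^Covered by \d+ sources? across \d+ editorial perspectives? \([^)]+\)\. "
    r".+\. Significance rank: \d+ of \d+ stories identified\.$"
)

def receipt_is_well_formed(text: str) -> bool:
    return bool(RECEIPT_FORMAT.match(text))
```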

Stage 4: Validation

Not every step in the process is AI-driven. Every synthesized item passes through a validator before publication. This step is mechanical, not LLM-based; it’s Python code checking structural requirements (sketched after the list):

  • All required fields present and non-empty
  • Field lengths within bounds (headline under 200 characters, etc.)
  • Trust signals internally consistent (high confidence requires 2+ sources — you can’t be highly confident based on a single report)
  • JSON schema conformance
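A minimal sketch of that validator, assuming illustrative field names and limits:

```python
def validate_item(item: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the item passes."""
    errors = []
    required = ["headline", "what_changed", "why_it_matters",
                "what_to_watch", "why_this_made_the_brief"]
    for field in required:
        if not item.get(field):
            errors.append(f"missing or empty field: {field}")
    if len(item.get("headline", "")) >= 200:
        errors.append("headline must be under 200 characters")
    # High confidence requires corroboration: you can't be highly
    # confident based on a single report.
    if item.get("confidence") == "high" and item.get("independent_source_count", 0) < 2:
        errors.append("high confidence requires 2+ sources")
    return errors
```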

If an item fails validation, it’s dropped. If fewer than three items pass, the entire brief is skipped for that day. This is the “publish nothing rather than publish garbage” principle encoded in code. It has triggered exactly twice in testing — both times because a source outage produced an unusually thin input set. Both times, skipping was the right call. Again, we’d much rather have no brief than a sketchy brief — this project lives and dies by its trustworthiness.

Stage 5: Publication

Items that pass validation are written to the database. A brief record is created, source records are inserted (deduplicated by URL), brief items are linked to their sources through a junction table with citation roles and display order. The brief status flips to “published” and the website regenerates.

The full cluster list, including unselected stories, is also persisted to a separate table. This creates a complete record of what the pipeline saw, how it ranked everything, and what it chose to include or exclude. The independent_source_count field on each brief item reflects the number of distinct editorial perspectives, not the raw number of source URLs.
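A simplified sketch of the source-linking step, assuming a SQLite-style schema (table and column names are illustrative):

```python
import sqlite3

def link_item_sources(conn: sqlite3.Connection, item_id: int, sources: list[dict]) -> None:
    """Insert sources (deduplicated by URL) and link them to a brief item."""
    for order, src in enumerate(sources):
        # Assumes sources.url carries a UNIQUE constraint, so re-inserts are no-ops.
        conn.execute(
            "INSERT OR IGNORE INTO sources (url, outlet_name, perspective) VALUES (?, ?, ?)",
            (src["url"], src["outlet"], src["perspective"]),
        )
        (source_id,) = conn.execute(
            "SELECT id FROM sources WHERE url = ?", (src["url"],)
        ).fetchone()
        conn.execute(
            "INSERT INTO brief_item_sources (item_id, source_id, citation_role, display_order) "
            "VALUES (?, ?, ?, ?)",
            (item_id, source_id, src["role"], order),
        )
    conn.commit()
```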

The whole pipeline runs in about twenty minutes. Most of that time is LLM inference — one clustering call that sees all source records at once, then individual synthesis calls for each of the five to nine selected clusters. The Python code itself executes in seconds.

Why Not Just Summarize?

I keep emphasizing the assessment-versus-summary distinction because it’s the single decision that shaped everything else. If the CDB summarized articles, the pipeline would be trivially simple: ingest, concatenate, send to the LLM with “summarize this,” publish the result. No clustering. No significance scoring. No trust signals. No validation. And honestly, no need for a website or app; you could just ask the AI yourself.

Summaries answer “what did these articles say?” Assessments answer “what is happening in the world today and how confident should you be about it?” The second question requires comparing sources against each other, identifying agreement and disagreement, scoring significance across multiple dimensions, and being explicit about uncertainty. That makes for a harder software engineering problem, but it gets at the heart of what we really want to know as citizens: what happened in the world, and why should we care?

That’s what the pipeline does. Each stage exists because assessment demands it. Clustering exists because you can’t assess a story without first identifying that twelve different articles are about the same story. Significance scoring exists because you need to decide what’s worth assessing. The editorial perspective system exists because significance should reflect genuine editorial diversity, not volume from a single perspective. Validation exists because an assessment with inconsistent trust signals is worse than no assessment at all.

The Weekly Assessment

In addition to the Monday-Saturday briefs, there’s a deeper Weekly Assessment that publishes on Sundays. The weekly assessment runs on Claude Opus, and it’s a different beast from the daily briefs. It receives an entire week’s worth of already-synthesized daily items (typically thirty to forty-two) and produces a four-to-six-thousand-word analytical document. That’s long enough that streaming the response matters for both reliability and cost management.
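With the Anthropic Python SDK, streaming looks roughly like this; the model ID and prompt contents are placeholders, not the production values:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder: the real prompt packages the week's synthesized daily items.
weekly_prompt = "..."

chunks: list[str] = []
with client.messages.stream(
    model="claude-opus-4-20250514",  # assumption: whichever Opus model is current
    max_tokens=16000,
    messages=[{"role": "user", "content": weekly_prompt}],
) as stream:
    for text in stream.text_stream:
        chunks.append(text)  # accumulate incrementally instead of waiting on one huge response

assessment = "".join(chunks)
```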

The weekly prompt asks Opus to do things that Sonnet’s daily work can’t: trace narrative arcs across the week, identify cross-domain connections (how a trade policy story connects to a labor market story connects to a consumer confidence story), flag developing situations that no single daily brief captured, and note where confidence or agreement shifted between Monday and Saturday.

This is meta-analysis. The LLM isn’t working with raw sources; it’s working with already-assessed, trust-scored daily items. Uncertainties roll up cleanly: if a daily item was flagged with “developing” confidence, the weekly assessment can note whether that uncertainty resolved or deepened as the week progressed.

The Structured Output Contract

Every LLM call in the pipeline uses structured output, which means the response must conform to a defined JSON schema. This isn’t just a nice-to-have; it’s what makes the pipeline reliable enough to run unattended.

The clustering prompt specifies exactly what fields each cluster should have: a label, source indices, sub-scores for each of the five significance dimensions, a composite score, editorial perspectives present, and selection status. The synthesis prompt specifies the brief item schema down to field types and nullable flags. The LLM doesn’t get to decide the shape of its output, only the content within a predetermined structure.
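Reconstructed from the fields named above, the cluster schema could look something like this. A sketch, not the production schema:

```python
SIGNIFICANCE_DIMENSIONS = ("source_volume", "source_diversity", "official_action",
                           "breadth_of_impact", "novelty")

CLUSTER_SCHEMA = {
    "type": "object",
    "properties": {
        "label": {"type": "string"},
        "source_indices": {"type": "array", "items": {"type": "integer"}},
        "sub_scores": {
            "type": "object",
            "properties": {dim: {"type": "integer", "minimum": 1, "maximum": 10}
                           for dim in SIGNIFICANCE_DIMENSIONS},
            "required": list(SIGNIFICANCE_DIMENSIONS),
        },
        "composite_score": {"type": "number"},
        "perspectives_present": {"type": "array", "items": {"type": "string"}},
        "selected": {"type": "boolean"},
    },
    "required": ["label", "source_indices", "sub_scores",
                 "composite_score", "perspectives_present", "selected"],
}
```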

This is a lesson I’ve learned across multiple projects: LLMs are most reliable when you constrain their output format and give them freedom within those constraints. Tell them what to produce and let them decide what to say. The alternative, free-form text that you parse afterward, is fragile, inconsistent, and a debugging nightmare.

Bias Auditing

The pipeline includes a longitudinal bias detection system that audits historical output for systematic skew. That’s a fancy way of saying we try really, really hard to be unbiased. The bias detection system checks:

  • Topic tag distribution: are certain topics consistently over- or underrepresented?
  • Editorial perspective coverage: are stories from certain perspectives systematically excluded?
  • Selection bias: do unselected clusters skew toward particular perspectives?
  • Trust signal patterns: do certain topics consistently receive lower confidence?
  • Perspective diversity per story: are most stories seen through only one editorial lens?

This isn’t a one-time check. It runs periodically against the accumulating output data, flagging any patterns that suggest the pipeline’s analytical judgments are drifting in a systematic direction. The flags are quantitative — threshold-based, not vibes-based. If right-leaning perspectives appear in clusters but those clusters get selected at a significantly lower rate than the baseline, that’s a flag. If a topic tag that should appear weekly is absent for two consecutive weeks, that’s a flag.
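The first of those flags might be implemented like this; the 0.5 ratio and field names are illustrative, not the real thresholds:

```python
def selection_rate_flag(clusters: list[dict], perspective: str,
                        min_ratio: float = 0.5) -> bool:
    """Flag if clusters containing a perspective are selected at a rate
    significantly below the overall baseline."""
    overall = [c["selected"] for c in clusters]
    subset = [c["selected"] for c in clusters
              if perspective in c["perspectives_present"]]
    if not subset or not any(overall):
        return False
    overall_rate = sum(overall) / len(overall)
    subset_rate = sum(subset) / len(subset)
    return subset_rate < min_ratio * overall_rate
```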

The bias auditor can’t prove the pipeline is unbiased — that’s an impossible standard. But it can detect systematic drift, which is the actionable concern. A pipeline that gradually skews over time can be corrected. A pipeline that’s never audited for skew will eventually drift without anyone noticing.

Error Handling: Fail Loud, Fail Safe

The pipeline has exactly one retry. If any stage fails, the system waits fifteen minutes and tries the entire pipeline again from the beginning. If the retry fails, the brief is skipped for that day.
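The retry policy is small enough to sketch in full (names are hypothetical):

```python
import time

RETRY_DELAY_SECONDS = 15 * 60  # wait fifteen minutes, then re-run from the top

def run_with_single_retry(run_pipeline) -> bool:
    """Run the whole pipeline; on any failure, retry exactly once from the beginning."""
    for attempt in (1, 2):
        try:
            run_pipeline()
            return True
        except Exception as exc:
            print(f"pipeline attempt {attempt} failed: {exc}")
            if attempt == 1:
                time.sleep(RETRY_DELAY_SECONDS)
    return False  # the brief is skipped for the day; no fallbacks
```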

There are no fallbacks. No “use yesterday’s brief with an updated date.” No “publish a partial brief with whatever succeeded.” The brief is either a complete, validated, internally consistent assessment — or it doesn’t exist for that day.

This philosophy is borrowed directly from the intelligence community. A briefing document that’s wrong is worse than no briefing document at all. The real PDB is occasionally late. It is never sloppy.

The GitHub Actions workflow that runs the pipeline sends no notification on success (success is the default state) and alerts on failure. The system is designed to be boring when it works, which is the kind of boring I’m happy to build.

And speaking of boring, if my tone seemed a bit more clinical in this article, it’s because this project occupies a unique slot in the Stalefish Labs portfolio. Building a daily news brief for citizens is serious business, so I’ve left most of the fun and games aside while working on this project. I want you to be able to trust it the same way I do. Feel free to reach out if you have any questions, suggestions, etc.


The Citizen’s Daily Brief is a free daily intelligence briefing from Stalefish Labs.
