Layer 2 — Story Clustering

Layer 2 connects related articles across sources into unified stories, tracking how narratives develop over time — from breaking news through to concluded coverage.

What Layer 2 does

Most news APIs return articles. Layer 2 returns stories — coherent narratives assembled from multiple articles across multiple sources covering the same real-world event.

When fifteen outlets cover the same diplomatic development, Layer 2 groups them into one story cluster with a single ID, tracks when new coverage arrives, and surfaces metadata about source diversity, narrative velocity, and lifecycle state. Your application queries stories, not the flood of individual articles behind them.

Prerequisite. Layer 2 operates on articles that have been processed by Layer 0 (embeddings + quality scoring) and Layer 1 (entity extraction + sentence semantics). An article without an embedding cannot be clustered, and a request that submits one returns a 500 error.

How clustering works

Layer 2 runs three operations on every new article, in order:

1. Idempotency check: if the article was already assigned to a cluster in a previous batch or pipeline run, that assignment is returned immediately. No reprocessing occurs.
2. Embedding similarity scan: cosine similarity is computed between the article's embedding and the average embedding of every active story cluster. Clusters scoring at or above the similarity threshold (0.55) become candidates. Candidates scoring 0.65 or above skip step 3 and are accepted without entity verification.
3. Entity overlap verification: for candidates scoring at least 0.55 but below 0.65, the named entities extracted by Layer 1 are compared. The candidate must share at least one entity (person, organization, or location) with the incoming article. The highest-scoring verified candidate receives the article. If no candidate qualifies, a new story cluster is created.
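
The three steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the service's implementation: the `Cluster` class, the `choose_cluster` function, and the in-memory `assignments` dictionary are hypothetical names, and the sketch assumes that directly accepted and entity-verified candidates compete on similarity score, which the text above does not spell out.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

SIMILARITY_THRESHOLD = 0.55      # minimum similarity to become a candidate
DIRECT_ACCEPT_THRESHOLD = 0.65   # at or above this, entity verification is skipped
MIN_SHARED_ENTITIES = 1          # entity overlap needed below the direct-accept threshold


@dataclass
class Cluster:  # hypothetical in-memory stand-in for a story cluster
    cluster_id: str
    centroid: list[float]
    entities: set[str] = field(default_factory=set)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def choose_cluster(article_id, embedding, entities, clusters, assignments) -> Optional[str]:
    """Return the cluster ID the article should join, or None to create a new cluster."""
    # Step 1: idempotency check.
    if article_id in assignments:
        return assignments[article_id]

    best_id, best_score = None, -1.0
    for cluster in clusters:
        # Step 2: embedding similarity scan against the cluster centroid.
        score = cosine(embedding, cluster.centroid)
        if score < SIMILARITY_THRESHOLD:
            continue
        # Step 3: entity overlap verification, only for mid-range scores.
        if score < DIRECT_ACCEPT_THRESHOLD:
            if len(entities & cluster.entities) < MIN_SHARED_ENTITIES:
                continue
        # Assumption: the highest-scoring qualifying candidate wins.
        if score > best_score:
            best_id, best_score = cluster.cluster_id, score
    return best_id
```

A return value of `None` corresponds to the "create a new story cluster" branch; the caller would allocate the new cluster and record the assignment.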

Story lifecycle

Layer 2 tracks each story through four lifecycle states based on article velocity and recency. State is recalculated each time a story is analyzed, either individually via POST /v1/story/{story_id}/analyze or in bulk via POST /v1/stories/analyze-batch.

State	Meaning	Conditions
`breaking`	Rapidly developing, very recent	>2 articles/hour, last update <2 hours ago
`developing`	Active coverage, moderate velocity	>0.5 articles/hour, last update <6 hours ago
`mature`	Slowing but recent	Last update within 48 hours
`concluded`	No recent activity	Last update >48 hours ago
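
As a rough illustration, the table can be read as a top-down rule list where the first matching row wins. The sketch below assumes that evaluation order; the function name `lifecycle_state` and its parameters are hypothetical and not part of the API.

```python
from datetime import datetime, timedelta, timezone


def lifecycle_state(articles_per_hour: float, last_updated: datetime, now: datetime) -> str:
    """Classify a story using the thresholds from the lifecycle table.

    Assumption: rows are checked in table order and the first match is returned.
    """
    age = now - last_updated
    if articles_per_hour > 2 and age < timedelta(hours=2):
        return "breaking"
    if articles_per_hour > 0.5 and age < timedelta(hours=6):
        return "developing"
    if age <= timedelta(hours=48):
        return "mature"
    return "concluded"


# Example: a slow story last updated three days ago.
now = datetime(2026, 3, 3, 23, 0, tzinfo=timezone.utc)
print(lifecycle_state(0.076, now - timedelta(hours=72), now))
```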

Source diversity scoring

Layer 2 categorizes each source into one of six types — tech, mainstream, business, policy, analysis, other — and computes a diversity score (0.0–1.0) for each story based on three weighted factors:

Source breadth (40%) — number of unique sources, capped at 10.
Category variety (40%) — number of distinct source types represented.
Balance (20%) — how evenly coverage is distributed across source types.

Score	Rating	Interpretation
≥ 0.8	Excellent	Broad, balanced multi-category coverage
0.6–0.79	Good	Multiple source types represented
0.4–0.59	Moderate	Some diversity, coverage skews one direction
< 0.4	Limited	Narrow source base, potential coverage gap
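
The weighting can be sketched as follows. The exact sub-formulas are not given above, so this sketch makes three labeled assumptions: breadth is the unique source count divided by the cap of 10, variety is the number of distinct types divided by six, and balance is the normalized Shannon entropy of sources per type. Scores from this sketch will not necessarily match the values the service returns.

```python
import math
from collections import Counter

SOURCE_TYPES = ("tech", "mainstream", "business", "policy", "analysis", "other")


def diversity_score(source_types: dict[str, str]) -> float:
    """source_types maps each unique source name to one of SOURCE_TYPES."""
    if not source_types:
        return 0.0
    counts = Counter(source_types.values())
    # Assumption: breadth scales linearly up to the cap of 10 unique sources.
    breadth = min(len(source_types), 10) / 10
    # Assumption: variety is the share of the six source types represented.
    variety = len(counts) / len(SOURCE_TYPES)
    # Assumption: balance is normalized Shannon entropy over the types present.
    if len(counts) > 1:
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
        balance = entropy / math.log(len(counts))
    else:
        balance = 0.0
    return round(0.4 * breadth + 0.4 * variety + 0.2 * balance, 3)


def diversity_rating(score: float) -> str:
    """Map a score to the rating bands in the table above."""
    if score >= 0.8:
        return "excellent"
    if score >= 0.6:
        return "good"
    if score >= 0.4:
        return "moderate"
    return "limited"
```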

Understanding the two scores

Layer 2 surfaces two score fields that measure different things:

Field	Scope	What it measures
`relevance_score`	Per article	Cosine similarity between that article's embedding and the cluster's average embedding at join time. The founding article of every cluster is always `1.0`. Subsequent articles are scored against the evolving cluster centroid.
`confidence_score`	Per cluster	A composite cluster-quality metric: `(source_diversity × 0.5) + (size_score × 0.3) + (avg_quality × 0.2)`. More sources and more articles push this score higher. New single-article clusters initialize at `1.0` by convention; the real formula applies once a second article joins.

relevance_score answers "how well does this article fit its cluster?" — confidence_score answers "how trustworthy is this cluster as a unified story?"
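
A minimal sketch of the `confidence_score` composite, assuming all three inputs are already normalized to the range 0.0 to 1.0. How `size_score` is derived from article count is not specified above, so it is taken here as an input; the function name and signature are hypothetical.

```python
def confidence_score(source_diversity: float, size_score: float,
                     avg_quality: float, article_count: int) -> float:
    """Weighted cluster-quality composite using the documented weights."""
    # Single-article clusters are initialized at 1.0 by convention.
    if article_count <= 1:
        return 1.0
    return source_diversity * 0.5 + size_score * 0.3 + avg_quality * 0.2


# Example: a two-article cluster with moderate diversity.
print(round(confidence_score(0.8, 0.5, 0.9, article_count=2), 2))
```

One practical consequence: a cluster's score can drop from `1.0` when its second article arrives, because that is the point at which the formula replaces the initial convention.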

Representative article

Every cluster has a representative_article_id — set to the founding article when the cluster is created and never updated. It is always the first article that triggered the cluster's existence, not necessarily the highest-quality or most-read.

Use the representative article as a stable anchor for display (headline, thumbnail, link preview). To rank a cluster's articles by fit, fetch /v1/story/{story_id}/articles and sort by relevance_score descending. The founding article always scores 1.0, so it sorts first; skip it to find the best-fitting follow-up article.
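
A small sketch of that client-side sort, operating on an already-parsed response from `GET /v1/story/{story_id}/articles`. The helper name `rank_by_relevance` and the `include_founding` flag are hypothetical; the field names are taken from the response format documented on this page.

```python
def rank_by_relevance(payload: dict, include_founding: bool = True) -> list[dict]:
    """Sort a story's articles by relevance_score, highest first."""
    articles = payload.get("articles", [])
    if not include_founding:
        # is_update is false only for the founding article.
        articles = [a for a in articles if a.get("is_update")]
    return sorted(articles, key=lambda a: a["relevance_score"], reverse=True)


payload = {
    "story_id": "clus_9x3k2m8f",
    "count": 2,
    "articles": [
        {"id": "art_abc123", "relevance_score": 1.0, "position": 1, "is_update": False},
        {"id": "art_def456", "relevance_score": 0.87, "position": 2, "is_update": True},
    ],
}
best_follow_up = rank_by_relevance(payload, include_founding=False)[0]
print(best_follow_up["id"])
```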

Endpoints

Base URL. All Layer 2 endpoints are served from https://layer2.api.polariapi.com. For example: POST https://layer2.api.polariapi.com/v1/cluster

Cluster single article

POST /v1/cluster

Cluster a single article by ID. Useful for testing and one-off processing. For production pipelines, prefer the batch endpoint.

Parameter	Type	Description
article_id (required)	string	ID of a Layer 1-processed article.

RESPONSE
{
  "cluster_id": "clus_9x3k2m8f",
  "confidence": 0.87,
  "is_new": false,
  "already_clustered": false
}

Cluster articles (batch)

POST /v1/cluster/batch

Submit multiple article IDs for clustering in a single request. Returns cluster assignments, full story metadata, and batch statistics. This is the primary endpoint called by the pipeline after Layer 1 processing.

REQUEST

{
  "article_ids": ["art_abc123", "art_def456", "art_ghi789"]
}

RESPONSE
{
  "clusters": [
    {
      "cluster_id": "clus_9x3k2m8f",
      "title": "Trump warns Iran time running out for negotiations",
      "article_ids": ["art_abc123", "art_def456"],
      "representative_id": "art_abc123",
      "article_count": 50,
      "confidence": 0.94,
      "is_new": false,
      "source_count": 3,
      "first_seen": "2026-02-01T11:00:59Z",
      "last_updated": "2026-02-28T23:00:19Z"
    }
  ],
  "singleton_ids": ["art_ghi789"],
  "stats": {
    "total_articles": 3,
    "clusters_formed": 1,
    "articles_clustered": 2,
    "singleton_articles": 1,
    "already_clustered": 0,
    "clustering_rate": 0.67
  }
}

Field	Description
clusters[].article_ids	Article IDs from this batch assigned to this cluster — not all articles in the cluster.
clusters[].article_count	Total articles in the cluster across all pipeline runs.
clusters[].representative_id	The founding article — set at cluster creation, never changed.
clusters[].is_new	`true` if this cluster was created by this request.
singleton_ids	Articles that did not match any existing cluster. Each was assigned its own new cluster with `is_new: true`.
stats.already_clustered	Articles that had existing assignments — returned immediately via idempotency check, no reprocessing.
stats.clustering_rate	Fraction of submitted articles that joined an existing cluster (not newly created).
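
Because `clusters[].article_ids` lists only the articles from the current batch, a caller usually wants a per-article view of the response. The sketch below builds one from a parsed batch response; the helper name `index_batch_response` is hypothetical, and the sample omits fields it does not use.

```python
def index_batch_response(response: dict) -> tuple[dict[str, str], list[str]]:
    """Return (article_id -> cluster_id for clustered articles, singleton article IDs)."""
    assignments: dict[str, str] = {}
    for cluster in response.get("clusters", []):
        for article_id in cluster["article_ids"]:
            assignments[article_id] = cluster["cluster_id"]
    singletons = list(response.get("singleton_ids", []))
    return assignments, singletons


response = {
    "clusters": [
        {"cluster_id": "clus_9x3k2m8f", "article_ids": ["art_abc123", "art_def456"], "is_new": False}
    ],
    "singleton_ids": ["art_ghi789"],
    "stats": {"total_articles": 3, "articles_clustered": 2, "clustering_rate": 0.67},
}
assignments, singletons = index_batch_response(response)

# In this sample, clustering_rate equals articles_clustered / total_articles.
stats = response["stats"]
rate = round(stats["articles_clustered"] / stats["total_articles"], 2)
print(assignments, singletons, rate)
```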

Get story

GET /v1/story/{story_id}

Retrieve metadata for a single story cluster.

RESPONSE
{
  "id": "clus_9x3k2m8f",
  "title": "Trump warns Iran time running out for negotiations",
  "article_count": 50,
  "source_count": 3,
  "confidence_score": 0.94,
  "first_seen": "2026-02-01T11:00:59Z",
  "last_updated": "2026-02-28T23:00:19Z",
  "status": "active",
  "avg_sentiment": -0.1056,
  "sentiment_distribution": {
    "positive": 0,
    "neutral": 44,
    "negative": 9
  }
}

List stories

GET /v1/stories

Paginate through active story clusters ordered by most recently updated.

Parameter	Type	Description
limit	integer	Results per page. Default: `20`. Max: `500`.
offset	integer	Pagination offset. Default: `0`.

RESPONSE
{
  "stories": [
    {
      "id": "clus_9x3k2m8f",
      "title": "US evacuates Beirut embassy staff as Iran war looms",
      "article_count": 53,
      "source_count": 8,
      "confidence_score": 0.91,
      "first_seen": "2026-05-01T09:14:00Z",
      "last_updated": "2026-05-26T14:20:00Z",
      "avg_sentiment": -0.1056,
      "sentiment_distribution": {
        "positive": 0,
        "neutral": 44,
        "negative": 9
      }
    }
  ],
  "count": 20
}

Field	Description
avg_sentiment	Mean sentiment score across all scored articles in the cluster. Range `−1.0` to `+1.0`. `null` if no articles have been sentiment-scored yet.
sentiment_distribution	Raw article counts by label — `positive`, `neutral`, `negative`. `null` if no articles scored.
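
Walking the full list means advancing `offset` by `limit` until a short page comes back. The sketch below keeps the HTTP call out of scope: `fetch_page` is a hypothetical callable that performs `GET /v1/stories?limit=...&offset=...` and returns the parsed JSON body. Stopping on a short page is an assumption, since the response does not document a total count.

```python
from typing import Callable, Iterator


def iter_stories(fetch_page: Callable[[int, int], dict], limit: int = 20) -> Iterator[dict]:
    """Yield every story by paging with limit/offset until a short page is returned."""
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        stories = page.get("stories", [])
        yield from stories
        # Assumption: fewer results than requested means the last page was reached.
        if len(stories) < limit:
            break
        offset += limit
```

Because stories are ordered by most recent update, a story that is updated while paging is in progress may shift position; callers that need an exact snapshot should deduplicate by `id`.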

Get story articles

GET /v1/story/{story_id}/articles

Returns all articles in a cluster, ordered by their position within the cluster.

RESPONSE
{
  "story_id": "clus_9x3k2m8f",
  "count": 2,
  "articles": [
    {
      "id": "art_abc123",
      "title": "Trump warns Iran time running out for nuclear deal",
      "source": "Reuters",
      "published_date": "2026-02-01T11:00:00Z",
      "relevance_score": 1.0,
      "position": 1,
      "is_update": false
    },
    {
      "id": "art_def456",
      "title": "Iran deal 'largely negotiated', Trump says",
      "source": "Bloomberg",
      "published_date": "2026-02-14T09:30:00Z",
      "relevance_score": 0.87,
      "position": 2,
      "is_update": true
    }
  ]
}

Field	Description
relevance_score	Cosine similarity to the cluster centroid at join time. The founding article is always `1.0`.
is_update	`false` for the founding article, `true` for all subsequent articles. Use this to distinguish original reporting from follow-up coverage.
position	Order in which the article joined the cluster. Position 1 is the founding article.

Story timeline

GET /v1/story/{story_id}/timeline

Returns the chronological development of a story — when each article was added, which sources joined, velocity over time, and automatically detected milestones.

RESPONSE
{
  "story_id": "clus_9x3k2m8f",
  "title": "Trump warns Iran time running out for negotiations",
  "timeline": {
    "first_seen": "2026-02-01T11:00:59Z",
    "last_updated": "2026-02-28T23:00:19Z",
    "total_articles": 50,
    "total_sources": 3,
    "duration_hours": 660.0
  },
  "events": [
    {
      "sequence": 1,
      "article_id": "art_abc123",
      "title": "Trump warns Iran time running out for nuclear deal",
      "source": "Reuters",
      "joined_at": "2026-02-01T11:00:59Z",
      "is_update": false,
      "is_new_source": true,
      "relevance_score": 1.0
    }
  ],
  "velocity_timeline": [
    { "timestamp": "2026-02-01T11:00:59Z", "velocity": 0.5 }
  ],
  "milestones": [
    {
      "type": "story_created",
      "timestamp": "2026-02-01T11:00:59Z",
      "description": "Story created with: Trump warns Iran time running out..."
    },
    {
      "type": "coverage_expansion",
      "timestamp": "2026-02-08T14:22:00Z",
      "description": "Coverage expanded to 3 sources"
    }
  ]
}

Milestone type	Description
`story_created`	First article added; cluster initialized.
`coverage_expansion`	Coverage reached a new multiple of 3 sources.
`activity_spike`	Velocity exceeded 2× the story's average rate.
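
The milestone rules can be approximated from the `events` and `velocity_timeline` arrays of a timeline response. This is an illustrative reconstruction, not the service's detection logic: the function name `detect_milestones` is hypothetical, and it assumes the story's average rate is the mean of the `velocity_timeline` points.

```python
from statistics import mean


def detect_milestones(events: list[dict], velocity_timeline: list[dict]) -> list[dict]:
    """Derive milestones from timeline events (in sequence order) and velocity points."""
    milestones = []
    source_count = 0
    for index, event in enumerate(events):
        if index == 0:
            milestones.append({"type": "story_created", "timestamp": event["joined_at"]})
        if event["is_new_source"]:
            source_count += 1
            # A new multiple of 3 sources marks a coverage expansion.
            if source_count % 3 == 0:
                milestones.append({
                    "type": "coverage_expansion",
                    "timestamp": event["joined_at"],
                    "description": f"Coverage expanded to {source_count} sources",
                })
    if velocity_timeline:
        # Assumption: the average rate is the mean of all velocity points.
        average = mean(point["velocity"] for point in velocity_timeline)
        for point in velocity_timeline:
            if average > 0 and point["velocity"] > 2 * average:
                milestones.append({"type": "activity_spike", "timestamp": point["timestamp"]})
    # ISO 8601 UTC timestamps in the same format sort correctly as strings.
    return sorted(milestones, key=lambda m: m["timestamp"])
```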

Source diversity analysis

GET /v1/story/{story_id}/sources

Returns source breakdown, diversity score, and coverage gaps for a story cluster.

RESPONSE
{
  "story_id": "clus_9x3k2m8f",
  "total_sources": 3,
  "diversity_score": 0.42,
  "diversity_rating": "moderate",
  "source_distribution": {
    "mainstream": {
      "count": 2,
      "sources": ["Reuters", "Bloomberg"],
      "percentage": 66.7
    },
    "business": {
      "count": 1,
      "sources": ["Financial Times"],
      "percentage": 33.3
    }
  },
  "coverage_gaps": ["tech", "policy", "analysis"],
  "all_sources": ["Bloomberg", "Financial Times", "Reuters"]
}

Analyze story

POST /v1/story/{story_id}/analyze

Recalculates and persists enriched metadata for a story — diversity score, lifecycle state, and article velocity.

Not called automatically by the clustering pipeline. When articles are added, the core clustering worker updates only article_count, source_count, confidence_score, and the cluster's average embedding. The richer metadata (lifecycle_state, source_diversity_score, article_velocity) is written only when this endpoint or POST /v1/stories/analyze-batch is called explicitly. Call this endpoint after clustering batches, or run /v1/stories/analyze-batch on a schedule.

RESPONSE
{
  "story_id": "clus_9x3k2m8f",
  "updated": true,
  "metrics": {
    "diversity_score": 0.42,
    "lifecycle_state": "concluded",
    "velocity": 0.076,
    "unique_sources": 3
  }
}

Batch analyze

POST /v1/stories/analyze-batch

Runs analysis across active stories, most recently updated first, up to limit stories per run, updating diversity scores, lifecycle states, and velocity metrics. Intended as a periodic background job; hourly is typical.

Parameter	Type	Description
limit	integer	Maximum stories to analyze per run. Default: `100`. Stories are ordered by most recently updated.

RESPONSE
{
  "analyzed": 100,
  "updated": 100,
  "errors": 0,
  "error_details": []
}

Story relationships

GET /v1/stories/relationships

Computes pairwise entity overlap between a set of story clusters. Returns weighted edges between clusters that share meaningful named entities — used to build relationship graphs between concurrent stories.

This endpoint powers the inter-topic arc rendering in constellation visualizations. It supplements Layer 3 relationship data when direct cluster-to-cluster edges are sparse.

Parameter	Type	Description
ids (required)	string	Comma-separated list of cluster IDs. Min: `2`. Max: `20`.
min_shared	integer	Minimum shared entities required to return an edge. Default: `2`.

BASH

curl "https://layer2.api.polariapi.com/v1/stories/relationships?ids=clus_9x3k2m8f,clus_7d9e622d&min_shared=3" \
  -H "Authorization: Bearer pk_live_your_key"

RESPONSE
{
  "cluster_count": 2,
  "relationships": [
    {
      "source": "clus_9x3k2m8f",
      "target": "clus_7d9e622d",
      "confidence": 0.158,
      "type": "shared_entities",
      "shared_count": 81,
      "shared_entities": [
        "ali larijani",
        "ayatollah ali khamenei",
        "benjamin netanyahu"
      ]
    }
  ],
  "computed_from": "layer1_entities"
}

Field	Description
confidence	Jaccard similarity of the two clusters' entity sets — shared entities divided by union of all entities. Range `0.0` to `1.0`.
shared_entities	Up to 10 normalized entity strings driving the connection. Common stop-words and boilerplate terms are filtered before scoring.
computed_from	Always `layer1_entities` — computed on demand from the JSONB entity fields on member articles, not from pre-stored relationship records.
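
The edge computation reduces to a Jaccard similarity over two entity sets. The sketch below assumes entities are already normalized and stop-word filtered; the function name `relationship_edge` is hypothetical, and returning the first ten shared entities in alphabetical order is an assumption, since the selection rule is not documented.

```python
from typing import Optional


def relationship_edge(source_id: str, source_entities: set[str],
                      target_id: str, target_entities: set[str],
                      min_shared: int = 2) -> Optional[dict]:
    """Return a shared-entity edge between two clusters, or None if below min_shared."""
    shared = source_entities & target_entities
    if len(shared) < min_shared:
        return None
    union = source_entities | target_entities
    return {
        "source": source_id,
        "target": target_id,
        # Jaccard similarity: |intersection| / |union|.
        "confidence": round(len(shared) / len(union), 3),
        "type": "shared_entities",
        "shared_count": len(shared),
        # Assumption: alphabetical order, truncated to 10 entries.
        "shared_entities": sorted(shared)[:10],
    }


a = {"ali larijani", "benjamin netanyahu", "tehran"}
b = {"ali larijani", "benjamin netanyahu", "beirut", "hezbollah"}
print(relationship_edge("clus_9x3k2m8f", a, "clus_7d9e622d", b))
```

Note that a large `shared_count` can still produce a low `confidence` when both clusters carry many entities, because the union grows with cluster size.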

Cross-briefing clustering

Layer 2 clusters persist across pipeline runs. An article processed in Monday's briefing and an article from Thursday's briefing can join the same story cluster if they meet the similarity and entity overlap thresholds.

This is the primary mechanism behind long-running story tracking. In production, a single story thread has accumulated 50 articles from 8 distinct sources over 28 days without any manual intervention.

New articles are evaluated against the full pool of existing active clusters regardless of when those clusters were created. There is no time-based expiry on cluster candidacy.

Singletons

A singleton is a story cluster with exactly one article — an article that did not match any existing cluster at the time it was processed. Singletons are normal and expected: genuinely unique news events should not be forced into existing clusters.

Singletons become multi-article clusters when subsequent coverage of the same event arrives through later pipeline runs. A story that starts as a singleton at 9am may have 12 articles from 5 sources by end of day.

Singleton IDs are returned in the singleton_ids array of the batch clustering response. They have valid cluster IDs and can be queried like any other story.

Performance

Operation	Typical latency
Health check	1ms
`GET /v1/story/{id}`	3ms
`GET /v1/story/{id}/articles`	3ms
`GET /v1/story/{id}/timeline`	3ms
`POST /v1/story/{id}/analyze`	7ms
`POST /v1/stories/analyze-batch` (100 stories)	495ms
`POST /v1/cluster/batch` — already-clustered articles	<2ms per article
`POST /v1/cluster/batch` — new articles	50–270ms per article

Clustering cost for new articles scales with candidate pool size — each article's embedding is compared against all active story clusters. At current scale (~8,000 active stories) this remains well within latency targets. Already-clustered articles short-circuit immediately and add negligible overhead to batch calls.

Error responses

Status	Cause
`404`	Story or article ID not found.
`422`	Missing or malformed request body — e.g. `article_ids` array absent from batch request.
`500`	Internal error. Most commonly caused by submitting an article that has not completed Layer 0 and Layer 1 processing: without an embedding, it cannot be clustered.

ERROR
{
  "detail": "Article art_abc123 not found or missing embedding"
}
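
A client can map these statuses to a next action. The sketch below is one possible policy, not an official recommendation: the function name `next_action` and the action labels are hypothetical, and matching on the text of `detail` assumes the message wording shown above stays stable.

```python
def next_action(status: int, body: dict) -> str:
    """Suggest a client-side follow-up for a Layer 2 error response."""
    detail = str(body.get("detail", "")).lower()
    if status == 404:
        return "check_id"            # story or article ID does not exist
    if status == 422:
        return "fix_request"         # request body missing or malformed
    if status == 500:
        if "missing embedding" in detail:
            # Article likely skipped upstream processing; re-run Layer 0 and Layer 1.
            return "reprocess_upstream"
        return "retry_later"         # assumption: other 500s may be transient
    return "unknown"


print(next_action(500, {"detail": "Article art_abc123 not found or missing embedding"}))
```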
          

Clustering parameters

Parameter	Default	Description
Similarity threshold	0.55	Minimum cosine similarity for a cluster to enter the candidate pool. Lower values produce more aggressive clustering; higher values are stricter.
Entity overlap threshold	1	Minimum shared named entities required to confirm a candidate match. Only applied to candidates scoring below 0.65.
Max cluster size	50	Maximum articles per cluster. Clusters at capacity are excluded from the candidate search entirely. New articles that would have matched a full cluster will join a different qualifying cluster or create a new one — no error is returned.

These parameters are set at the platform level and are not configurable per API key. Contact hello@polariapi.com if your use case requires custom thresholds.
