Layer 2 — Story Clustering
Layer 2 connects related articles across sources into unified stories, tracking how narratives develop over time — from breaking news through to concluded coverage.
What Layer 2 does
Most news APIs return articles. Layer 2 returns stories — coherent narratives assembled from multiple articles across multiple sources covering the same real-world event.
When fifteen outlets cover the same diplomatic development, Layer 2 groups them into one story cluster with a single ID, tracks when new coverage arrives, and surfaces metadata about source diversity, narrative velocity, and lifecycle state. Your application queries stories, not the flood of individual articles behind them.
How clustering works
Layer 2 runs three operations on every new article, in order:
- Idempotency check: if the article has already been assigned to a cluster in a previous batch or pipeline run, that assignment is returned immediately. No reprocessing occurs.
- Embedding similarity scan: cosine similarity is computed between the article's embedding and the average embedding of every active story cluster. Candidates that clear the similarity threshold (0.55) advance to entity overlap verification. Candidates scoring 0.65 or above skip that verification and are accepted directly.
- Entity overlap verification: for candidates scoring at least 0.55 but below 0.65, shared named entities extracted by Layer 1 are compared. The candidate must share at least one entity (person, organization, or location) with the incoming article. The highest-scoring qualifying candidate receives the article. If no candidate qualifies, a new story cluster is created.
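The three operations can be sketched in a few lines of Python. This is an illustrative sketch only, not the service's implementation: the `Cluster` class, the `assign` function, and the in-memory `assignments` dictionary are hypothetical names, and the thresholds are the documented defaults.

```python
import math
from dataclasses import dataclass, field
from typing import Sequence

SIMILARITY_THRESHOLD = 0.55  # minimum cosine similarity to become a candidate
DIRECT_ACCEPT = 0.65         # at or above this, entity verification is skipped
MIN_SHARED_ENTITIES = 1      # entity overlap required below DIRECT_ACCEPT
MAX_CLUSTER_SIZE = 50        # full clusters are excluded from the search


@dataclass
class Cluster:  # hypothetical in-memory stand-in for a story cluster
    cluster_id: str
    centroid: Sequence[float]
    entities: set = field(default_factory=set)
    article_count: int = 1


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(article_id, embedding, entities, clusters, assignments):
    """Return the matching cluster ID, or None if a new cluster is needed."""
    # Idempotency check: reuse any earlier assignment.
    if article_id in assignments:
        return assignments[article_id]
    best_id, best_score = None, 0.0
    for cluster in clusters:
        if cluster.article_count >= MAX_CLUSTER_SIZE:
            continue
        # Embedding similarity scan against the cluster's average embedding.
        score = cosine(embedding, cluster.centroid)
        if score < SIMILARITY_THRESHOLD:
            continue
        # Entity overlap verification, only for scores below DIRECT_ACCEPT.
        if score < DIRECT_ACCEPT:
            if len(entities & cluster.entities) < MIN_SHARED_ENTITIES:
                continue
        if score > best_score:
            best_id, best_score = cluster.cluster_id, score
    return best_id
```

A return value of `None` corresponds to the "create a new story cluster" outcome; the caller would then create the cluster and record the assignment.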
Story lifecycle
Layer 2 tracks each story through four lifecycle states based on article velocity and recency. State is recalculated each time a story is analyzed via POST /v1/story/{story_id}/analyze.
| State | Meaning | Conditions |
|---|---|---|
| breaking | Rapidly developing, very recent | >2 articles/hour, last update <2 hours ago |
| developing | Active coverage, moderate velocity | >0.5 articles/hour, last update <6 hours ago |
| mature | Slowing but recent | Last update within 48 hours |
| concluded | No recent activity | Last update >48 hours ago |
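The lifecycle conditions can be read as a first-match-wins rule set. The sketch below assumes the states are checked from breaking down to concluded and that "within 48 hours" is inclusive; neither detail is stated here, and `lifecycle_state` is a hypothetical helper name.

```python
def lifecycle_state(articles_per_hour, hours_since_update):
    """Classify a story using the documented thresholds.

    Assumption: rules are evaluated top to bottom and the first match wins.
    """
    if articles_per_hour > 2 and hours_since_update < 2:
        return "breaking"
    if articles_per_hour > 0.5 and hours_since_update < 6:
        return "developing"
    if hours_since_update <= 48:
        return "mature"
    return "concluded"
```

Under this reading, a story with high velocity but a last update three hours ago is developing rather than breaking, because the recency condition fails first.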
Source diversity scoring
Layer 2 categorizes each source into one of six types (tech, mainstream, business, policy, analysis, other) and computes a diversity score (0.0–1.0) for each story based on three weighted factors:
- Source breadth (40%) — number of unique sources, capped at 10.
- Category variety (40%) — number of distinct source types represented.
- Balance (20%) — how evenly coverage is distributed across source types.
| Score | Rating | Interpretation |
|---|---|---|
| ≥ 0.8 | Excellent | Broad, balanced multi-category coverage |
| 0.6–0.79 | Good | Multiple source types represented |
| 0.4–0.59 | Moderate | Some diversity, coverage skews one direction |
| < 0.4 | Limited | Narrow source base, potential coverage gap |
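The weighting can be illustrated with a small sketch. Only the three weights, the cap of 10 sources, and the rating bands come from this page. How each factor is normalized is not specified, so the code assumes breadth is unique sources divided by 10, variety is distinct types divided by 6, and balance is normalized Shannon entropy over source types. The function names are hypothetical.

```python
import math
from collections import Counter

SOURCE_TYPES = ("tech", "mainstream", "business", "policy", "analysis", "other")


def diversity_score(articles):
    """articles: list of (source_name, source_type) tuples, one per article."""
    if not articles:
        return 0.0
    unique_sources = {name for name, _ in articles}
    type_counts = Counter(kind for _, kind in articles)
    # Assumed normalizations; only the weights and the cap are documented.
    breadth = min(len(unique_sources), 10) / 10
    variety = len(type_counts) / len(SOURCE_TYPES)
    if len(type_counts) > 1:
        total = sum(type_counts.values())
        entropy = -sum((n / total) * math.log(n / total)
                       for n in type_counts.values())
        balance = entropy / math.log(len(type_counts))
    else:
        balance = 0.0
    return round(0.4 * breadth + 0.4 * variety + 0.2 * balance, 3)


def rating(score):
    """Map a diversity score to the documented rating bands."""
    if score >= 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Good"
    if score >= 0.4:
        return "Moderate"
    return "Limited"
```

With these assumptions, a story covered by a single outlet lands in the Limited band, while a story with ten or more outlets spread evenly over all six types reaches the maximum score.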
Understanding the two scores
Layer 2 surfaces two score fields that measure different things:
| Field | Scope | What it measures |
|---|---|---|
| relevance_score | Per article | Cosine similarity between that article's embedding and the cluster's average embedding at join time. The founding article of every cluster is always 1.0. Subsequent articles are scored against the evolving cluster centroid. |
| confidence_score | Per cluster | A composite cluster-quality metric: (source_diversity × 0.5) + (size_score × 0.3) + (avg_quality × 0.2). More sources and more articles push this score higher. New single-article clusters initialize at 1.0 by convention; the real formula applies once a second article joins. |
relevance_score answers "how well does this article fit its cluster?" —
confidence_score answers "how trustworthy is this cluster as a unified story?"
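As a worked illustration of the confidence_score formula, the sketch below applies the documented weights. How size_score and avg_quality are derived is not described on this page, so both are taken as inputs assumed to lie between 0.0 and 1.0; the function name is hypothetical.

```python
def confidence_score(source_diversity, size_score, avg_quality, article_count):
    """Composite cluster-quality score using the documented weights."""
    if article_count <= 1:
        # Single-article clusters initialize at 1.0 by convention.
        return 1.0
    return source_diversity * 0.5 + size_score * 0.3 + avg_quality * 0.2
```

For example, a five-article cluster with source_diversity 0.8, size_score 0.5, and avg_quality 0.9 works out to 0.40 + 0.15 + 0.18 = 0.73.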
Representative article
Every cluster has a representative_article_id, set to the founding article when the cluster is created and never updated. It is always the first article that triggered the cluster's existence, not necessarily the highest-quality or most-read.
Use the representative article as a stable anchor for display (headline, thumbnail, link preview). To find the highest-relevance article in a cluster, fetch /v1/story/{id}/articles and sort by relevance_score descending.
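Because the founding article always carries a relevance_score of 1.0, a plain descending sort will tend to put it first. One possible approach, sketched here, ranks only follow-up articles by filtering on is_update. The response is assumed to be a list of dictionaries, and the article_id key is a hypothetical field name.

```python
def rank_follow_ups(articles):
    """Sort non-founding articles by relevance_score, highest first."""
    follow_ups = [a for a in articles if a.get("is_update")]
    return sorted(follow_ups, key=lambda a: a["relevance_score"], reverse=True)


# Made-up sample data for illustration only.
sample = [
    {"article_id": "a1", "relevance_score": 1.0, "is_update": False},
    {"article_id": "a2", "relevance_score": 0.71, "is_update": True},
    {"article_id": "a3", "relevance_score": 0.88, "is_update": True},
]
ranked = rank_follow_ups(sample)
```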
Endpoints
All endpoints are relative to the base URL https://layer2.api.polariapi.com. For example:
POST https://layer2.api.polariapi.com/v1/cluster
Cluster single article
Cluster a single article by ID. Useful for testing and one-off processing. For production pipelines, prefer the batch endpoint.
| Parameter | Type | Description |
|---|---|---|
| article_id (required) | string | ID of a Layer 1-processed article. |
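A request to this endpoint might be built as follows using only the Python standard library. This sketch assumes article_id is sent in a JSON body and that no extra headers are needed; neither is confirmed on this page, and any authentication requirements are not covered here. The helper name and the example ID are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://layer2.api.polariapi.com"


def build_cluster_request(article_id):
    """Build (but do not send) a POST /v1/cluster request."""
    # Assumption: the parameter travels as a JSON body.
    body = json.dumps({"article_id": article_id}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/v1/cluster",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_cluster_request("example-article-id")
# To send it: urllib.request.urlopen(request)
```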
Cluster articles (batch)
Submit multiple article IDs for clustering in a single request. Returns cluster assignments, full story metadata, and batch statistics. This is the primary endpoint called by the pipeline after Layer 1 processing.
| Field | Description |
|---|---|
| clusters[].article_ids | Article IDs from this batch assigned to this cluster — not all articles in the cluster. |
| clusters[].article_count | Total articles in the cluster across all pipeline runs. |
| clusters[].representative_id | The founding article — set at cluster creation, never changed. |
| clusters[].is_new | true if this cluster was created by this request. |
| singleton_ids | Articles that did not match any existing cluster. Each was assigned its own new cluster with is_new: true. |
| stats.already_clustered | Articles that had existing assignments — returned immediately via idempotency check, no reprocessing. |
| stats.clustering_rate | Fraction of submitted articles that joined an existing cluster (not newly created). |
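The distinction between article_ids (this batch only) and is_new matters when post-processing a response. The sketch below separates articles that joined existing clusters from those that started new ones. The sample payload is invented for illustration, and the cluster_id key is an assumed field name not listed in the table above.

```python
def split_assignments(response):
    """Return (joined, created): article ID -> cluster ID mappings."""
    joined, created = {}, {}
    for cluster in response["clusters"]:
        target = created if cluster["is_new"] else joined
        # article_ids holds only the articles from this batch.
        for article_id in cluster["article_ids"]:
            target[article_id] = cluster["cluster_id"]
    return joined, created


# Invented sample payload using the documented field names.
sample_response = {
    "clusters": [
        {"cluster_id": "c1", "article_ids": ["a1", "a2"],
         "article_count": 14, "is_new": False},
        {"cluster_id": "c2", "article_ids": ["a3"],
         "article_count": 1, "is_new": True},
    ],
    "singleton_ids": ["a3"],
}
joined, created = split_assignments(sample_response)
```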
Get story
Retrieve metadata for a single story cluster.
List stories
Paginate through active story clusters ordered by most recently updated.
| Parameter | Type | Description |
|---|---|---|
| limit | integer | Results per page. Default: 20. Max: 500. |
| offset | integer | Pagination offset. Default: 0. |
| Field | Description |
|---|---|
| avg_sentiment | Mean sentiment score across all scored articles in the cluster. Range −1.0 to +1.0. null if no articles have been sentiment-scored yet. |
| sentiment_distribution | Raw article counts by label: positive, neutral, negative. null if no articles scored. |
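Paging through the list with limit and offset can be wrapped in a generator. In this sketch, `fetch_page` is a hypothetical callable supplied by the caller that performs the HTTP request and returns one page as a list; the endpoint path and response envelope are deliberately left out because they are not shown here.

```python
def iter_stories(fetch_page, limit=20):
    """Yield every story by walking limit/offset pages until one comes up short."""
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        if not page:
            return
        yield from page
        if len(page) < limit:
            return
        offset += limit
```

Because stories are ordered by most recently updated, the ordering can shift while paging, so a caller that needs exactly-once processing may want to deduplicate by story ID.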
Get story articles
Returns all articles in a cluster, ordered by their position within the cluster.
| Field | Description |
|---|---|
| relevance_score | Cosine similarity to the cluster centroid at join time. The founding article is always 1.0. |
| is_update | false for the founding article, true for all subsequent articles. Use this to distinguish original reporting from follow-up coverage. |
| position | Order in which the article joined the cluster. Position 1 is the founding article. |
Story timeline
Returns the chronological development of a story — when each article was added, which sources joined, velocity over time, and automatically detected milestones.
| Milestone type | Description |
|---|---|
| story_created | First article added; cluster initialized. |
| coverage_expansion | Coverage reached a new multiple of 3 sources. |
| activity_spike | Velocity exceeded 2× the story's average rate. |
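Two of the milestone rules can be sketched directly from their descriptions. This is an illustrative reading, not the service's detection logic: it assumes coverage_expansion fires when the count of unique sources reaches 3, 6, 9, and so on. It omits activity_spike because the velocity window is not specified here. The function name and tuple layout are hypothetical.

```python
def detect_milestones(articles):
    """articles: chronological list of (timestamp, source) tuples."""
    milestones = []
    seen_sources = set()
    for index, (timestamp, source) in enumerate(articles):
        if index == 0:
            milestones.append((timestamp, "story_created"))
        if source not in seen_sources:
            seen_sources.add(source)
            # Assumed rule: fire on every new multiple of 3 unique sources.
            if len(seen_sources) % 3 == 0:
                milestones.append((timestamp, "coverage_expansion"))
    return milestones
```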
Source diversity analysis
Returns source breakdown, diversity score, and coverage gaps for a story cluster.
Analyze story
Recalculates and persists enriched metadata for a story — diversity score, lifecycle state, and article velocity.
Adding articles to a cluster updates only article_count, source_count, confidence_score, and the cluster's average embedding. Richer metadata (lifecycle_state, source_diversity_score, article_velocity) is only written when this endpoint or the batch analyze endpoint is called explicitly. Call it after clustering batches, or run /v1/stories/analyze-batch on a schedule.
Batch analyze
Runs analysis across all active stories, updating diversity scores, lifecycle states, and velocity metrics. Intended as a periodic background job — hourly is typical.
| Parameter | Type | Description |
|---|---|---|
| limit | integer | Maximum stories to analyze per run. Default: 100. Stories are ordered by most recently updated. |
Story relationships
Computes pairwise entity overlap between a set of story clusters. Returns weighted edges between clusters that share meaningful named entities — used to build relationship graphs between concurrent stories.
This endpoint powers the inter-topic arc rendering in constellation visualizations. It supplements Layer 3 relationship data when direct cluster-to-cluster edges are sparse.
| Parameter | Type | Description |
|---|---|---|
| ids (required) | string | Comma-separated list of cluster IDs. Min: 2. Max: 20. |
| min_shared | integer | Minimum shared entities required to return an edge. Default: 2. |
| Field | Description |
|---|---|
| confidence | Jaccard similarity of the two clusters' entity sets: shared entities divided by the union of all entities. Range 0.0 to 1.0. |
| shared_entities | Up to 10 normalized entity strings driving the connection. Common stop-words and boilerplate terms are filtered before scoring. |
| computed_from | Always layer1_entities. Computed on demand from the JSONB entity fields on member articles, not from pre-stored relationship records. |
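The confidence value is a standard Jaccard similarity, which can be reproduced client-side for a pair of entity sets. The sketch below assumes entities are already normalized and filtered; the function name and the alphabetical ordering of shared_entities are assumptions.

```python
def entity_edge(entities_a, entities_b, min_shared=2):
    """Return an edge dict for two clusters, or None if overlap is too small."""
    shared = entities_a & entities_b
    if len(shared) < min_shared:
        return None
    union = entities_a | entities_b
    return {
        "confidence": len(shared) / len(union),  # Jaccard similarity
        "shared_entities": sorted(shared)[:10],  # ordering is an assumption
        "computed_from": "layer1_entities",
    }
```

For example, clusters with entity sets {nato, eu, un} and {nato, eu, imf, who} share 2 of 5 distinct entities, giving a confidence of 0.4.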
Cross-briefing clustering
Layer 2 clusters persist across pipeline runs. An article processed in Monday's briefing and an article from Thursday's briefing can join the same story cluster if they meet the similarity and entity overlap thresholds.
This is the primary mechanism behind long-running story tracking. In production, clusters have accumulated 50 articles across 28 days from 8 distinct sources on a single story thread — without any manual intervention.
New articles are evaluated against the full pool of existing active clusters regardless of when those clusters were created. There is no time-based expiry on cluster candidacy.
Singletons
A singleton is a story cluster with exactly one article — an article that did not match any existing cluster at the time it was processed. Singletons are normal and expected: genuinely unique news events should not be forced into existing clusters.
Singletons become multi-article clusters when subsequent coverage of the same event arrives through later pipeline runs. A story that starts as a singleton at 9am may have 12 articles from 5 sources by end of day.
Singleton IDs are returned in the singleton_ids array of the batch clustering response. They have valid cluster IDs and can be queried like any other story.
Performance
| Operation | Typical latency |
|---|---|
| Health check | 1ms |
| GET /v1/story/{id} | 3ms |
| GET /v1/story/{id}/articles | 3ms |
| GET /v1/story/{id}/timeline | 3ms |
| POST /v1/story/{id}/analyze | 7ms |
| POST /v1/stories/analyze-batch (100 stories) | 495ms |
| POST /v1/cluster/batch (already-clustered articles) | <2ms per article |
| POST /v1/cluster/batch (new articles) | 50–270ms per article |
Clustering cost for new articles scales with candidate pool size — each article's embedding is compared against all active story clusters. At current scale (~8,000 active stories) this remains well within latency targets. Already-clustered articles short-circuit immediately and add negligible overhead to batch calls.
Error responses
| Status | Cause |
|---|---|
| 404 | Story or article ID not found. |
| 422 | Missing or malformed request body, e.g. article_ids array absent from batch request. |
| 500 | Internal error. Most commonly caused by submitting an article that has not been processed through Layer 1. It will have no embedding and cannot be clustered. |
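A client might translate these statuses into next steps rather than retrying blindly. The mapping below is one possible sketch; the function name and the returned labels are hypothetical and not part of the API.

```python
def describe_failure(status):
    """Map a documented error status to a suggested next step."""
    if status == 404:
        return "check_ids"                 # story or article ID not found
    if status == 422:
        return "fix_request_body"          # e.g. article_ids missing
    if status == 500:
        return "verify_layer1_processing"  # article may lack an embedding
    return "unexpected"
```

Treating 500 as a prompt to confirm Layer 1 processing, instead of an automatic retry, follows from the most common cause listed above.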
Clustering parameters
| Parameter | Default | Description |
|---|---|---|
| Similarity threshold | 0.55 | Minimum cosine similarity for a cluster to enter the candidate pool. Lower values produce more aggressive clustering; higher values are stricter. |
| Entity overlap threshold | 1 | Minimum shared named entities required to confirm a candidate match. Only applied to candidates scoring below 0.65. |
| Max cluster size | 50 | Maximum articles per cluster. Clusters at capacity are excluded from the candidate search entirely. New articles that would have matched a full cluster will join a different qualifying cluster or create a new one — no error is returned. |