Layer 2 — Story Clustering

Layer 2 connects related articles across sources into unified stories, tracking how narratives develop over time — from breaking news through to concluded coverage.


What Layer 2 does

Most news APIs return articles. Layer 2 returns stories — coherent narratives assembled from multiple articles across multiple sources covering the same real-world event.

When fifteen outlets cover the same diplomatic development, Layer 2 groups them into one story cluster with a single ID, tracks when new coverage arrives, and surfaces metadata about source diversity, narrative velocity, and lifecycle state. Your application queries stories, not the flood of individual articles behind them.

Prerequisite. Layer 2 operates on articles that have been processed by Layer 0 (embeddings + quality scoring) and Layer 1 (entity extraction + sentence semantics). An article without an embedding cannot be clustered; submitting one returns a 500 error.

How clustering works

Layer 2 runs three operations on every new article, in order:

  1. Idempotency check: if the article has already been assigned to a cluster in a previous batch or pipeline run, that assignment is returned immediately. No reprocessing occurs.
  2. Embedding similarity scan: cosine similarity is computed between the article's embedding and the average embedding of every active story cluster. Clusters scoring below the similarity threshold (0.55) are discarded. Candidates scoring 0.65 or above are accepted directly and skip stage three; candidates scoring at least 0.55 but below 0.65 advance to stage three.
  3. Entity overlap verification: each candidate scoring below 0.65 must share at least one named entity extracted by Layer 1 (person, organization, or location) with the incoming article. The highest-scoring qualifying candidate receives the article. If no candidate qualifies, a new story cluster is created.
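
The three stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the service's actual implementation: the dictionary shapes (`embedding`, `centroid`, `entities`) and the function names are hypothetical, and only the thresholds come from this page.

```python
import math

SIMILARITY_THRESHOLD = 0.55  # minimum similarity to become a candidate
DIRECT_ACCEPT = 0.65         # at or above this, the entity check is skipped
MIN_SHARED_ENTITIES = 1      # entity overlap threshold


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_cluster(article, clusters, assignments):
    """Return (cluster_id, already_clustered).

    cluster_id is None when no candidate qualifies, meaning the caller
    should create a new cluster. All data shapes here are assumptions.
    """
    # Stage 1: idempotency check.
    if article["id"] in assignments:
        return assignments[article["id"]], True

    best_id, best_score = None, -1.0
    for cluster in clusters:
        # Stage 2: similarity against the cluster's average embedding.
        score = cosine_similarity(article["embedding"], cluster["centroid"])
        if score < SIMILARITY_THRESHOLD:
            continue
        # Stage 3: entity overlap, applied only to borderline candidates.
        if score < DIRECT_ACCEPT:
            shared = set(article["entities"]) & set(cluster["entities"])
            if len(shared) < MIN_SHARED_ENTITIES:
                continue
        if score > best_score:
            best_id, best_score = cluster["id"], score
    return best_id, False
```

A linear scan over all clusters mirrors the description above; a production system might well use a vector index instead, which this page does not specify.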

Story lifecycle

Layer 2 tracks each story through four lifecycle states based on article velocity and recency. State is recalculated each time a story is analyzed via POST /v1/story/{story_id}/analyze.

| State | Meaning | Conditions |
| --- | --- | --- |
| breaking | Rapidly developing, very recent | >2 articles/hour, last update <2 hours ago |
| developing | Active coverage, moderate velocity | >0.5 articles/hour, last update <6 hours ago |
| mature | Slowing but recent | Last update within 48 hours |
| concluded | No recent activity | Last update >48 hours ago |
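
The state table can be read as a rule cascade. The sketch below assumes the rules are evaluated top to bottom and that velocity has already been computed in articles per hour; the function name and the evaluation order are assumptions, since this page does not state how ties between rows are resolved.

```python
def lifecycle_state(articles_per_hour, hours_since_update):
    """Map velocity and recency to a lifecycle state, checking rows in table order."""
    if articles_per_hour > 2 and hours_since_update < 2:
        return "breaking"
    if articles_per_hour > 0.5 and hours_since_update < 6:
        return "developing"
    if hours_since_update <= 48:
        return "mature"
    return "concluded"
```

For example, a story with high velocity whose last update was four hours ago falls through the first rule and lands in `developing`.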

Source diversity scoring

Layer 2 categorizes each source into one of six types — tech, mainstream, business, policy, analysis, other — and computes a diversity score (0.0–1.0) for each story based on three weighted factors:

  • Source breadth (40%) — number of unique sources, capped at 10.
  • Category variety (40%) — number of distinct source types represented.
  • Balance (20%) — how evenly coverage is distributed across source types.

| Score | Rating | Interpretation |
| --- | --- | --- |
| ≥ 0.8 | Excellent | Broad, balanced multi-category coverage |
| 0.6–0.79 | Good | Multiple source types represented |
| 0.4–0.59 | Moderate | Some diversity, coverage skews one direction |
| < 0.4 | Limited | Narrow source base, potential coverage gap |
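
As a rough sketch, the weighting and the rating bands can be expressed as follows. The weights and band boundaries come from this section; how each factor is normalized to the 0.0–1.0 range is not documented here, so `breadth_factor` (dividing the capped source count by 10) is an assumption, and the variety and balance factors are taken as already-normalized inputs.

```python
def breadth_factor(unique_sources):
    """Assumed normalization: unique sources capped at 10, scaled to 0.0-1.0."""
    return min(unique_sources, 10) / 10


def diversity_score(breadth, variety, balance):
    """Weighted sum of three factors, each already normalized to 0.0-1.0."""
    return 0.4 * breadth + 0.4 * variety + 0.2 * balance


def diversity_rating(score):
    """Translate a diversity score into the rating labels from the table."""
    if score >= 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Good"
    if score >= 0.4:
        return "Moderate"
    return "Limited"
```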

Understanding the two scores

Layer 2 surfaces two score fields that measure different things:

| Field | Scope | What it measures |
| --- | --- | --- |
| relevance_score | Per article | Cosine similarity between that article's embedding and the cluster's average embedding at join time. The founding article of every cluster is always 1.0. Subsequent articles are scored against the evolving cluster centroid. |
| confidence_score | Per cluster | A composite cluster-quality metric: (source_diversity × 0.5) + (size_score × 0.3) + (avg_quality × 0.2). More sources and more articles push this score higher. New single-article clusters initialize at 1.0 by convention; the real formula applies once a second article joins. |

relevance_score answers "how well does this article fit its cluster?" — confidence_score answers "how trustworthy is this cluster as a unified story?"
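
The confidence_score formula can be sketched directly. This page does not define how size_score or avg_quality are derived, so the function below simply takes all three components as inputs assumed to lie in the 0.0–1.0 range; the function name is hypothetical.

```python
def confidence_score(source_diversity, size_score, avg_quality, article_count):
    """Composite cluster-quality score per the formula in the table above."""
    if article_count <= 1:
        # Single-article clusters initialize at 1.0 by convention.
        return 1.0
    return source_diversity * 0.5 + size_score * 0.3 + avg_quality * 0.2
```

One consequence worth noting: a cluster's score can drop when its second article arrives, because the 1.0 placeholder is replaced by the computed value.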


Representative article

Every cluster has a representative_article_id — set to the founding article when the cluster is created and never updated. It is always the first article that triggered the cluster's existence, not necessarily the highest-quality or most-read.

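Use the representative article as a stable anchor for display (headline, thumbnail, link preview). To rank a cluster's articles by fit, fetch /v1/story/{id}/articles and sort by relevance_score descending. Because the founding article always scores 1.0, it sorts first; skip it (is_update: false) to find the best-fitting follow-up article.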
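
A small helper for ranking by relevance might look like the sketch below. It operates on the `articles` array returned by /v1/story/{id}/articles; the function name and the `include_founder` flag are illustrative. Since the founding article always has a relevance_score of 1.0, the flag lets you exclude it when looking for the strongest follow-up coverage.

```python
def rank_by_relevance(articles, include_founder=True):
    """Sort articles by relevance_score, highest first.

    The founding article is the one with is_update == False.
    """
    pool = [a for a in articles if include_founder or a["is_update"]]
    return sorted(pool, key=lambda a: a["relevance_score"], reverse=True)
```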


Endpoints

Base URL. All Layer 2 endpoints are served from https://layer2.api.polariapi.com. For example: POST https://layer2.api.polariapi.com/v1/cluster

Cluster single article

POST /v1/cluster

Cluster a single article by ID. Useful for testing and one-off processing. For production pipelines, prefer the batch endpoint.

| Parameter | Type | Description |
| --- | --- | --- |
| article_id (required) | string | ID of a Layer 1-processed article. |

RESPONSE
{ "cluster_id": "clus_9x3k2m8f", "confidence": 0.87, "is_new": false, "already_clustered": false }

Cluster articles (batch)

POST /v1/cluster/batch

Submit multiple article IDs for clustering in a single request. Returns cluster assignments, full story metadata, and batch statistics. This is the primary endpoint called by the pipeline after Layer 1 processing.

REQUEST
{ "article_ids": ["art_abc123", "art_def456", "art_ghi789"] }
RESPONSE
{ "clusters": [ { "cluster_id": "clus_9x3k2m8f", "title": "Trump warns Iran time running out for negotiations", "article_ids": ["art_abc123", "art_def456"], "representative_id": "art_abc123", "article_count": 50, "confidence": 0.94, "is_new": false, "source_count": 3, "first_seen": "2026-02-01T11:00:59Z", "last_updated": "2026-02-28T23:00:19Z" } ], "singleton_ids": ["art_ghi789"], "stats": { "total_articles": 3, "clusters_formed": 1, "articles_clustered": 2, "singleton_articles": 1, "already_clustered": 0, "clustering_rate": 0.67 } }

| Field | Description |
| --- | --- |
| clusters[].article_ids | Article IDs from this batch assigned to this cluster, not all articles in the cluster. |
| clusters[].article_count | Total articles in the cluster across all pipeline runs. |
| clusters[].representative_id | The founding article; set at cluster creation, never changed. |
| clusters[].is_new | true if this cluster was created by this request. |
| singleton_ids | Articles that did not match any existing cluster. Each was assigned its own new cluster with is_new: true. |
| stats.already_clustered | Articles that had existing assignments; returned immediately via idempotency check, no reprocessing. |
| stats.clustering_rate | Fraction of submitted articles that joined an existing cluster (not newly created). |
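
A batch call could be assembled with Python's standard library roughly as follows. The Bearer authorization header mirrors the curl example elsewhere on this page; the Content-Type header and both function names are assumptions. The request is only built here, not sent; pass it to `urllib.request.urlopen` to execute it.

```python
import json
import urllib.request

BASE_URL = "https://layer2.api.polariapi.com"


def build_batch_request(article_ids, api_key):
    """Build (but do not send) a POST /v1/cluster/batch request."""
    body = json.dumps({"article_ids": list(article_ids)}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/cluster/batch",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",  # assumed; not stated on this page
            "Authorization": f"Bearer {api_key}",
        },
    )


def clustering_rate(stats):
    """Recompute stats.clustering_rate from the raw counts, rounded to 2 places."""
    if not stats["total_articles"]:
        return 0.0
    return round(stats["articles_clustered"] / stats["total_articles"], 2)
```

With the sample response above (2 of 3 articles clustered), `clustering_rate` reproduces the reported 0.67.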

Get story

GET /v1/story/{story_id}

Retrieve metadata for a single story cluster.

RESPONSE
{ "id": "clus_9x3k2m8f", "title": "Trump warns Iran time running out for negotiations", "article_count": 50, "source_count": 3, "confidence_score": 0.94, "first_seen": "2026-02-01T11:00:59Z", "last_updated": "2026-02-28T23:00:19Z", "status": "active", "avg_sentiment": -0.1056, "sentiment_distribution": { "positive": 0, "neutral": 44, "negative": 9 } }

List stories

GET /v1/stories

Paginate through active story clusters ordered by most recently updated.

| Parameter | Type | Description |
| --- | --- | --- |
| limit | integer | Results per page. Default: 20. Max: 500. |
| offset | integer | Pagination offset. Default: 0. |

RESPONSE
{ "stories": [ { "id": "clus_9x3k2m8f", "title": "US evacuates Beirut embassy staff as Iran war looms", "article_count": 53, "source_count": 8, "confidence_score": 0.91, "first_seen": "2026-05-01T09:14:00Z", "last_updated": "2026-05-26T14:20:00Z", "avg_sentiment": -0.1056, "sentiment_distribution": { "positive": 0, "neutral": 44, "negative": 9 } } ], "count": 20 }

| Field | Description |
| --- | --- |
| avg_sentiment | Mean sentiment score across all scored articles in the cluster. Range −1.0 to +1.0. null if no articles have been sentiment-scored yet. |
| sentiment_distribution | Raw article counts by label: positive, neutral, negative. null if no articles scored. |
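
Walking the full list with limit and offset can be sketched as a generator. To keep the example free of network code, it takes a caller-supplied `fetch_page(limit, offset)` function that returns the parsed JSON body of GET /v1/stories; that function, the generator name, and the stop condition (a page shorter than `limit` is treated as the last one) are assumptions rather than documented behavior.

```python
def iter_stories(fetch_page, limit=20):
    """Yield every story by requesting successive limit/offset pages."""
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        stories = page["stories"]
        yield from stories
        if len(stories) < limit:
            # A short (or empty) page is assumed to be the final one.
            break
        offset += limit
```

Because results are ordered by most recently updated, a story that receives new articles mid-walk may shift between pages; deduplicating on `id` is a reasonable precaution.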

Get story articles

GET /v1/story/{story_id}/articles

Returns all articles in a cluster, ordered by their position within the cluster.

RESPONSE
{ "story_id": "clus_9x3k2m8f", "count": 2, "articles": [ { "id": "art_abc123", "title": "Trump warns Iran time running out for nuclear deal", "source": "Reuters", "published_date": "2026-02-01T11:00:00Z", "relevance_score": 1.0, "position": 1, "is_update": false }, { "id": "art_def456", "title": "Iran deal 'largely negotiated', Trump says", "source": "Bloomberg", "published_date": "2026-02-14T09:30:00Z", "relevance_score": 0.87, "position": 2, "is_update": true } ] }

| Field | Description |
| --- | --- |
| relevance_score | Cosine similarity to the cluster centroid at join time. The founding article is always 1.0. |
| is_update | false for the founding article, true for all subsequent articles. Use this to distinguish original reporting from follow-up coverage. |
| position | Order in which the article joined the cluster. Position 1 is the founding article. |

Story timeline

GET /v1/story/{story_id}/timeline

Returns the chronological development of a story — when each article was added, which sources joined, velocity over time, and automatically detected milestones.

RESPONSE
{ "story_id": "clus_9x3k2m8f", "title": "Trump warns Iran time running out for negotiations", "timeline": { "first_seen": "2026-02-01T11:00:59Z", "last_updated": "2026-02-28T23:00:19Z", "total_articles": 50, "total_sources": 3, "duration_hours": 660.0 }, "events": [ { "sequence": 1, "article_id": "art_abc123", "title": "Trump warns Iran time running out for nuclear deal", "source": "Reuters", "joined_at": "2026-02-01T11:00:59Z", "is_update": false, "is_new_source": true, "relevance_score": 1.0 } ], "velocity_timeline": [ { "timestamp": "2026-02-01T11:00:59Z", "velocity": 0.5 } ], "milestones": [ { "type": "story_created", "timestamp": "2026-02-01T11:00:59Z", "description": "Story created with: Trump warns Iran time running out..." }, { "type": "coverage_expansion", "timestamp": "2026-02-08T14:22:00Z", "description": "Coverage expanded to 3 sources" } ] }

| Milestone type | Description |
| --- | --- |
| story_created | First article added; cluster initialized. |
| coverage_expansion | Coverage reached a new multiple of 3 sources. |
| activity_spike | Velocity exceeded 2× the story's average rate. |
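
The timeline summary fields can be cross-checked with a little arithmetic. The sketch below derives duration_hours from the two timestamps and applies the activity_spike rule from the table. Treating average velocity as total articles divided by duration is an assumption: it is consistent with the sample values on this page (50 articles over 660 hours gives roughly 0.076), but the service's exact averaging window is not documented here.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def duration_hours(first_seen, last_updated):
    """Hours elapsed between a story's first and latest article."""
    delta = parse_ts(last_updated) - parse_ts(first_seen)
    return delta.total_seconds() / 3600


def average_velocity(total_articles, hours):
    """Articles per hour over the story's lifetime (assumed definition)."""
    return total_articles / hours if hours > 0 else 0.0


def is_activity_spike(velocity, average):
    """True when velocity exceeds 2x the story's average rate."""
    return velocity > 2 * average
```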

Source diversity analysis

GET /v1/story/{story_id}/sources

Returns source breakdown, diversity score, and coverage gaps for a story cluster.

RESPONSE
{ "story_id": "clus_9x3k2m8f", "total_sources": 3, "diversity_score": 0.42, "diversity_rating": "moderate", "source_distribution": { "mainstream": { "count": 2, "sources": ["Reuters", "Bloomberg"], "percentage": 66.7 }, "business": { "count": 1, "sources": ["Financial Times"], "percentage": 33.3 } }, "coverage_gaps": ["tech", "policy", "analysis"], "all_sources": ["Bloomberg", "Financial Times", "Reuters"] }

Analyze story

POST /v1/story/{story_id}/analyze

Recalculates and persists enriched metadata for a story — diversity score, lifecycle state, and article velocity.

Not called automatically by the clustering pipeline. The core clustering worker only updates article_count, source_count, confidence_score, and the cluster's average embedding when articles are added. Richer metadata (lifecycle_state, source_diversity_score, article_velocity) is only written when this endpoint or POST /v1/stories/analyze-batch is called explicitly. Call this endpoint after clustering batches, or run /v1/stories/analyze-batch on a schedule.
RESPONSE
{ "story_id": "clus_9x3k2m8f", "updated": true, "metrics": { "diversity_score": 0.42, "lifecycle_state": "concluded", "velocity": 0.076, "unique_sources": 3 } }

Batch analyze

POST /v1/stories/analyze-batch

Runs analysis across all active stories, updating diversity scores, lifecycle states, and velocity metrics. Intended as a periodic background job — hourly is typical.

| Parameter | Type | Description |
| --- | --- | --- |
| limit | integer | Maximum stories to analyze per run. Default: 100. Stories are ordered by most recently updated. |

RESPONSE
{ "analyzed": 100, "updated": 100, "errors": 0, "error_details": [] }

Story relationships

GET /v1/stories/relationships

Computes pairwise entity overlap between a set of story clusters. Returns weighted edges between clusters that share meaningful named entities — used to build relationship graphs between concurrent stories.

This endpoint powers the inter-topic arc rendering in constellation visualizations. It supplements Layer 3 relationship data when direct cluster-to-cluster edges are sparse.

| Parameter | Type | Description |
| --- | --- | --- |
| ids (required) | string | Comma-separated list of cluster IDs. Min: 2. Max: 20. |
| min_shared | integer | Minimum shared entities required to return an edge. Default: 2. |

BASH
curl "https://layer2.api.polariapi.com/v1/stories/relationships?ids=clus_9x3k2m8f,clus_7d9e622d&min_shared=3" -H "Authorization: Bearer pk_live_your_key"
RESPONSE
{ "cluster_count": 2, "relationships": [ { "source": "clus_9x3k2m8f", "target": "clus_7d9e622d", "confidence": 0.158, "type": "shared_entities", "shared_count": 81, "shared_entities": [ "ali larijani", "ayatollah ali khamenei", "benjamin netanyahu" ] } ], "computed_from": "layer1_entities" }

| Field | Description |
| --- | --- |
| confidence | Jaccard similarity of the two clusters' entity sets: shared entities divided by union of all entities. Range 0.0 to 1.0. |
| shared_entities | Up to 10 normalized entity strings driving the connection. Common stop-words and boilerplate terms are filtered before scoring. |
| computed_from | Always layer1_entities; computed on demand from the JSONB entity fields on member articles, not from pre-stored relationship records. |
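
The confidence value is a standard Jaccard similarity, which the following sketch computes for two entity collections. It assumes the inputs are already normalized and stop-word filtered; the function name, the rounding to three decimals, and the alphabetical ordering of shared_entities are illustrative choices, not documented behavior.

```python
def entity_edge(entities_a, entities_b, min_shared=2):
    """Return an edge dict for two clusters' entity sets, or None if below min_shared."""
    a, b = set(entities_a), set(entities_b)
    shared = a & b
    union = a | b
    if not union or len(shared) < min_shared:
        return None
    return {
        "confidence": round(len(shared) / len(union), 3),  # Jaccard similarity
        "type": "shared_entities",
        "shared_count": len(shared),
        "shared_entities": sorted(shared)[:10],  # at most 10 are surfaced
    }
```

Note that Jaccard similarity penalizes large entity sets: two clusters can share many entities and still score low when their combined entity pool is much larger.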

Cross-briefing clustering

Layer 2 clusters persist across pipeline runs. An article processed in Monday's briefing and an article from Thursday's briefing can join the same story cluster if they meet the similarity and entity overlap thresholds.

This is the primary mechanism behind long-running story tracking. In production, clusters have accumulated 50 articles across 28 days from 8 distinct sources on a single story thread — without any manual intervention.

New articles are evaluated against the full pool of existing active clusters regardless of when those clusters were created. There is no time-based expiry on cluster candidacy.


Singletons

A singleton is a story cluster with exactly one article — an article that did not match any existing cluster at the time it was processed. Singletons are normal and expected: genuinely unique news events should not be forced into existing clusters.

Singletons become multi-article clusters when subsequent coverage of the same event arrives through later pipeline runs. A story that starts as a singleton at 9am may have 12 articles from 5 sources by end of day.

Singleton IDs are returned in the singleton_ids array of the batch clustering response. They have valid cluster IDs and can be queried like any other story.


Performance

| Operation | Typical latency |
| --- | --- |
| Health check | 1 ms |
| GET /v1/story/{id} | 3 ms |
| GET /v1/story/{id}/articles | 3 ms |
| GET /v1/story/{id}/timeline | 3 ms |
| POST /v1/story/{id}/analyze | 7 ms |
| POST /v1/stories/analyze-batch (100 stories) | 495 ms |
| POST /v1/cluster/batch, already-clustered articles | <2 ms per article |
| POST /v1/cluster/batch, new articles | 50–270 ms per article |

Clustering cost for new articles scales with candidate pool size — each article's embedding is compared against all active story clusters. At current scale (~8,000 active stories) this remains well within latency targets. Already-clustered articles short-circuit immediately and add negligible overhead to batch calls.


Error responses

| Status | Cause |
| --- | --- |
| 404 | Story or article ID not found. |
| 422 | Missing or malformed request body, e.g. article_ids array absent from batch request. |
| 500 | Internal error. Most commonly caused by submitting an article that has not completed Layer 0 and Layer 1 processing; without a Layer 0 embedding it cannot be clustered. |

ERROR
{ "detail": "Article art_abc123 not found or missing embedding" }
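
Client code may want to separate errors worth fixing upstream from ones worth reporting. The sketch below is one hypothetical way to do that: the category names are invented for illustration, and matching on the phrase "missing embedding" relies on the sample error body above rather than on a documented contract.

```python
def classify_error(status, body):
    """Map an error response to a coarse, client-side category (names are hypothetical)."""
    detail = body.get("detail", "") if isinstance(body, dict) else ""
    if status == 404:
        return "not_found"
    if status == 422:
        return "invalid_request"
    if status == 500 and "missing embedding" in detail:
        # Likely an article that skipped upstream processing; retrying will not help.
        return "needs_upstream_processing"
    if status == 500:
        return "internal_error"
    return "unexpected"
```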

Clustering parameters

| Parameter | Default | Description |
| --- | --- | --- |
| Similarity threshold | 0.55 | Minimum cosine similarity for a cluster to enter the candidate pool. Lower values produce more aggressive clustering; higher values are stricter. |
| Entity overlap threshold | 1 | Minimum shared named entities required to confirm a candidate match. Only applied to candidates scoring below 0.65. |
| Max cluster size | 50 | Maximum articles per cluster. Clusters at capacity are excluded from the candidate search entirely. New articles that would have matched a full cluster will join a different qualifying cluster or create a new one; no error is returned. |

These parameters are set at the platform level and are not configurable per API key. Contact hello@polariapi.com if your use case requires custom thresholds.