The Pipeline

Layer 1 — Semantic Analysis

Layer 1 extracts structured meaning from every article: named entities, sentence-level embeddings, locations, and sentiment. Results are cached by article_id, so repeat calls for the same article return in under 1 ms.

What Layer 1 does

Layer 1 runs six operations on every article, in order:

Sentence segmentation — splits article body into individual sentences for fine-grained analysis.
Named entity recognition — extracts people, organizations, geopolitical entities, locations, and events using spaCy. Date and monetary-value extraction is on the roadmap.
Sentence embeddings — generates a 384-dimensional vector per sentence using all-MiniLM-L6-v2, enabling sentence-level semantic search.
Article embedding — produces a single 384-dimensional article-level vector aggregated from sentence embeddings.
Geographic and temporal tagging — extracts location mentions and date references for geo and timeline queries.
Sentiment scoring — scores each article on a −1.0 to +1.0 scale using cardiffnlp/twitter-roberta-base-sentiment-latest, a RoBERTa-base model pretrained on roughly 124M tweets and fine-tuned for sentiment analysis on the TweetEval benchmark. Returns a signed float and a positive / neutral / negative label.
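
The list above says the article embedding is aggregated from sentence embeddings but does not say how. As a minimal sketch, assuming plain mean pooling followed by L2 normalization (an assumption, not confirmed Layer 1 behavior), the aggregation could look like this. `mean_pool` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def mean_pool(sentence_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average sentence vectors element-wise, then L2-normalize the result.

    Assumption: mean pooling; the actual Layer 1 aggregation is not documented.
    """
    if not sentence_vectors:
        raise ValueError("need at least one sentence vector")
    dim = len(sentence_vectors[0])
    if any(len(v) != dim for v in sentence_vectors):
        raise ValueError("all sentence vectors must share one dimensionality")
    count = len(sentence_vectors)
    mean = [sum(v[i] for v in sentence_vectors) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero vector untouched rather than dividing by zero.
    return [x / norm for x in mean] if norm > 0 else mean
```

With real Layer 1 output, each input vector would have 384 components and so would the result.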

Entity types

Layer 1 currently extracts five entity types, all available as search filters. DATE and MONEY extraction is on the roadmap.

Type	Description	Example
`PERSON`	Named individuals	Jerome Powell, Andy Jassy
`ORG`	Companies, agencies, institutions	Federal Reserve, AWS, Tesla
`GPE`	Geopolitical entities — countries, cities, states	United States, Beijing, Texas
`LOC`	Non-political locations	Pacific Ocean, Strait of Hormuz
`EVENT`	Named events	G7 Summit, Super Bowl
`DATE`	Temporal references — coming soon	last Tuesday, Q3 2026
`MONEY`	Monetary values — coming soon	$4.2 billion, €500M

Known limitation. spaCy occasionally tags a well-known product name as the entity rather than the parent organization — for example, "Falcon 9" (ORG) instead of "SpaceX". This is expected behavior for en_core_web_sm and affects a small fraction of extractions. Overall entity recall is 93.8% on benchmark test cases.
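
One way a client might work around this limitation is to map known product names back to their parent organization after extraction. The sketch below is a client-side assumption, not part of the API. The alias table `PRODUCT_TO_ORG` and the helper `normalize_orgs` are hypothetical, and the table would need to be maintained by hand.

```python
from typing import Dict, List

# Hypothetical alias table: product name -> parent organization.
PRODUCT_TO_ORG: Dict[str, str] = {"Falcon 9": "SpaceX"}


def normalize_orgs(entities: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a copy of `entities` with known product names under ORG
    replaced by their parent organization, keeping order and uniqueness."""
    result = {etype: list(names) for etype, names in entities.items()}
    if "ORG" not in result:
        return result
    seen = set()
    normalized = []
    for name in result["ORG"]:
        canonical = PRODUCT_TO_ORG.get(name, name)
        if canonical not in seen:
            seen.add(canonical)
            normalized.append(canonical)
    result["ORG"] = normalized
    return result
```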

Performance

Metric	Value
Single article p50 (short, ~150 chars)	52ms
Single article p50 (medium, ~500 chars)	94ms
Single article p50 (long, ~1200 chars)	147ms
Batch throughput (batch size 25)	18.5 articles/sec
Concurrent throughput (4 workers, warm)	20.1 articles/sec
Entity extraction recall	93.8%
Cache hit latency	<1ms

Cache behavior. Results are stored in PostgreSQL after first processing. Subsequent calls for the same article_id return the cached result in under 1 ms, with processing_time_ms: 0. The sentences array is omitted from cached responses. This is a known limitation: sentence data is stored in ChromaDB and will be returned from cache in a future update.
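
Because cached responses omit the sentences array, client code should not assume it is present. A minimal sketch of defensive handling, using `processing_time_ms == 0` as the cache-hit signal described above (the helper names `is_cache_hit` and `get_sentences` are hypothetical):

```python
from typing import Any, Dict, List


def is_cache_hit(response: Dict[str, Any]) -> bool:
    """Treat stats.processing_time_ms == 0 as the cache-hit signal."""
    return response.get("stats", {}).get("processing_time_ms") == 0


def get_sentences(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the sentences array, or an empty list when it is omitted."""
    return response.get("sentences") or []
```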

Endpoints

Base URL. All Layer 1 endpoints are served from https://layer1.api.polariapi.com. For example: POST https://layer1.api.polariapi.com/v1/process

Process article

POST /v1/process

Process a single article through the Layer 1 pipeline. Returns entities, locations, sentiment, sentence embeddings, and an article-level embedding. Results are cached by article_id.

Field	Type	Description
article_id (required)	string	Unique identifier — used for caching and DB storage. Should match the `article_id` from Layer 0.
title	string	Article headline — included in entity extraction.
content (required)	string	Article body text.
url	string	Article URL — stored for reference.
published_date	ISO 8601	Original publication datetime — used for timeline queries.
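
As a rough sketch, the request body can be assembled and validated with the Python standard library before sending. The helper `build_process_request` is a hypothetical name. Authentication headers are left out because this page does not describe them, so treat that part as unknown.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://layer1.api.polariapi.com"


def build_process_request(
    article_id: str,
    content: str,
    title: Optional[str] = None,
    url: Optional[str] = None,
    published_date: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/process request."""
    if not article_id or not content:
        raise ValueError("article_id and content are required")
    payload = {"article_id": article_id, "content": content}
    optional = {"title": title, "url": url, "published_date": published_date}
    # Only include optional fields that were actually provided.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return urllib.request.Request(
        f"{BASE_URL}/v1/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be run without network access.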

REQUEST

{
  "article_id": "art_8f7h2k9s",
  "title": "Fed Holds Rates Steady",
  "content": "The Federal Reserve held interest rates steady on Wednesday, with Chair Jerome Powell signaling patience...",
  "url": "https://reuters.com/fed-rates-2026",
  "published_date": "2026-04-29T12:00:00"
}

RESPONSE
{
  "success": true,
  "article_id": "art_8f7h2k9s",
  "processed_at": "2026-04-29T12:00:01Z",
  "stats": {
    "sentence_count": 14,
    "entity_count": 8,
    "location_count": 2,
    "embedding_dim": 384,
    "processing_time_ms": 94.3
  },
  "entities": {
    "PERSON": ["Jerome Powell"],
    "ORG": ["Federal Reserve", "FOMC"],
    "GPE": ["United States"]
  },
  "locations": ["United States", "Washington"],
  "sentiment_score": -0.6841,
  "sentiment_label": "negative",
  "article_embedding": [0.023, -0.114, 0.087, /* 384 floats */],
  "semantic_hash": "a3f8c2e1...",
  "sentences": [
    {
      "text": "The Federal Reserve held interest rates steady on Wednesday...",
      "embedding": [0.031, /* 384 floats */]
    }
  ]
}

Field	Description
stats.sentence_count	Number of sentences extracted from the article.
stats.entity_count	Total named entities extracted across all types.
stats.embedding_dim	Embedding dimensionality — always 384 for Layer 1.
stats.processing_time_ms	Server-side processing time. Returns `0` on cache hit.
entities	Dict keyed by entity type. Each value is an array of unique entity strings found in the article.
locations	Deduplicated list of GPE and LOC entities — convenience field for geographic queries.
article_embedding	384-dimensional float array representing the full article semantically.
sentences	Array of sentence objects, each with `text` and a 384-dim `embedding`. Omitted on cache hits.
sentiment_score	Signed float in `[−1.0, +1.0]`. Negative values indicate negative sentiment, positive values indicate positive sentiment. Magnitude reflects model confidence.
sentiment_label	One of `positive`, `neutral`, or `negative`. Derived from the RoBERTa classifier output label.
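
The per-sentence embeddings can be used for sentence-level search on the client. The sketch below ranks the `sentences` of one response by cosine similarity to a query vector. It assumes the query vector was produced by the same model (all-MiniLM-L6-v2) so the two live in the same space; producing it is outside this example. `cosine` and `rank_sentences` are hypothetical helper names.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(
    response: Dict[str, Any], query_vector: Sequence[float], top_k: int = 3
) -> List[Tuple[float, str]]:
    """Return up to top_k (score, text) pairs, most similar first.

    Returns an empty list when `sentences` is omitted, as on cache hits.
    """
    sentences = response.get("sentences") or []
    scored = [(cosine(s["embedding"], query_vector), s["text"]) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```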

Process batch

POST /v1/process/batch

Submit multiple articles in a single request. Articles are processed in parallel using a thread pool — throughput scales with batch size, reaching 18.5 articles/sec at batch size 25. Cached articles return instantly and do not consume processing capacity.

REQUEST

{
  "articles": [
    {
      "article_id": "art_8f7h2k9s",
      "title": "Fed Holds Rates Steady",
      "content": "The Federal Reserve held interest rates..."
    },
    {
      "article_id": "art_3k9x2m7f",
      "title": "Powell Signals Patience on Cuts",
      "content": "Federal Reserve Chair Jerome Powell said..."
    }
  ]
}

RESPONSE
{
  "success": true,
  "total": 2,
  "results": [
    { /* full ProcessArticleResponse for each article */ }
  ]
}
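
For larger backlogs, a client might split its articles into fixed-size batches before calling the batch endpoint. The sketch below yields one JSON body per batch. The default of 25 simply mirrors the benchmark figure quoted above; this page does not state a maximum batch size, so treat the value as an assumption. `batch_payloads` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Iterator, List


def batch_payloads(articles: List[Dict[str, Any]], batch_size: int = 25) -> Iterator[str]:
    """Yield JSON bodies for POST /v1/process/batch, batch_size articles at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(articles), batch_size):
        yield json.dumps({"articles": articles[start:start + batch_size]})
```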
          

Search entities

GET /v1/entities

Search and aggregate named entities across all Layer 1 processed articles. Returns entities ranked by mention count.

Parameter	Type	Description
query	string	Partial name match, case-insensitive. For example, `powell` matches "Jerome Powell".
type	enum	Filter by entity type: `PERSON`, `ORG`, `GPE`, `LOC`, `EVENT`, `DATE`, `MONEY`. `DATE` and `MONEY` are not yet extracted (see Entity types).
min_mentions	integer	Only return entities seen at least N times. Default: `1`.
time_range	enum	Limit to articles within window: `1h`, `6h`, `24h`, `7d`, `30d`.
limit	integer	Max results. Default: `20`. Max: `100`.
offset	integer	Pagination offset. Default: `0`.
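
A small sketch of building the query string from the parameters in the table above, with client-side checks for the documented enums and the `limit` ceiling. `entities_url` is a hypothetical helper name, and `entity_type` is used as the Python argument name only to avoid shadowing the built-in `type`.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://layer1.api.polariapi.com"
ENTITY_TYPES = {"PERSON", "ORG", "GPE", "LOC", "EVENT", "DATE", "MONEY"}
TIME_RANGES = {"1h", "6h", "24h", "7d", "30d"}


def entities_url(
    query: Optional[str] = None,
    entity_type: Optional[str] = None,
    min_mentions: int = 1,
    time_range: Optional[str] = None,
    limit: int = 20,
    offset: int = 0,
) -> str:
    """Build the GET /v1/entities URL, validating documented constraints."""
    if entity_type is not None and entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type}")
    if time_range is not None and time_range not in TIME_RANGES:
        raise ValueError(f"unknown time_range: {time_range}")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    if offset < 0 or min_mentions < 1:
        raise ValueError("offset must be >= 0 and min_mentions >= 1")
    params = {
        "query": query,
        "type": entity_type,
        "min_mentions": min_mentions,
        "time_range": time_range,
        "limit": limit,
        "offset": offset,
    }
    # Drop parameters that were not supplied; urlencode handles escaping.
    supplied = {k: v for k, v in params.items() if v is not None}
    return f"{BASE_URL}/v1/entities?{urlencode(supplied)}"
```

Paging through results is then a matter of increasing `offset` by `limit` on each call.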

RESPONSE
{
  "entities": [
    {
      "name": "Federal Reserve",
      "type": "ORG",
      "mention_count": 89
    },
    {
      "name": "Jerome Powell",
      "type": "PERSON",
      "mention_count": 64
    }
  ],
  "total": 2,
  "query": "powell",
  "type_filter": null,
  "time_range": "24h"
}

Entity timeline

GET /v1/entities/{entity_name}/timeline

Daily mention counts for a named entity over a specified window. Useful for detecting spikes and tracking narrative evolution.

Parameter	Type	Description
entity_name (path)	string	Exact entity name, case-insensitive. e.g. `Federal Reserve`.
type	enum	Optionally scope to a specific entity type. Useful when the same name appears as multiple types.
time_range	enum	`1h`, `6h`, `24h`, `7d`, `30d`. Default: `7d`.

RESPONSE
{
  "entity": "Federal Reserve",
  "type_filter": null,
  "time_range": "7d",
  "total_mentions": 312,
  "timeline": [
    { "date": "2026-04-23", "mention_count": 34 },
    { "date": "2026-04-24", "mention_count": 41 },
    { "date": "2026-04-29", "mention_count": 89 }
  ]
}
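
Spike detection over the `timeline` array can be done on the client. The sketch below flags any day whose count exceeds a multiple of the average of the other days. The factor of 2.0 is an arbitrary illustrative threshold, not an API setting, and `timeline_path` and `find_spikes` are hypothetical helper names. Entity names containing spaces need percent-encoding in the path, which `urllib.parse.quote` handles.

```python
from statistics import mean
from typing import Any, Dict, List
from urllib.parse import quote


def timeline_path(entity_name: str) -> str:
    """Percent-encode the entity name for use as a path segment."""
    return f"/v1/entities/{quote(entity_name, safe='')}/timeline"


def find_spikes(timeline: List[Dict[str, Any]], factor: float = 2.0) -> List[str]:
    """Return dates whose mention_count exceeds factor times the mean of the other days."""
    spikes = []
    for i, day in enumerate(timeline):
        others = [d["mention_count"] for j, d in enumerate(timeline) if j != i]
        if others and day["mention_count"] > factor * mean(others):
            spikes.append(day["date"])
    return spikes
```

Applied to the sample response above, only 2026-04-29 is flagged: 89 mentions against an average of 37.5 on the other two days.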
          

Entity sentiment

GET /v1/entities/{entity_name}/sentiment

Returns daily average sentiment for all articles mentioning a named entity over the specified window. Useful for tracking how coverage tone shifts around a person, organization, or location over time.

Parameter	Type	Description
entity_name (path)	string	Exact entity name, case-insensitive.
type	enum	Optionally scope to a specific entity type.
time_range	enum	`1h`, `6h`, `24h`, `7d`, `30d`. Default: `7d`.

RESPONSE
{
  "entity": "Federal Reserve",
  "type_filter": null,
  "time_range": "7d",
  "avg_sentiment": -0.312,
  "total_articles": 89,
  "timeline": [
    {
      "date": "2026-04-23",
      "avg_sentiment": -0.201,
      "article_count": 12,
      "distribution": {
        "positive": 2,
        "neutral": 5,
        "negative": 5
      }
    }
  ]
}
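
To summarize tone over a window, a client could combine the daily entries itself. The sketch below computes an article-weighted mean of the daily `avg_sentiment` values and the label shares for a single day. Whether the API's top-level `avg_sentiment` is weighted this way is not stated on this page, so this is only one plausible reading. `weighted_sentiment` and `label_shares` are hypothetical helper names.

```python
from typing import Any, Dict, List


def weighted_sentiment(timeline: List[Dict[str, Any]]) -> float:
    """Mean of daily avg_sentiment values, weighted by each day's article_count."""
    total = sum(day["article_count"] for day in timeline)
    if total == 0:
        return 0.0
    return sum(day["avg_sentiment"] * day["article_count"] for day in timeline) / total


def label_shares(day: Dict[str, Any]) -> Dict[str, float]:
    """Convert a day's label counts into fractions that sum to 1 (or all 0.0 if empty)."""
    dist = day["distribution"]
    count = sum(dist.values())
    return {label: (n / count if count else 0.0) for label, n in dist.items()}
```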
          
