Layer 1 — Semantic Analysis
Layer 1 takes every article that passes the Layer 0 quality gate and extracts structured meaning: named entities, sentence-level embeddings, locations, and temporal markers. Results are cached — repeat calls return instantly.
What Layer 1 does
Layer 1 runs six operations on every article, in order:
- Sentence segmentation: splits the article body into individual sentences for fine-grained analysis.
- Named entity recognition: extracts people, organizations, locations, events, dates, and monetary values using spaCy.
- Sentence embeddings: generates a 256-dimensional vector per sentence using all-MiniLM-L6-v2, enabling sentence-level semantic search.
- Article embedding: produces a single 256-dimensional article-level vector aggregated from sentence embeddings.
- Geographic and temporal tagging: extracts location mentions and date references for geo and timeline queries.
- Sentiment scoring: scores each article on a −1.0 to +1.0 scale using cardiffnlp/twitter-roberta-base-sentiment-latest, a RoBERTa model fine-tuned on 130M tweets. Returns a signed float and a positive/neutral/negative label.
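The aggregation step that turns sentence embeddings into one article embedding is not specified here. A minimal sketch, assuming mean pooling followed by L2 normalization (a common choice, not confirmed as the one Layer 1 uses); the function name `aggregate_embeddings` is hypothetical:

```python
import math
from typing import List


def aggregate_embeddings(sentence_vectors: List[List[float]]) -> List[float]:
    """Mean-pool sentence vectors, then L2-normalize the result.

    Assumption: Layer 1's real aggregation method is not documented;
    mean pooling is used here only as an illustration.
    """
    if not sentence_vectors:
        raise ValueError("need at least one sentence vector")
    dim = len(sentence_vectors[0])
    if any(len(v) != dim for v in sentence_vectors):
        raise ValueError("all sentence vectors must share one dimensionality")

    count = len(sentence_vectors)
    mean = [sum(v[i] for v in sentence_vectors) / count for i in range(dim)]

    # Normalize to unit length so cosine similarity reduces to a dot product.
    norm = math.sqrt(sum(x * x for x in mean))
    return mean if norm == 0.0 else [x / norm for x in mean]


# Toy 2-dimensional vectors stand in for the 256-dimensional ones.
article_vector = aggregate_embeddings([[1.0, 0.0], [0.0, 1.0]])
```

With real data each inner list would hold 256 floats, one list per sentence.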
Entity types
Layer 1 extracts seven entity types. Five are available as search filters; two (DATE, MONEY) are extracted and stored but not currently filterable via the entities endpoint.
| Type | Description | Example |
|---|---|---|
| PERSON | Named individuals | Jerome Powell, Andy Jassy |
| ORG | Companies, agencies, institutions | Federal Reserve, AWS, Tesla |
| GPE | Geopolitical entities: countries, cities, states | United States, Beijing, Texas |
| LOC | Non-political locations | Pacific Ocean, Strait of Hormuz |
| EVENT | Named events | G7 Summit, Super Bowl |
| DATE | Temporal references (extracted, not filterable) | last Tuesday, Q3 2026 |
| MONEY | Monetary values (extracted, not filterable) | $4.2 billion, €500M |
en_core_web_sm and affects a small fraction of extractions. Overall entity recall is
93.8% on benchmark test cases.
Performance
| Metric | Value |
|---|---|
| Single article p50 (short, ~150 chars) | 52ms |
| Single article p50 (medium, ~500 chars) | 94ms |
| Single article p50 (long, ~1200 chars) | 147ms |
| Batch throughput (batch size 25) | 18.5 articles/sec |
| Concurrent throughput (4 workers, warm) | 20.1 articles/sec |
| Entity extraction recall | 93.8% |
| Cache hit latency | <1ms |
Repeat calls with the same article_id return cached results instantly with processing_time_ms: 0. The sentences array is omitted from cached responses. This is a known limitation: sentence data is stored in ChromaDB and will be returned from cache in a future update.
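Because cached responses omit the sentences array, a client that needs sentence data may want to keep its own copy from the first response. A minimal sketch, assuming the response is a JSON object whose dotted field names (such as stats.processing_time_ms) map to nested keys; the class and function names are hypothetical:

```python
from typing import Any, Dict, List


def is_cache_hit(response: Dict[str, Any]) -> bool:
    """A cache hit is reported as processing_time_ms == 0."""
    return response.get("stats", {}).get("processing_time_ms") == 0


class SentenceStore:
    """Client-side memory of sentences, keyed by article_id (hypothetical helper)."""

    def __init__(self) -> None:
        self._by_article: Dict[str, List[Dict[str, Any]]] = {}

    def resolve(self, article_id: str, response: Dict[str, Any]) -> List[Dict[str, Any]]:
        sentences = response.get("sentences")
        if sentences is not None:
            # Fresh response: remember the sentences for later cache hits.
            self._by_article[article_id] = sentences
            return sentences
        # Cached response: fall back to what was stored earlier, if anything.
        return self._by_article.get(article_id, [])


store = SentenceStore()
fresh = {"stats": {"processing_time_ms": 94},
         "sentences": [{"text": "Rates held steady.", "embedding": [0.1, 0.2]}]}
cached = {"stats": {"processing_time_ms": 0}}

first = store.resolve("a-1", fresh)
second = store.resolve("a-1", cached)
```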
Endpoints
All endpoints share the base URL https://layer1.api.polariapi.com. For example:
POST https://layer1.api.polariapi.com/v1/process
Process article
Process a single article through the Layer 1 pipeline. Returns entities, sentence embeddings, locations,
and an article-level embedding. Results are cached by article_id.
| Field | Type | Description |
|---|---|---|
| article_id (required) | string | Unique identifier, used for caching and DB storage. Should match the article_id from Layer 0. |
| title | string | Article headline, included in entity extraction. |
| content (required) | string | Article body text. |
| url | string | Article URL, stored for reference. |
| published_date | ISO 8601 | Original publication datetime, used for timeline queries. |
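
The request fields above can be assembled as follows. This is a sketch using only the Python standard library; it assumes the body is sent as JSON, and it leaves out authentication because this section does not describe any. The helper name `build_process_request` is hypothetical, and the request is only built here, not sent:

```python
import json
import urllib.request
from typing import Optional

PROCESS_URL = "https://layer1.api.polariapi.com/v1/process"


def build_process_request(
    article_id: str,
    content: str,
    title: Optional[str] = None,
    url: Optional[str] = None,
    published_date: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a single article."""
    if not article_id or not content:
        raise ValueError("article_id and content are required")

    payload = {"article_id": article_id, "content": content}
    optional = {"title": title, "url": url, "published_date": published_date}
    # Only include optional fields that were actually provided.
    payload.update({k: v for k, v in optional.items() if v is not None})

    return urllib.request.Request(
        PROCESS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


request = build_process_request(
    "a-1",
    "The Federal Reserve held rates steady.",
    title="Fed holds",
    published_date="2026-03-01T09:00:00+00:00",
)
```

Passing `request` to `urllib.request.urlopen` would send it; any required credentials would need to be added to the headers first.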

Response fields:
| Field | Description |
|---|---|
| stats.sentence_count | Number of sentences extracted from the article. |
| stats.entity_count | Total named entities extracted across all types. |
| stats.embedding_dim | Embedding dimensionality, always 256 for Layer 1. |
| stats.processing_time_ms | Server-side processing time. Returns 0 on cache hit. |
| entities | Dict keyed by entity type. Each value is an array of unique entity strings found in the article. |
| locations | Deduplicated list of GPE and LOC entities; a convenience field for geographic queries. |
| article_embedding | 256-dimensional float array representing the full article semantically. |
| sentences | Array of sentence objects, each with text and a 256-dim embedding. Omitted on cache hits. |
| sentiment_score | Signed float in [−1.0, +1.0]. Negative values indicate negative sentiment, positive values indicate positive sentiment. Magnitude reflects model confidence. |
| sentiment_label | One of positive, neutral, or negative. Derived from the RoBERTa classifier output label. |
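
One use of the sentences array is sentence-level semantic search on the client side. The sketch below ranks sentences by cosine similarity to a query vector. It assumes each sentence object exposes its fields under the keys `text` and `embedding`, and that the query vector was produced by the same embedding model; both are assumptions, and the function names are hypothetical:

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(
    query_vector: Sequence[float],
    sentences: List[Dict[str, Any]],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Return up to top_k (similarity, text) pairs, most similar first."""
    scored = [(cosine(query_vector, s["embedding"]), s["text"]) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional embeddings stand in for the 256-dimensional ones.
sample_sentences = [
    {"text": "Markets rallied.", "embedding": [0.0, 1.0]},
    {"text": "Rates held steady.", "embedding": [1.0, 0.0]},
]
ranked = rank_sentences([0.9, 0.1], sample_sentences, top_k=1)
```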
Process batch
Submit multiple articles in a single request. Articles are processed in parallel using a thread pool — throughput scales with batch size, reaching 18.5 articles/sec at batch size 25. Cached articles return instantly and do not consume processing capacity.
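A client with a large backlog might split it into fixed-size batches before submitting. The sketch below uses 25 per batch because that is the benchmarked size, not a documented limit, and estimates processing time from the 18.5 articles/sec figure. The helper names are hypothetical, and the batch endpoint itself is not called:

```python
from typing import Any, Dict, Iterator, List


def chunked(articles: List[Dict[str, Any]], batch_size: int = 25) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most batch_size articles."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(articles), batch_size):
        yield articles[start:start + batch_size]


def estimate_seconds(uncached_articles: int, articles_per_sec: float = 18.5) -> float:
    """Rough processing-time estimate; cached articles are excluded by the caller."""
    return uncached_articles / articles_per_sec


backlog = [{"article_id": f"a-{i}", "content": "..."} for i in range(60)]
batch_sizes = [len(batch) for batch in chunked(backlog)]
estimated = estimate_seconds(len(backlog))
```

The estimate is only a rough guide, since measured throughput depends on article length and on how many articles are already cached.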
Search entities
Search and aggregate named entities across all Layer 1 processed articles. Returns entities ranked by mention count.
| Parameter | Type | Description |
|---|---|---|
| query | string | Partial name match, case-insensitive. e.g. powell matches "Jerome Powell". |
| type | enum | Filter by entity type: PERSON, ORG, GPE, LOC, EVENT. DATE and MONEY are extracted and stored but not currently filterable. |
| min_mentions | integer | Only return entities seen at least N times. Default: 1. |
| time_range | enum | Limit to articles within a time window: 1h, 6h, 24h, 7d, 30d. |
| limit | integer | Max results. Default: 20. Max: 100. |
| offset | integer | Pagination offset. Default: 0. |
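
The parameters above can be validated on the client before a request is made. The sketch below builds a URL query string and assumes the parameters travel as query parameters; the endpoint path is not shown in this section, so none is included. It restricts type to the five entity types that the Entity types section describes as filterable, and the function name is hypothetical:

```python
from typing import Optional
from urllib.parse import urlencode

FILTERABLE_TYPES = {"PERSON", "ORG", "GPE", "LOC", "EVENT"}
TIME_RANGES = {"1h", "6h", "24h", "7d", "30d"}


def build_entity_search_query(
    query: Optional[str] = None,
    entity_type: Optional[str] = None,
    min_mentions: int = 1,
    time_range: Optional[str] = None,
    limit: int = 20,
    offset: int = 0,
) -> str:
    """Validate search parameters and return them as a URL query string."""
    if entity_type is not None and entity_type not in FILTERABLE_TYPES:
        raise ValueError(f"type must be one of {sorted(FILTERABLE_TYPES)}")
    if time_range is not None and time_range not in TIME_RANGES:
        raise ValueError(f"time_range must be one of {sorted(TIME_RANGES)}")
    if min_mentions < 1:
        raise ValueError("min_mentions must be at least 1")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    if offset < 0:
        raise ValueError("offset must not be negative")

    params = {}
    if query:
        params["query"] = query
    if entity_type:
        params["type"] = entity_type
    params["min_mentions"] = min_mentions
    if time_range:
        params["time_range"] = time_range
    params["limit"] = limit
    params["offset"] = offset
    return urlencode(params)


query_string = build_entity_search_query(query="powell", entity_type="PERSON", limit=50)
```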
Entity timeline
Daily mention counts for a named entity over a specified window. Useful for detecting spikes and tracking narrative evolution.
| Parameter | Type | Description |
|---|---|---|
| entity_name (path) | string | Exact entity name, case-insensitive. e.g. Federal Reserve. |
| type | enum | Optionally scope to a specific entity type. Useful when the same name appears as multiple types. |
| time_range | enum | 1h, 6h, 24h, 7d, 30d. Default: 7d. |
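
Spike detection on the daily counts can be done on the client side. The response shape of this endpoint is not documented in this section, so the sketch below takes a plain chronological list of daily mention counts. It flags a day when its count exceeds the mean of all earlier days by more than `threshold` standard deviations. The function name, the threshold, and the minimum history length are illustrative choices, not part of the API:

```python
from statistics import mean, pstdev
from typing import List


def find_spikes(daily_counts: List[int], threshold: float = 2.0, min_history: int = 3) -> List[int]:
    """Return indices of days that stand out against all preceding days."""
    spikes = []
    for i in range(min_history, len(daily_counts)):
        history = daily_counts[:i]
        baseline = mean(history)
        # Floor the spread at 1.0 so a flat history does not flag tiny changes.
        spread = max(pstdev(history), 1.0)
        if daily_counts[i] > baseline + threshold * spread:
            spikes.append(i)
    return spikes


# Seven days of mention counts, oldest first; day index 5 is the outlier.
counts = [4, 5, 4, 6, 5, 21, 7]
spike_days = find_spikes(counts)
```

Because the baseline includes every earlier day, one large spike raises the bar for the days that follow it; a trailing window would behave differently.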
Entity sentiment
Returns daily average sentiment for all articles mentioning a named entity over the specified window. Useful for tracking how coverage tone shifts around a person, organization, or location over time.
| Parameter | Type | Description |
|---|---|---|
| entity_name (path) | string | Exact entity name, case-insensitive. |
| type | enum | Optionally scope to a specific entity type. |
| time_range | enum | 1h, 6h, 24h, 7d, 30d. Default: 7d. |
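
To make the aggregation concrete, the sketch below computes a daily average from per-article sentiment scores. It illustrates the idea only: the server-side implementation and the response shape are not documented here. The input is assumed to be pairs of an ISO 8601 publication timestamp and a sentiment_score, and the function name is hypothetical:

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def daily_average_sentiment(articles: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (published_date, sentiment_score) pairs by calendar day and average them."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for published_date, score in articles:
        # The day is taken in the timestamp's own UTC offset (no conversion).
        day = datetime.fromisoformat(published_date).date().isoformat()
        buckets[day].append(score)
    return {day: sum(scores) / len(scores) for day, scores in sorted(buckets.items())}


mentions = [
    ("2026-03-01T09:00:00+00:00", 0.4),
    ("2026-03-01T15:30:00+00:00", -0.2),
    ("2026-03-02T08:00:00+00:00", -0.6),
]
averages = daily_average_sentiment(mentions)
```

A drop from one day's average to the next, as in this toy data, is the kind of tone shift the endpoint is meant to surface.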