Case Study

TOPZLE — The fastest visual Wikipedia list engine

TOPZLE turns Wikipedia tables/lists into clean, fast, searchable pages with auto-charts, timelines, choropleths, and ranked lists — without writing custom chart logic per page. The hard part isn’t rendering one page — it’s keeping thousands of pages fresh, deterministic, and stable at scale.

Astro + SSR Worker · Auto-charts · Parsing & inference · SEO scale · Deterministic builds

Lists → Collages → Charts (automatically)

Wikipedia tables are messy: nested headers, mixed units, year ranges, currencies, and inconsistent naming. TOPZLE normalizes the data, infers chart intent, and renders the best visualization per page — fast, stable, and SEO-ready.

1) Problem statement

Most “list sites” fail in one of two ways: either they hardcode pages manually, or their ingestion becomes inconsistent, expensive, and impossible to reason about over time. TOPZLE’s goal is a repeatable publishing system: scrape/normalize structured data and render it using reusable chart primitives — not one-off chart code per page.

2) Data sources & ingestion

TOPZLE is primarily built on Wikipedia list & table data (structured HTML tables and list pages). The platform’s ingestion layer extracts titles, slugs, table headers, rows, thumbnail/collage images (when present), and link metadata, then converts the result into a stable internal model for rendering.

A. Source signals kept for trust

  • Provenance: preserve where a value came from and when it was last refreshed.
  • Stable identity: stable IDs per entity/page/row so re-fetching doesn’t drift.
  • Schema evolution: allow new metrics/columns without breaking older pages.

The product assumption is simple: “tables are a dataset, not a screenshot”. Once you treat Wikipedia tables as datasets, correctness, determinism, and normalization become the core engineering work.
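
A minimal sketch of what that internal model could look like, assuming a TypeScript ingestion layer; the field and type names below are illustrative, not TOPZLE’s actual schema:

```ts
// Illustrative normalized-dataset model; names are assumptions, not TOPZLE's schema.
interface Provenance {
  sourceUrl: string;      // Wikipedia page the table came from
  sectionAnchor?: string; // heading anchor of the table, if known
  fetchedAt: string;      // ISO timestamp of the last refresh
}

interface ColumnDef {
  id: string;             // stable column ID, survives header renames
  label: string;          // flattened breadcrumb label, e.g. “Parent — Child”
  kind: "name" | "year" | "numeric" | "other";
}

interface DatasetRow {
  rowId: string;                        // stable per-row identity across re-fetches
  cells: Record<string, string | null>; // keyed by ColumnDef.id
}

interface NormalizedDataset {
  pageSlug: string;
  title: string;
  columns: ColumnDef[];   // new metrics append here without breaking older pages
  rows: DatasetRow[];
  provenance: Provenance;
}
```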

3) How it works: end-to-end SSR (Request → Response)

TOPZLE ships static assets plus an SSR Worker that renders either the homepage/search grid or a dynamic wiki page. SSR gives fast first paint + SEO-grade HTML, while still allowing charts to progressively enhance in the browser.

Request flow (SSR): a user visits / or /{lang}/{slug} → SSR Worker (Astro SSR runtime) → page renderer (index or dynamic slug) → API fetch (search / pinned / page data) → HTML response (SEO-ready markup, fast first paint).

A. Homepage vs Search vs Page

  • Home mode: shows pinned collages + trending buckets (deduped).
  • Search mode: renders grid results for ?q=… queries.
  • Dynamic page: renders the selected Wikipedia-derived dataset and chart shell.
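
A rough sketch of how that mode decision might look inside the SSR Worker, assuming a single resolver keyed off the URL (type and function names are illustrative):

```ts
// Illustrative routing sketch for the SSR Worker; names are assumptions.
type PageMode =
  | { kind: "home" }
  | { kind: "search"; query: string }
  | { kind: "wiki"; lang: string; slug: string };

function resolveMode(url: URL): PageMode {
  const q = url.searchParams.get("q");
  if (url.pathname === "/") {
    return q ? { kind: "search", query: q } : { kind: "home" };
  }
  // Expecting /{lang}/{slug}, e.g. /en/some-wikipedia-derived-slug
  const [, lang, slug] = url.pathname.split("/");
  if (lang && slug) return { kind: "wiki", lang, slug };
  return { kind: "home" }; // fallback: render the home grid
}
```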

4) How it works: Auto-Charts (client pipeline)

After the SSR-rendered page loads, TOPZLE conditionally enhances the content: the markup embeds a compact payload of headers/rows, then the client boot code infers columns and selects the right chart renderer. D3 loads only when needed.

Client pipeline: AutoCharts shell (embeds the compact dataset) → boot script (loads only on pages with tables) → lazy D3 (downloaded only when a chart is needed) → inference (name / year / value + formats) → chart render (bars / multiKey / timeline / map).
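
The boot step could look roughly like this: a sketch assuming the dataset sits in a data attribute on the chart shell and D3 arrives via dynamic import (attribute and parameter names are illustrative):

```ts
// Illustrative boot sketch: only pages that rendered a chart shell pay for D3.
async function bootAutoCharts(
  renderChart: (d3: unknown, el: HTMLElement, payload: unknown) => void,
): Promise<void> {
  const shell = document.querySelector<HTMLElement>("[data-autochart]");
  if (!shell) return; // no table-backed chart on this page: load nothing extra

  // Compact dataset (headers + rows) embedded by SSR in a data attribute.
  const payload = JSON.parse(shell.dataset.autochart ?? "null");
  if (!payload) return;

  // D3 is downloaded only when a chart will actually render.
  const d3 = await import("d3");
  renderChart(d3, shell, payload);
}
```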

A. Inference rules (why this works at scale)

  • Detects nameCol (most text), yearCol (year/range/date-like), and strong numeric columns.
  • Ignores non-metrics like “Rank/Peak” and other noise columns.
  • Handles nested headers via breadcrumb labels (“Parent — Child”).
  • Parses currencies, percents, and suffixes (K/M/B) conservatively.
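
Those rules reduce to simple scoring passes over the columns. A rough sketch, with regexes and thresholds that are assumptions rather than TOPZLE’s exact values:

```ts
// Rough column-inference sketch; heuristics and thresholds are illustrative.
type ColumnRole = "name" | "year" | "numeric" | "ignored";

const NOISE_HEADER = /^(rank|peak|no\.?|#|ref\.?|notes?)$/i;
const YEAR_LIKE = /^\s*\d{4}(\s*[–-]\s*\d{2,4})?\s*$/;              // "2018", "2018–19"
const NUMBER_LIKE = /^[^0-9]*[\d.,\s]+\s*(%|[KMB](illion)?)?\s*$/i; // "US$1,234", "45%", "1.2B"

function inferRoles(headers: string[], rows: string[][]): ColumnRole[] {
  const share = (col: number, test: (c: string) => boolean) => {
    const cells = rows.map((r) => (r[col] ?? "").trim()).filter(Boolean);
    return cells.length ? cells.filter(test).length / cells.length : 0;
  };
  return headers.map((header, col) => {
    if (NOISE_HEADER.test(header.trim())) return "ignored";    // "Rank", "Peak", refs
    if (share(col, (c) => YEAR_LIKE.test(c)) > 0.6) return "year";
    if (share(col, (c) => NUMBER_LIKE.test(c)) > 0.6) return "numeric";
    return "name"; // a later pass picks the text-heaviest candidate as nameCol
  });
}
```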

5) Chart modules (file-by-file)

Common modules: src/charts/bars.ts, src/charts/multiKeyBars.ts, src/charts/timeline.ts, src/charts/choropleth.ts, src/charts/rankList.ts

A. bars.ts

A deterministic single-metric ranked bar chart. Works when the dataset has a clear “name + value” shape. Designed to remain stable even when labels are long or values include formatting noise.

B. multiKeyBars.ts

Multi-metric comparison renderer (one entity, many numeric columns). Detects currency/percent per column and tolerates messy cells (including ranges) by taking the first numeric token consistently.
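
A sketch of that first-numeric-token rule as it might be implemented; the regexes and suffix table are assumptions:

```ts
// Illustrative cell parser: currency/percent hints plus K/M/B suffixes,
// always taking the first numeric token so ranges parse deterministically.
interface ParsedCell {
  value: number | null;
  isCurrency: boolean;
  isPercent: boolean;
}

const MULT: Record<string, number> = {
  k: 1e3, thousand: 1e3, m: 1e6, million: 1e6, b: 1e9, bn: 1e9, billion: 1e9,
};

function parseCell(raw: string): ParsedCell {
  const text = raw.replace(/\[\d+\]/g, "").trim();  // drop footnote markers like "[3]"
  const isCurrency = /[$€£¥]/.test(text);
  const isPercent = /%/.test(text);

  // First numeric token only: "12,345–13,000" → 12345; "US$1.2B" → 1.2 (scaled below)
  const match = text.match(/-?\d[\d,]*(\.\d+)?/);
  if (!match) return { value: null, isCurrency, isPercent };

  let value = Number(match[0].replace(/,/g, ""));
  const rest = text.slice((match.index ?? 0) + match[0].length);
  const suffix = rest.match(/^\s*(k|m|bn?|billion|million|thousand)\b/i)?.[1]?.toLowerCase();
  if (suffix && MULT[suffix]) value *= MULT[suffix]; // "1.2B" → 1_200_000_000

  return { value, isCurrency, isPercent };
}
```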

C. timeline.ts

Timeline renderer for year-based datasets, including year ranges (e.g., “2018–19”). Groups cards by year and keeps ordering strict — avoids “pretty but wrong” interpolation.
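
The year handling assumed here is small but important: a range like “2018–19” should group and sort under its starting year. A minimal sketch:

```ts
// Illustrative year extractor: ranges collapse to their starting year
// so grouping and ordering stay strict.
function timelineYear(cell: string): number | null {
  const m = cell.trim().match(/^(\d{4})(?:\s*[–-]\s*(\d{2,4}))?$/);
  return m ? Number(m[1]) : null;
}

// timelineYear("2018–19") -> 2018
// timelineYear("2021")    -> 2021
// timelineYear("ongoing") -> null (card falls into an undated bucket)
```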

D. choropleth.ts

Choropleth renderer that only activates when region matching is strong enough (>1 region matched). Normalizes region keys and uses aliasing to handle naming mismatches between datasets and geojson keys.
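
Region matching could be approximated like this; the alias table and the >1 activation threshold mirror the description above, while the specific aliases and normalization steps are assumptions:

```ts
// Illustrative region matcher for the choropleth gate.
const ALIASES: Record<string, string> = {
  "usa": "united states",
  "uk": "united kingdom",
  "czech republic": "czechia",
  // extended as mismatches between table names and geojson keys show up
};

const normalizeRegion = (name: string) =>
  name.toLowerCase().replace(/\(.*?\)/g, "").replace(/[^a-z\s]/g, "").trim();

function matchedRegions(rowNames: string[], geoKeys: string[]): Map<string, string> {
  const geo = new Map(geoKeys.map((k) => [normalizeRegion(k), k] as [string, string]));
  const matches = new Map<string, string>();
  for (const raw of rowNames) {
    const key = normalizeRegion(raw);
    const canonical = geo.get(ALIASES[key] ?? key);
    if (canonical) matches.set(raw, canonical);
  }
  return matches;
}

// The choropleth only activates when matches.size > 1, per the rule above.
```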

E. rankList.ts

A clean ranked list module (Gold/Silver/Bronze) optimized for scannability. Deterministic ordering with stable tie rules reduces churn across builds.
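
The stable tie rules might reduce to a comparator along these lines (field names are illustrative):

```ts
// Illustrative deterministic ordering: value descending, then a locale-stable
// name comparison, then the stable row ID, so rebuilds never reshuffle ties.
interface RankedRow {
  rowId: string;
  name: string;
  value: number;
}

function rankCompare(a: RankedRow, b: RankedRow): number {
  if (a.value !== b.value) return b.value - a.value;
  const byName = a.name.localeCompare(b.name, "en");
  if (byName !== 0) return byName;
  return a.rowId < b.rowId ? -1 : a.rowId > b.rowId ? 1 : 0;
}

// rows.sort(rankCompare), then the top three get Gold/Silver/Bronze styling.
```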

6) Parsing & normalization stack

The real work is converting “Wikipedia table HTML” into clean rows with predictable types. TOPZLE’s parsing layer is designed to survive weird encodings (mojibake), nested headers, mixed units, and partial data.

Normalization pipeline

1) Decode + clean text (UTF-8 + entity cleanup)
  • Fixes common mojibake + HTML entity artifacts.
  • Normalizes whitespace + punctuation noise.
2) Flatten headers (spans → breadcrumbs)
  • Handles row/col spans and nested header grids.
  • Produces labels like “Parent — Child” for stability.
3) Infer columns (name / year / numeric)
  • Finds year/range/date-like columns.
  • Finds strong numeric metrics; ignores “Rank/Peak”.
4) Parse values (currency / % / KMB)
  • Currency & percent hints per column.
  • Ranges become “first numeric token” for determinism.
5) Render decision (best-fit chart)
  • Timeline if years exist.
  • MultiKey bars if multiple metrics exist.
  • Choropleth only if region match is strong.
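
Step 5 is essentially a priority cascade. A sketch of how that decision could be expressed in code (field names and thresholds are assumptions, not TOPZLE’s exact logic):

```ts
// Illustrative render decision, mirroring the cascade above.
type ChartKind = "timeline" | "multiKeyBars" | "choropleth" | "bars" | "rankList";

interface DatasetShape {
  hasYearColumn: boolean;
  numericColumnCount: number;
  matchedRegionCount: number; // from the region-matching sketch earlier
}

function pickChart(shape: DatasetShape): ChartKind {
  if (shape.hasYearColumn) return "timeline";
  if (shape.numericColumnCount > 1) return "multiKeyBars";
  if (shape.matchedRegionCount > 1) return "choropleth";
  if (shape.numericColumnCount === 1) return "bars";
  return "rankList"; // fallback: plain ranked list when no strong metric exists
}
```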

7) Performance & SEO system

A. Performance

  • Minimal JS on list pages; charts load only when needed.
  • Lazy images + CLS-safe thumbnails with explicit dimensions.
  • Bounded work: inference and rendering are constrained per page to avoid UI stalls.

B. SEO

  • Semantic HTML + accessible landmarks.
  • OpenGraph/Twitter tags per page.
  • Canonical URLs, strong title/description rules.
  • JSON-LD surfaces: WebSite + ItemList (featured/pinned).
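
For context, a sketch of what the WebSite + ItemList JSON-LD surfaces could look like; the schema.org types are standard, while the names and URLs below are placeholders, not TOPZLE’s real values:

```ts
// Illustrative JSON-LD payloads (schema.org types); values are placeholders.
const webSiteJsonLd = {
  "@context": "https://schema.org",
  "@type": "WebSite",
  name: "TOPZLE",
  url: "https://example.com/",
  potentialAction: {
    "@type": "SearchAction",
    target: "https://example.com/?q={search_term_string}",
    "query-input": "required name=search_term_string",
  },
};

const featuredListJsonLd = {
  "@context": "https://schema.org",
  "@type": "ItemList",
  itemListElement: [
    { "@type": "ListItem", position: 1, name: "Example pinned list", url: "https://example.com/en/example-slug" },
    { "@type": "ListItem", position: 2, name: "Another pinned list", url: "https://example.com/en/another-slug" },
  ],
};
```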

8) Engineering constraints solved

A. Determinism across builds

TOPZLE enforces deterministic transforms and stable serialization so rebuilds don’t reshuffle ranks or reorder rows due to minor parsing noise. This reduces SEO churn and makes debugging tractable.
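
“Stable serialization” typically comes down to sorted-key output plus deterministic comparators. A minimal sketch of the idea:

```ts
// Illustrative stable serializer: object keys are emitted in sorted order so
// two builds of the same dataset produce byte-identical output.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null"; // undefined -> "null" for determinism
}
```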

B. Incremental updates without drift

Scraped sources evolve. The ingestion and parsing boundaries are designed for schema evolution and safer fallbacks so source changes degrade gracefully instead of breaking pages.

C. Expressive charts without per-page custom code

The module system makes it cheap to add new chart primitives and apply them broadly. Pages are “dataset + renderer decision”, not hand-built visualizations.
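
In practice that pattern is a small registry keyed by the render decision. The sketch below assumes each chart module exports a render function, which is an assumption about the code layout rather than a documented fact:

```ts
// Illustrative renderer registry: adding a chart primitive is one entry here,
// not per-page chart code. Exported names and signatures are assumptions.
type Renderer = (el: HTMLElement, dataset: unknown) => Promise<void> | void;

const RENDERERS: Record<string, () => Promise<Renderer>> = {
  bars:         async () => (await import("./charts/bars")).render,
  multiKeyBars: async () => (await import("./charts/multiKeyBars")).render,
  timeline:     async () => (await import("./charts/timeline")).render,
  choropleth:   async () => (await import("./charts/choropleth")).render,
  rankList:     async () => (await import("./charts/rankList")).render,
};

async function renderPage(kind: string, el: HTMLElement, dataset: unknown): Promise<void> {
  const render = await (RENDERERS[kind] ?? RENDERERS.rankList)();
  await render(el, dataset); // the page is just "dataset + renderer decision"
}
```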

9) Status & roadmap

TOPZLE is designed as a long-term platform. The next stages are focused on expanding chart primitives, improving provenance display (refresh time and source signals in-page), and strengthening “related lists” discovery without sacrificing performance.

For technical or partnership discussions related to TOPZLE, reach out via the Azonova contact form.