Eight stages · Every public listing · Always fresh

How every car ends up here.

The internet says one thing, the market says another. Public car listings sit scattered across a dozen places, in a dozen formats, with the same physical car often appearing in three of them at once. The AllCars indexer takes that sprawl and turns it into one clean, deduplicated, enriched, continuously refreshed feed, so you can search the whole market the way you'd search one website.

Pipeline stages
8
discover → index
Active listings
11k+
unified · deduplicated
Refresh cycle
many·daily
incremental · continuous
Price observations
700k+
append-only history

One feed. Always fresh. Nothing lost.

We index public listings the same way a search engine indexes the public web. Every listing is deduplicated against the rest, enriched with the specs and tax data the listing itself doesn't carry, and tracked over time, so the price story is preserved even when an ad disappears off the face of the internet.

Search the whole market

Stop hopping between tabs. Every public Cyprus car listing in one search box, with the same filters, the same scoring, the same fair-price band. It doesn't matter where the ad originally lives.

Always fresh

The market moves. The indexer runs continuously, so new listings, price drops and removals show up within hours, not weeks. What you see is the market right now, not a Tuesday snapshot from three weeks ago.

Nothing lost

Every observation gets logged. When a listing vanishes, the history doesn't vanish with it. First asking price, every drop, time to disappear, all kept. The market has a memory now.

From raw listing to search result, in eight stages.

Every listing flows through the same eight stations. Each one is a small, sharp idea, and each one earned its place by catching a specific class of failure the indexer used to ship. Painfully.

01 Discover: find listings · 02 Validate: reject garbage · 03 Normalise: canonical form · 04 Parse: extract tags · 05 Enrich: add specs · 06 Dedupe: merge twins · 07 Track: price history · 08 Index: search-ready
STEP 01

Discover

Find new public listings as they appear. Only stuff that's already publicly visible to anyone with a browser. No walls, no private feeds, no logins.

STEP 02

Validate

Reject impossible data at the door. Future-dated years, half-a-million-kilometre mileage, location strings dressed up as descriptions, the lot. Caught before they touch the index.

STEP 03

Normalise

One canonical form per make, model, body type, and fuel, across spelling variants, chassis codes and language quirks. "W211" and "E-Class" finally agree they're the same thing.

STEP 04

Parse

Pull structured tags out of free-form descriptions. Mileage hidden in a sentence, fuel type buried in a paragraph, service-history mentions, accident keywords. English, Greek, Greeklish, Russian and Russlish all handled.
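As a rough illustration of the idea, here is a minimal sketch of pulling mileage and a couple of tags out of free text. The regular expression, the unit spellings (`km`, `χλμ`, `км`), the keyword list and the tag names are all assumptions made for this example, not the indexer's actual rules, and Greeklish and Russlish handling is left out entirely.

```python
import re
from typing import Optional, Set

# Assumed pattern: a number with optional thousands separators, then a unit.
MILEAGE = re.compile(
    r"(?P<num>\d{1,3}(?:[.,]\d{3})+|\d+)\s*(?P<unit>km|χλμ|км)",
    re.IGNORECASE,
)
ACCIDENT_WORDS = ("accident", "crash", "damaged")  # hypothetical keyword list


def parse_mileage_km(text: str) -> Optional[int]:
    """Return the first mileage-looking number in the text, or None."""
    match = MILEAGE.search(text)
    if match is None:
        return None
    # Strip thousands separators such as "82,000" or "120.000".
    return int(re.sub(r"[.,]", "", match.group("num")))


def parse_tags(text: str) -> Set[str]:
    """Return a small set of illustrative tags found in the description."""
    lower = text.lower()
    tags = set()
    if "service history" in lower:
        tags.add("service-history")
    if any(word in lower for word in ACCIDENT_WORDS):
        tags.add("accident-mentioned")
    return tags
```

A keyword hit only says the word was mentioned. "Never in an accident" still gets the accident tag, so a real parser would need more context than this sketch uses.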

STEP 05

Enrich

Bolt on the specs the listing doesn't carry. Horsepower, torque, fuel consumption, kerb weight, Cyprus road-tax band, all pulled from public vehicle databases. The listing becomes a vehicle.

STEP 06

Deduplicate

Three listings of the same physical car? One vehicle record, one timeline. Image fingerprints and spec match find the twins. Hard veto rules stop false merges from happening.

STEP 07

Track

Append-only observation log. Every price seen, every change, every disappearance, all preserved. The feed shows you today. The history shows you the whole story.

STEP 08

Index

Pre-compute everything search needs (enrichment, scoring, tags, road tax) into a hot index. Warm queries return in under half a second. The pricing engine plugs in here too.

Discovery, public listings only.

The indexer behaves like a polite reader. It looks at the same public pages a buyer would, in moderate cadences, with standard request hygiene. No private accounts, no walled areas, no personal-data harvesting.

What flows in is exactly what's already on the open web: the public listing. Make, model, year, mileage, price, public photos, the seller's own free-form description. That's the entire input.

Anything inside a login wall, anything marked private, and anything that looks like personal contact data is left alone. The index reflects the public market. Full stop.

[Figure: public listings (make · model · year · mileage · price) stream in and leave as one canonical record each. Safe by design: public-only · no logins · no PII collected · polite cadence.]
[Figure: validation gate. Checks: year ≤ now+1 · year ≥ 1900 · mileage < 1,000,000 · price ≥ €500 · desc has digits · non-car keyword block. Rejected examples: year = 2031 · 575M km · "Limassol". About 1.5% of incoming records bounce here; the gate catches 100% of known-impossible cases.]

Reject impossible data at the door.

The cheapest bug to fix is the one that never enters the system. Validation runs before anything else: future-dated registrations, integer-overflow mileage, prices below a spare-parts floor, descriptions made of nothing but a city name, parts-and-accessories listings dressed up as cars.

Everything that survives this gate is at least plausibly a real car, which means downstream stages don't need defensive logic for the obviously broken cases.
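A gate like this can be sketched in a few lines. This is a minimal illustration, not the indexer's code: the thresholds loosely follow the figures quoted on this page, while the field names, the place-name list and the parts keywords are assumptions invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Illustrative thresholds and word lists (assumptions, not the real rule set).
MIN_YEAR = 1900
MAX_MILEAGE_KM = 1_000_000
MIN_PRICE_EUR = 500
KNOWN_PLACES = {"limassol", "nicosia", "larnaca", "paphos"}
NON_CAR_KEYWORDS = ("spare parts", "roof rack", "wheels only")


@dataclass
class RawListing:
    year: int
    mileage_km: int
    price_eur: int
    description: str


def rejection_reasons(listing: RawListing, today: Optional[date] = None) -> List[str]:
    """Return every reason the listing fails the gate; an empty list means admit."""
    today = today or date.today()
    reasons = []
    if not MIN_YEAR <= listing.year <= today.year + 1:
        reasons.append("implausible year")
    if not 0 <= listing.mileage_km < MAX_MILEAGE_KM:
        reasons.append("implausible mileage")
    if listing.price_eur < MIN_PRICE_EUR:
        reasons.append("price below floor")
    text = listing.description.strip().lower()
    if text in KNOWN_PLACES:
        reasons.append("description is only a place name")
    if any(keyword in text for keyword in NON_CAR_KEYWORDS):
        reasons.append("looks like parts or accessories")
    return reasons
```

Returning the full list of reasons, rather than stopping at the first failure, makes it easier to see which rule a rejected record tripped.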

One canonical name per car.

The same car shows up under five different names. Chassis codes in parentheses, brand variants spelled four ways, dealer-specific shorthand, language mixes. Without normalisation, search for one model and you miss half the market.

A rules engine collapses every variant down to one canonical identity per make, model, body and fuel type. So one search for "E-Class" finds them all, and the pricing engine peers them against each other properly.
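To make the idea concrete, here is a minimal sketch of a rules table that maps messy titles to one canonical identity. The two rules, the `clean` helper and the function names are hypothetical. They cover only make and model for one car, and they show the shape of such an engine rather than the 100+ rules the indexer actually runs.

```python
import re
from typing import Optional, Tuple

# Hypothetical rules: (pattern over the cleaned title, canonical make, canonical model).
RULES = [
    (re.compile(r"\bw211\b"), "Mercedes", "E-Class"),
    (re.compile(r"\be\s?class\b"), "Mercedes", "E-Class"),
]


def clean(title: str) -> str:
    """Lowercase the title and turn dashes, slashes, commas and brackets into spaces."""
    return re.sub(r"[-/,()]+", " ", title.lower()).strip()


def canonicalise(title: str) -> Optional[Tuple[str, str]]:
    """Return (make, model) from the first matching rule, or None if nothing matches."""
    text = clean(title)
    for pattern, make, model in RULES:
        if pattern.search(text):
            return (make, model)
    return None
```

Cleaning punctuation before matching is what lets a single pattern cover "E-Class", "E CLASS" and "E class" without a rule per spelling.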

[Figure: "Mercedes-Benz W211", "MERCEDES E CLASS", "Mercedes Benz E-Class", "Mercedes E220 / W211" and "μερσεντές E class" all collapse to the canonical MERCEDES · E-CLASS. 100+ rules · chassis codes · spelling variants · language mixes · dash/comma handling.]
[Figure: a raw listing (make Audi · model A4 · year 2018 · mileage 82,000 km · price €18,500 · fuel diesel · engine 2.0) becomes an enriched record (+ horsepower 190 HP · + torque 400 Nm · + 0–100 7.6 s · + fuel cons. 5.1 l/100 · + kerb weight 1,560 kg · + road tax band: CO2 era · + road tax €225 / yr · + description tags: service · history · + photo quality: 12 photos · HD · + deal score 82 / 100).]

Add the data the listing forgot.

A listing tells you what the seller chose to type. It rarely tells you horsepower, torque, fuel consumption, kerb weight, or what your road tax bill is going to look like in January. The indexer bolts these on from public vehicle databases the moment a model is recognised.

The Cyprus road-tax calculator is built in: three registration eras, CO2 bands, Euro surcharges. UK imports get the dual-rate display so there's no nasty surprise three weeks after you collect the car.

By the time a listing leaves enrichment, it's no longer a listing. It's a fully described vehicle.

Same car, one timeline.

The same physical car often appears in three listings at once. Different prices, different photos, different descriptions. Without dedup, every search returns the same car five times over and the price chart looks like noise.

The deduper merges twins using two signal families: image fingerprints (perceptual hashes computed over public photos, so the original images aren't retained as personal data) and spec match (year + make + model + mileage + engine close enough to be the same car).

Hard veto rules stop false merges. Mismatch in body type, fuel, year, or colour and the merge is rejected, even if every other signal is screaming "match". Different cars stay different cars. Always.
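The interplay between the two signals and the vetoes can be sketched as follows. Everything specific here is an assumption for illustration: the field names, the 64-bit integer hashes, the bit and mileage tolerances, and the choice to require both signals before merging (the page does not say how they are combined). Engine matching is omitted for brevity.

```python
from dataclasses import dataclass

# Illustrative tolerances (assumptions, not the indexer's real values).
MAX_HASH_DISTANCE = 6       # differing bits allowed between two perceptual hashes
MAX_MILEAGE_GAP_KM = 1_000  # mileage difference still treated as the same car


@dataclass(frozen=True)
class Listing:
    make: str
    model: str
    year: int
    body: str
    fuel: str
    colour: str
    mileage_km: int
    image_hashes: frozenset  # perceptual hashes stored as ints


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def vetoed(a: Listing, b: Listing) -> bool:
    """Hard veto: any mismatch in body, fuel, year or colour blocks the merge."""
    return (a.body, a.fuel, a.year, a.colour) != (b.body, b.fuel, b.year, b.colour)


def images_match(a: Listing, b: Listing) -> bool:
    """True if any pair of photos is within the allowed hash distance."""
    return any(
        hamming(x, y) <= MAX_HASH_DISTANCE
        for x in a.image_hashes
        for y in b.image_hashes
    )


def specs_match(a: Listing, b: Listing) -> bool:
    """True if make and model agree and the mileages are close."""
    same_model = (a.make, a.model) == (b.make, b.model)
    return same_model and abs(a.mileage_km - b.mileage_km) <= MAX_MILEAGE_GAP_KM


def same_car(a: Listing, b: Listing) -> bool:
    """Vetoes are checked first, so no amount of signal can override them."""
    if vetoed(a, b):
        return False
    return images_match(a, b) and specs_match(a, b)
```

Checking the veto before anything else is the design point: a strong image match never gets the chance to outvote a colour or fuel mismatch.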

[Figure: listing #A (2018 A4 · 82,000 km · €18,500 · 12 photos), listing #B (2018 A4 · 82,000 km · €18,200 · 10 photos · same) and listing #C (2018 A4 · 82,500 km · €17,900 · 8 photos · same) pass the image hash and spec match checks and merge into VEHICLE #v412: 2018 Audi A4 · 3 listings · 1 timeline · €18,500 → €17,900. Hard vetoes: body · fuel · year · colour mismatch → no merge.]
[Figure: price against days live. Appeared day 0 · drop 1 day 14 · drop 2 day 32 · vanished day 47. €18,500 · €17,900 · €17,200 · removed · preserved as one timeline.]

Every observation, every change, kept.

The lifecycle layer is an append-only log of every observation. First asking price, every subsequent drop, the days the ad was live, the moment it disappeared. Nothing is overwritten. Nothing is forgotten.

When a listing vanishes, that's a signal too. Cars that disappear within 48 hours of a price drop probably sold. Cars that linger for 90 days probably didn't. The pricing engine reads this stream to correct for survivorship bias.

And the buyer gets the full price story instead of a snapshot.
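One way to picture the lifecycle layer is an append-only list of observations plus a small heuristic on top. This is a sketch under stated assumptions: the class and field names are made up, a disappearance is modelled as an observation with no price, and `likely_sold` encodes only the 48-hour rule of thumb mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    vehicle_id: str
    seen_at: datetime
    price_eur: Optional[int]  # None records that the ad was no longer visible


class ObservationLog:
    """Append-only: rows are added, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: List[Observation] = []

    def append(self, observation: Observation) -> None:
        self._rows.append(observation)

    def history(self, vehicle_id: str) -> List[Observation]:
        rows = [o for o in self._rows if o.vehicle_id == vehicle_id]
        return sorted(rows, key=lambda o: o.seen_at)


def likely_sold(history: List[Observation]) -> bool:
    """True if the ad vanished within 48 hours of its most recent price drop."""
    priced = [o for o in history if o.price_eur is not None]
    gone = [o for o in history if o.price_eur is None]
    drops = [cur for prev, cur in zip(priced, priced[1:]) if cur.price_eur < prev.price_eur]
    if not gone or not drops:
        return False
    gap = gone[-1].seen_at - drops[-1].seen_at
    return timedelta(0) <= gap <= timedelta(hours=48)
```

Because the log is never rewritten, the same rows can answer both "what is the price today" and "what happened to this car over its whole life".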

Pre-computed. Hot. Sub-second.

When you tap "search", you don't want the engine to start thinking. You want the answer. The final stage materialises everything (enrichment, scoring, tags, road tax, deal band) into a hot index that warm queries hit in under half a second for typical filter combinations.

The pricing engine plugs in here. The browse experience plugs in here. Saved-search alerts plug in here. One source of truth, many surfaces.

When the upstream data changes, only the affected slice re-materialises. Incremental, not nuclear.
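The "only the affected slice" idea can be sketched with a dirty set. This is a minimal in-memory illustration, not the real index: `HotIndex`, `mark_changed` and the `build_row` callback are hypothetical names, and a production index would live in a database or search engine rather than a dict.

```python
from typing import Callable, Dict, List, Optional, Set

Row = Dict[str, object]


class HotIndex:
    """Keeps search-ready rows and rebuilds only the ones marked as changed."""

    def __init__(self, build_row: Callable[[str], Optional[Row]]) -> None:
        # build_row is an assumed callback: vehicle_id -> search-ready row,
        # or None when the vehicle should no longer appear in search.
        self._build_row = build_row
        self._rows: Dict[str, Row] = {}
        self._dirty: Set[str] = set()

    def mark_changed(self, vehicle_id: str) -> None:
        self._dirty.add(vehicle_id)

    def refresh(self) -> int:
        """Re-materialise the dirty rows only and return how many were processed."""
        processed = 0
        for vehicle_id in sorted(self._dirty):
            row = self._build_row(vehicle_id)
            if row is None:
                self._rows.pop(vehicle_id, None)
            else:
                self._rows[vehicle_id] = row
            processed += 1
        self._dirty.clear()
        return processed

    def search(self, predicate: Callable[[Row], bool]) -> List[Row]:
        return [row for row in self._rows.values() if predicate(row)]
```

Queries only ever read rows that are already built, which is what keeps the expensive work off the search path.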

[Figure: a query (make = Audi · 2016–2020) is answered in < 0.5 s from the hot index, which combines vehicles, enrichment, and scores + tags → 342 results · sorted by deal score · paged.]

What's running, right now.

The pipeline is not a side project. It runs continuously, gets instrumented like infrastructure, and earns every line through an incident. Here's the order-of-magnitude shape of it.

many·daily
incremental refreshes per day
11k+
active deduplicated listings
700k+
price observations logged
100+
make / model normalisation rules
1.3k+
duplicate listings merged into one car
< 0.5 s
warm query response, indexed
5
languages parsed (en · gr · greeklish · ru · russlish)
0
personal data collected from sellers

Public market, public data, light footprint.

AllCars indexes a public market. We treat the underlying sources the way any considerate reader would, and we draw a hard line around personal data. These rules aren't aspirational. They're baked into the indexer itself.

Public listings only

Anything publicly visible to a logged-out browser is fair to read. Anything behind a wall is not. The indexer never attempts to access private feeds, paywalled content, or areas that require authentication.

No personal data

The index stores facts about cars, not facts about people. No personal contact details, no buyer or seller profiling, no tracking individuals across the network. The dataset describes vehicles in a public market.

Polite cadence

Refresh runs are paced to be a small fraction of normal public traffic and respect platform-level signals. The indexer is a quiet reader, not a load test.

Right to be removed

A removal request, a hide-from-index request, or a source-side opt-out is honoured promptly across the index, history, and search results. It's just me running this: message me on Telegram and the record is gone the same day.

Now go search the whole market.

One feed. Always fresh. Deduplicated, enriched, scored. Spend your attention on cars, not tabs.