Client data platform
Search infrastructure and data pipeline engineering

2M+ record search workflow

Large-Scale Search Platform

Search infrastructure that made 2M+ records usable through a WordPress front end.

Project summary

Built and maintained the backend pipeline that let a large client dataset stay searchable through WordPress without forcing the CMS to store or query everything directly.

A WordPress-facing search experience backed by Python, PostgreSQL, Elasticsearch, and queues.

Dataset size: 2M+ records

Frontend shell: WordPress

Buyer-facing summary

Client problem

The product needed fast, reliable search and filtering across a large dataset.

What I delivered

I delivered backend search logic, API-driven data access, and production-ready implementation patterns for large record sets and user-facing queries.

Business result

The platform could support high-volume search workflows with a cleaner experience for users.

Problem

The client needed a very large dataset, well beyond two million records, to be searchable by end users through a WordPress front end.

Keeping that data inside WordPress would have made the CMS do the wrong job. The real problem was building a proper data pipeline and search stack while preserving WordPress as the public interface.

The client needed WordPress to remain the public-facing layer while the real data volume outgrew typical CMS patterns.
The dataset updated often enough that import, indexing, and export workflows all needed to recover cleanly from partial failure.

What I built

Separated data and presentation layers

Kept the dataset in PostgreSQL, indexed search in Elasticsearch, and let WordPress focus on presenting search results rather than storing or querying millions of records directly.

Python ingestion pipeline

Built batching, normalization, resume-after-failure behavior, and indexing flows that could handle ongoing imports without collapsing under volume.
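As an illustration of the batching and resume-after-failure idea, here is a minimal sketch rather than the production code. The checkpoint file, the `normalize` rules, and the `index_batch` callback are all hypothetical, and the sketch assumes the source yields records in a stable order between runs.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Callable, Iterable, Iterator


def normalize(record: dict) -> dict:
    """Lowercase keys and trim string values (illustrative rules only)."""
    return {
        key.strip().lower(): value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }


def batched(records: Iterable[dict], size: int) -> Iterator[list]:
    """Yield lists of at most `size` records."""
    iterator = iter(records)
    while batch := list(islice(iterator, size)):
        yield batch


def run_import(
    records: Iterable[dict],
    checkpoint: Path,
    index_batch: Callable[[list], None],
    batch_size: int = 500,
) -> int:
    """Process batches in order, skipping those an earlier run completed.

    Returns the number of batches completed so far.
    """
    done = 0
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text())["batches_done"]
    for number, batch in enumerate(batched(records, batch_size)):
        if number < done:
            continue  # finished before the earlier failure
        index_batch([normalize(record) for record in batch])
        # Record progress only after the batch has succeeded.
        done = number + 1
        checkpoint.write_text(json.dumps({"batches_done": done}))
    return done
```

If `index_batch` raises partway through, the checkpoint still points at the last batch that succeeded, so calling `run_import` again with the same source picks up from the failed batch instead of starting over.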

RabbitMQ-based job orchestration

Moved large processing steps into queued workflows so re-indexing and heavy operations did not block the rest of the platform.
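The queued-workflow pattern can be sketched roughly as follows. This is an assumption-laden outline, not the project's code: the queue name, job type, and handler are hypothetical, and `channel` is only expected to expose a `basic_publish` method like the one on a pika channel.

```python
import json

REINDEX_QUEUE = "reindex_jobs"  # hypothetical queue name


def build_job(job_type: str, payload: dict) -> bytes:
    """Serialize a job as JSON so any worker can decode it."""
    return json.dumps({"type": job_type, "payload": payload}).encode("utf-8")


def enqueue(channel, job_type: str, payload: dict) -> None:
    """Publish a job via the default exchange, routed by queue name.

    A real deployment would likely also declare the queue as durable and
    mark messages persistent; that setup is omitted from this sketch.
    """
    channel.basic_publish(
        exchange="",
        routing_key=REINDEX_QUEUE,
        body=build_job(job_type, payload),
    )


def reindex_range(payload: dict) -> str:
    """Placeholder for the heavy work a worker would do off the request path."""
    return f"reindexed ids {payload['start']}..{payload['end']}"


HANDLERS = {"reindex_range": reindex_range}


def handle(body: bytes) -> str:
    """Worker side: decode a message and dispatch it to its handler."""
    job = json.loads(body)
    handler = HANDLERS.get(job["type"])
    if handler is None:
        raise ValueError(f"unknown job type: {job['type']}")
    return handler(job["payload"])
```

The web-facing side only calls `enqueue` and returns immediately; a separate worker process consumes messages and calls `handle`, which is what keeps heavy operations from blocking the rest of the platform.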

AWS-backed export handling

Implemented export flows that wrote large outputs to AWS rather than trying to generate everything synchronously inside a user request.
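A rough sketch of that export shape, under stated assumptions: the source only says the outputs went to AWS, so S3 as the destination, the gzipped CSV format, and the function name are illustrative choices. `s3_client` is expected to expose `upload_file(filename, bucket, key)`, as a boto3 S3 client does.

```python
import csv
import gzip
import os
import tempfile
from typing import Iterable, Sequence


def export_to_s3(
    rows: Iterable[Sequence],
    header: Sequence[str],
    s3_client,
    bucket: str,
    key: str,
) -> int:
    """Stream rows into a gzipped CSV on disk, upload it, return the row count.

    Rows are written one at a time, so memory use stays flat regardless
    of how large the export is.
    """
    fd, path = tempfile.mkstemp(suffix=".csv.gz")
    os.close(fd)
    count = 0
    try:
        with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            for row in rows:
                writer.writerow(row)
                count += 1
        s3_client.upload_file(path, bucket, key)
    finally:
        os.remove(path)  # clean up the temp file whether or not upload succeeded
    return count
```

Run from a background job rather than a web request, a function like this lets the user-facing side simply report that an export is in progress and link to the object once it exists.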

Python ingestion and normalization pipeline
PostgreSQL source-of-truth storage
Elasticsearch indexing and query design
RabbitMQ-backed asynchronous job workflows
AWS-based export handling

Technical decisions

PostgreSQL and Elasticsearch were separated intentionally so WordPress could stay focused on presentation instead of becoming the bottleneck.
Index mapping and query design had to be iterated against real search behavior, not just default Elasticsearch settings.

Elasticsearch defaults are not enough once query patterns and data volume become real. Index mapping and query structure needed deliberate tuning around how people actually searched the data.
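To make "deliberate tuning" concrete, here is a hedged example of what an explicit mapping and a matching query body can look like. The field names (`title`, `category`, `updated_at`) are hypothetical and not taken from the client's schema; the structures follow the standard Elasticsearch mapping and query DSL.

```python
from typing import Optional

# Explicit mapping instead of dynamic defaults: full-text search on `title`,
# exact-match filtering on `category`, and a keyword sub-field for sorting.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "fields": {"raw": {"type": "keyword"}}},
            "category": {"type": "keyword"},
            "updated_at": {"type": "date"},
        }
    }
}


def build_search_body(
    text: str,
    category: Optional[str] = None,
    page: int = 0,
    size: int = 20,
) -> dict:
    """Build a query body: scored text match plus an optional unscored filter."""
    query = {
        "bool": {
            "must": [{"match": {"title": {"query": text, "operator": "and"}}}]
        }
    }
    if category:
        # Filters skip scoring and can be cached, unlike `must` clauses.
        query["bool"]["filter"] = [{"term": {"category": category}}]
    return {
        "query": query,
        "from": page * size,
        "size": size,
        "sort": ["_score", {"title.raw": "asc"}],
    }
```

Changing a field's type in an existing mapping generally means building a new index and re-indexing into it, which is why mapping decisions are worth making against real search behavior early.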

A meaningful share of the work was about reliability: batching, recovery after partial failure, and keeping the public search experience insulated from backend processing jobs.

Outcome

Users get fast search over a dataset that is much too large for a normal WordPress implementation.
The pipeline keeps absorbing updates without turning re-indexing into downtime.
WordPress remains useful as the public shell because the heavy lifting is handled elsewhere in the stack.

What I would improve

I would front-load more of the index-mapping and search-pattern analysis before the first rounds of reactive optimization.

The system worked out well, but some re-indexing work could have been avoided with earlier analysis of real usage patterns.

Tech stack

Python, PostgreSQL, Elasticsearch, RabbitMQ, AWS, WordPress

Next step

If you need similar work, let’s talk through the constraints first.

The useful part of a project like this usually starts before code: understanding what the CMS should own, what should live in a backend service, and where integrations or automation can stay maintainable.
