Company Z
Introduction
During the 2025 Winter Cohort, CodeLab had the tremendous opportunity to work with a Fortune 15 firm that is one of the world's premier leaders in renewable energy and natural gas. To respect the highly competitive landscape in which the firm operates and our contractual obligations, we are unable to name the company directly. However, we are excited to detail the distributed geospatial data scraper we developed, which currently powers the market research of the firm's renewable energy and natural gas (RNG) team.
Our solution provided a scalable, distributed, and cost-effective platform for data analysts and market researchers to identify potential locations for expanding their RNG station portfolio. This was one of two projects we completed for the firm; you can read about the second here.
Timeframe
January — May 2025 | 8 weeks
Tools
Design — Figma, Adobe After Effects
Development — Next.js, Tailwind CSS, RabbitMQ, JavaScript, FastAPI, PostgreSQL, Docker, Azure Kubernetes Service
Maintenance and Management — Azure, GitHub, Notion, Jira
Approaching the Tech Stack
User Experience and Front End: Developing for the UX
We knew we needed a highly adaptable, modern, and maintainable ecosystem for our application, one that balanced compatibility with efficiency.
TanStack Table was a natural fit for displaying job data in our frontend. Our data, stored in PostgreSQL and exported in Excel-optimized formats, was already organized in a column-based structure. This made it easy to map directly into TanStack's table components, reducing overhead and accelerating development while keeping the interface sleek and easily readable.
The Engine Driving the Project
We selected Puppeteer as our primary scraping tool for its deep control over browser behavior, flexibility in navigating complex DOM structures, and ability to simulate real user interactions. Its headless browser automation provided the precision and reliability needed to extract structured data from our geospatial data source. Beyond just scraping, Puppeteer allowed us to implement sophisticated anti-bot evasion techniques, such as randomized input timing and user-agent spoofing, which significantly increased our scrape success rates across protected platforms.
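Our scrapers themselves are written in JavaScript on top of Puppeteer. To keep this write-up's code examples in a single language, the sketch below illustrates the same evasion ideas (user-agent rotation and randomized input timing) in Python using Playwright as a stand-in; it is a minimal illustration under assumed selectors and URLs, not our production scraper.

```python
# Illustrative anti-bot hygiene: randomized typing delays and user-agent rotation.
# Playwright (Python) is used here only as a stand-in for our Puppeteer scrapers.
import random
from playwright.sync_api import sync_playwright

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def scrape_zip(zip_code: str) -> str:
    """Fetch one results page for a ZIP code with basic evasion measures."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # Rotate user agents so each session presents as a different client.
        context = browser.new_context(user_agent=random.choice(USER_AGENTS))
        page = context.new_page()
        page.goto("https://example.com/search", wait_until="networkidle")
        # Type the query with human-like, randomized keystroke delays.
        page.click("input[name='location']")
        page.keyboard.type(zip_code, delay=random.randint(60, 180))
        page.keyboard.press("Enter")
        page.wait_for_timeout(random.randint(1500, 3500))  # jittered wait
        html = page.content()
        browser.close()
        return html
```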
Distributing Work with RabbitMQ
Given the national scope of the project, with a single job able to span more than 41,000 ZIP codes across the continental U.S., scalability and job throughput were top priorities. RabbitMQ let us decouple task generation from task execution, acting as a fault-tolerant message broker between the scheduling system and the network of scraper nodes. This architecture allowed us to dynamically queue, balance, and distribute scraping jobs across multiple regions and worker instances.
Distributing work this way was critical to keeping job completion times low and system performance consistent, even under peak load. The message-based approach also gave us fine-grained control over retries, monitoring, and resource usage.
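As a rough sketch of the scheduler side of this design, the snippet below publishes per-ZIP tasks as persistent messages to a durable queue. The queue name and message fields are illustrative rather than our exact production schema.

```python
# Illustrative scheduler-side publisher using pika; queue and field names are
# hypothetical, not our exact production schema.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_jobs", durable=True)  # survive broker restarts

def enqueue_task(job_id: str, zip_code: str, business_type: str | None = None) -> None:
    """Publish one ZIP-code scraping task for the worker pool to pick up."""
    payload = {"job_id": job_id, "zip_code": zip_code, "business_type": business_type}
    channel.basic_publish(
        exchange="",
        routing_key="scrape_jobs",
        body=json.dumps(payload),
        properties=pika.BasicProperties(delivery_mode=2),  # persistent message
    )

# A national job fans out into one task per ZIP code.
for zip_code in ["95616", "95814", "94103"]:
    enqueue_task(job_id="job-123", zip_code=zip_code)
```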
Frictionless Integration
In our conversations with application architects at the firm, they told us they were unfamiliar with many newer frameworks and tools. To ensure a seamless integration and minimize ramp-up time, we intentionally aligned our solution with the firm's current tech stack, choosing PostgreSQL as our database and planning deployment within their existing Azure cloud environment. This approach respected their infrastructure team's technical comfort zone while reducing overhead, improving maintainability, and enabling quicker deployment within established infrastructure and workflows.
Meet the Team!
Our Problem
Our partner firm aims to grow its clientele of companies that manage large fleets of RNG-powered vehicles. To do so, it needs to identify strategic points where logistics centers cluster and where building an RNG station would therefore be ideal. The RNG team is tasked with finding companies that maintain sufficiently large fleets in these high-density logistics areas.
The traditional method of market research is extremely time-consuming: market analysts must manually collect the addresses of businesses to analyze. This led to long turnaround times and missing or inaccurate data, as the firm was relying on a hybrid SaaS-and-manual solution that was both expensive and inefficient.
Development Process
Developing this application challenged us and taught us a great deal about distributed systems, web scraping, and cloud infrastructure. The diagram below illustrates our end-to-end architecture.
We’ll be breaking each section down to show how we were able to leverage this distributed strategy to deliver on the firm’s business needs.
Client Tier
Our frontend architecture is centered around a Next.js web application, chosen for its performance, flexibility, and seamless developer experience. We integrated TanStack Table for advanced data presentation, particularly table rendering, to support our structured and data-heavy workflows.
Tailwind CSS powers our styling layer, enabling fast, consistent, and responsive UI development. The frontend communicates with our FastAPI backend through RESTful HTTP/HTTPS requests, keeping the interface dynamic and data-driven.
Distributed Scraping Infrastructure
The distributed scraping infrastructure was designed to be efficient, scalable, and intelligent, capable of handling thousands of ZIP code-specific scraping tasks with speed and reliability.
At the core of the system is a scheduling service that generates job tasks and passes them into a RabbitMQ message broker, which acts as the central queue — decoupling task distribution from execution to maximize throughput and fault tolerance. Deployed across multiple nodes, Puppeteer-based web scrapers consume these tasks in parallel, dramatically reducing scraping time compared to traditional sequential methods.
Each scraper is equipped with anti-bot measures such as headless detection evasion and randomized behavior, improving success rates across a wide range of target websites.
A key feature of the system is its ability to optimize job distribution algorithmically: it monitors load imbalances and identifies regions with heavy task volumes, then redistributes jobs across the scraper pool in real time to maintain efficiency and prevent slowdowns. In the event of a failure, reports are automatically pushed back to RabbitMQ, enabling retry logic or logging without manual intervention.
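A worker's consume loop under this design might look roughly like the sketch below; the failure queue, the single-retry policy, and the run_scraper wrapper are assumptions for illustration, not our production code.

```python
# Illustrative worker-side consumer with simple retry handling; queue names and
# the retry policy are assumptions for this sketch.
import json
import pika

def run_scraper(task: dict) -> None:
    """Stand-in for invoking the Puppeteer scraper for one ZIP-code task."""
    ...

def handle_task(ch, method, properties, body):
    task = json.loads(body)
    try:
        run_scraper(task)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception as exc:
        # Push a failure report for monitoring, then requeue the task once.
        report = json.dumps({"task": task, "error": str(exc)})
        ch.basic_publish(exchange="", routing_key="scrape_failures", body=report)
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=not method.redelivered)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_jobs", durable=True)
channel.queue_declare(queue="scrape_failures", durable=True)
channel.basic_qos(prefetch_count=1)  # one task at a time keeps load evenly spread
channel.basic_consume(queue="scrape_jobs", on_message_callback=handle_task)
channel.start_consuming()
```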
Once data is successfully collected, it’s passed downstream to the ZIP Data Management Service as structured, validated output — ensuring consistency, accuracy, and readiness for analysis. This modular, event-driven design provides a robust foundation for high-volume scraping with built-in resiliency, maintainability, and intelligent workload management.
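To make "structured, validated output" concrete, a record schema along the lines below captures the shape of a clean result before it enters the ZIP Data Management Service; the use of Pydantic here and the field names are illustrative assumptions.

```python
# Hypothetical shape of one validated scrape result; field names are illustrative.
from pydantic import BaseModel, field_validator

class BusinessRecord(BaseModel):
    name: str
    address: str
    zip_code: str
    business_type: str | None = None

    @field_validator("zip_code")
    @classmethod
    def five_digit_zip(cls, value: str) -> str:
        if not (len(value) == 5 and value.isdigit()):
            raise ValueError("ZIP code must be exactly five digits")
        return value
```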
API and Application Tier
The API and Application Tier serves as the central hub for user interactions, job scheduling, and job data management.
At the center of this tier is the FastAPI backend service, which handles all HTTP/HTTPS requests from the Next.js frontend. Acting as the primary API layer, it processes user inputs, queries the relational database via ORM, manages ZIP-level data for region targeting, and triggers new scraping jobs as needed.
The relational database supports this flow by storing user information, job metadata, and system state using efficient, structured SQL queries. When new scrape jobs are initiated — either through user actions or system triggers — the FastAPI service communicates with the job scheduler, which creates the tasks and publishes them to the RabbitMQ message broker powering the distributed scraping system. Scraped data is ingested into the ZIP Data Management Service, which receives input from both the backend and directly from Puppeteer scrapers, ensuring that the data remains current and accurate.
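A simplified sketch of the job-creation flow through this tier is shown below; the route, request fields, and the ORM and scheduler helpers are illustrative stand-ins rather than our actual service code.

```python
# Simplified FastAPI sketch of job creation; the route, fields, and helpers are
# hypothetical stand-ins for our actual service.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class JobRequest(BaseModel):
    company_name: str | None = None   # target a specific company by name...
    business_type: str | None = None  # ...or query a business type in the region
    region: str = "continental_us"    # national by default, or a city/metro area

def save_job_metadata(request: JobRequest) -> str:
    """Stand-in for the ORM write that records the new job."""
    return "job-123"

def schedule_scrape_tasks(job_id: str, request: JobRequest) -> None:
    """Stand-in for the scheduler call that fans the job out into per-ZIP tasks."""
    ...

@app.post("/jobs")
def create_job(request: JobRequest) -> dict:
    # Persist job metadata, then hand the work to the scheduler, which publishes
    # per-ZIP tasks to RabbitMQ for the scraper pool.
    job_id = save_job_metadata(request)
    schedule_scrape_tasks(job_id, request)
    return {"job_id": job_id, "status": "queued"}
```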
Within the ZIP Data Management Service, a pandas-based program processes incoming job data to ensure it’s clean, accurate, and ready for export. As data is ingested, it’s filtered in a DataFrame using business tags to remove invalid entries, resolve duplicates, and validate key fields like ZIP codes and addresses. Once cleaned, the data is structured into a tabular format optimized for Excel export, ensuring users receive accurate, analysis-ready files at the time of download with no additional cleanup or object storage required.
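In outline, the cleaning pass looks something like the sketch below; the column names and tag filter are assumptions for illustration rather than our exact schema.

```python
# Illustrative pandas cleaning pass; column names and tags are assumed for this sketch.
import pandas as pd

def clean_job_data(records: list[dict], allowed_tags: set[str]) -> pd.DataFrame:
    """Filter, deduplicate, and validate scraped rows before Excel export."""
    df = pd.DataFrame(records)

    # Keep only rows whose business tag is relevant to this job.
    df = df[df["business_type"].isin(allowed_tags)]

    # Drop duplicate businesses and rows missing key fields.
    df = df.drop_duplicates(subset=["name", "address"])
    df = df.dropna(subset=["name", "address", "zip_code"])

    # Validate ZIP codes: exactly five digits.
    df = df[df["zip_code"].astype(str).str.fullmatch(r"\d{5}")]

    return df.reset_index(drop=True)

# The cleaned frame is already tabular, so export is a single call:
# clean_job_data(rows, {"logistics", "distribution"}).to_excel("job_123.xlsx", index=False)
```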
Cloud Planning and Preparation
Because we prepared our scraping infrastructure for a production-ready cloud environment, our architecture allows for practically out-of-the-box deployment within Azure Kubernetes Service (AKS), with minimal changes required from the firm's infrastructure teams. By designing a modular system, centered around containerized Puppeteer scrapers, a decoupled message queue (RabbitMQ), and independently scalable services, we laid the groundwork for a highly flexible and resilient scraping pipeline.
In the envisioned AKS deployment, incoming tasks are routed through a Kubernetes service and processed by RabbitMQ within the cluster. These tasks are then distributed to scraper pods, allowing us to take full advantage of AKS’s autoscaling and workload orchestration capabilities. Successfully scraped data is pushed through AKS Ingress to downstream services for aggregation and storage.
This cloud-native design not only supports high throughput and fault tolerance but also prepares us to handle larger data volumes, dynamic job allocation, and future expansions with minimal friction. As usage and job scope grow, we’re confident in our application’s ability to scale efficiently, monitor performance, and deploy updates seamlessly across the pipeline.
Design Process
Our designs prioritized easy, quick access to jobs and their associated data. When users enter the application, they see their recent activity, favorited jobs, and the status of currently running instances. Our user base includes both technical data analysts and the firm's sales division, so we balanced access to completed data with straightforward job creation.
Our scraper configuration page allows complete control over the scraper's scope and collection. Users can either target a specific company by name or run a query on all business types in the search region. This makes the tool useful beyond the RNG and sales teams, since it can conform to any department's needs.
We also give the user complete, fine-grained control of search scope. Our client operates its RNG services in the continental United States, so we offer a comprehensive, country-wide search in one simple setting. If they are targeting a specific region, they can get as granular as a city or metropolitan area to keep the data as relevant as possible.
Users can also customize the exported file name and set an email notification, giving each file clear ownership.
Our main focus on the history and favorites pages was letting users easily identify, distinguish, and access jobs and their associated data. We chose a table format because departments may run a large volume of jobs, letting users parse through many jobs cleanly and efficiently. Quick-access preview, favorite, and download buttons reduce search time, since users can extract or view data without clicking into each job.
An animated loading bar appears when the user downloads a job's data. This keeps the user informed while the backend generates the Excel file, since our data is converted from PostgreSQL at the time of download.
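Under the hood, that download roughly corresponds to an endpoint like the sketch below, which builds the workbook in memory and returns it; the route, SQL, and connection string are placeholders rather than our production configuration.

```python
# Illustrative on-demand Excel export; the route, query, and connection string
# are placeholders for this sketch.
import io

import pandas as pd
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from sqlalchemy import create_engine, text

app = FastAPI()
engine = create_engine("postgresql+psycopg2://user:password@localhost/scraper")

@app.get("/jobs/{job_id}/download")
def download_job(job_id: str) -> StreamingResponse:
    # Pull the job's cleaned rows straight from PostgreSQL at download time.
    df = pd.read_sql(
        text("SELECT * FROM job_results WHERE job_id = :job_id"),
        engine,
        params={"job_id": job_id},
    )

    # Write the workbook into an in-memory buffer (requires openpyxl).
    buffer = io.BytesIO()
    df.to_excel(buffer, index=False, sheet_name="Results")
    buffer.seek(0)

    return StreamingResponse(
        buffer,
        media_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        headers={"Content-Disposition": f'attachment; filename="{job_id}.xlsx"'},
    )
```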
When clicking into a data point, users can easily preview and verify the data collected from each job. We chose to mirror the Excel view our user base is accustomed to, with greater spacing and colored tags for readability.
A Blazingly Fast Impact
To meet the firm’s needs for speed, accuracy, and operational efficiency, our system was designed for blazingly fast impact. Rising SaaS costs were addressed through our optimized cloud architecture, reducing overhead without compromising performance. The firm’s critical need for accurate geospatial data is met with our custom validation logic, eliminating errors at the source and reducing manual cleanup.
Research that previously took hours — often spread across fragmented sources — is now completed in seconds using our distributed scraping and intelligent job scheduling system. Finally, data is delivered in a clean, Excel-ready format tailored to the firm’s workflow, ensuring teams can act on insights immediately. Each layer of our stack directly supports the firm’s goals: lowering cost, improving data confidence, and dramatically accelerating turnaround time.
Closing and Thanks!
The team would once again like to extend a heartfelt thanks to everyone from the firm who supported us during the development process! We learned the ins and outs of the RNG and energy market, a space few of us had prior experience or knowledge in. Everyone we interacted with was wholly supportive, which made the development process an enriching and rewarding experience.