What Sets the Kogan.com Engineering Culture Apart

At Kogan.com, engineering is about building useful software and seeing it used at scale. We work in a fast-moving e-commerce environment, so the problems are real and often complex: performance, reliability, scale, legacy constraints, new features, tight feedback loops. We ship frequently, deploy daily, and continuously improve what is already live. You can see the impact of your work quickly, and so can our customers.

Yes, there are hackathons, plenty of snacks, meetup pizzas and team events. But what defines the culture is ownership. Engineers are trusted to make decisions, go deep into systems, challenge assumptions, and drive outcomes. That might mean building a new platform, untangling and modernising legacy code, improving observability, or removing bottlenecks that affect millions of users.

Teams are pragmatic and hands-on. We care about clean code and good architecture, but we also care about delivering value. There is a strong bias toward action and continuous improvement over perfection. Collaboration is key: engineers work closely with product, design, data, and commercial teams. Context is shared openly, trade-offs are discussed honestly, and ideas are judged on merit. To bring our culture to life, we spoke to three of our engineers about the engineering culture here:

Shams Saaticho

How do ideas go from concept to production here?

Ideas come from stakeholders, marketing, product, UX or engineering. The first step is clarity. What problem are we solving? What measurable outcomes define success? What constraints or trade offs exist?

Engineers and stakeholders align on requirements and break larger initiatives into small, testable increments with clear acceptance criteria. Once scoped, work is prioritised in the backlog.

During development, changes go through peer review, automated testing and CI checks, plus user acceptance testing where needed. Deployment happens through CI/CD.

Shipping is not the finish line. We monitor production metrics and behaviour to confirm impact and guide iteration. Low-risk improvements can move from idea to production quickly, often within a day.

What does ownership mean in practice for engineers?

Ownership is responsibility for a domain built through context and accountability. It means understanding how a system works, why it was designed that way and what trade offs shaped it.

Engineers are expected to recommend approaches within their domain, propose refactors, and flag decisions that introduce long term cost. Ownership is proactive, not reactive.

It does not mean working in isolation. In a large system, changes often cross services, so knowledge is shared and decisions are communicated clearly.

Ownership ultimately means end to end accountability. What is delivered must be reliable, secure, maintainable and scalable, with consideration for how future engineers will extend it.

Ivan Van

How do teams collaborate across Product, Design, Data, and the wider business?

At the heart of it all, we want to build the right product. That means involving the right people in the conversations that shape it. We start by clearly understanding the problem we’re trying to solve. From there, we work closely with tech leadership, product owners, and SMEs, keeping them informed throughout delivery. Once the work is done, we share the outcome with the relevant stakeholders and iterate as we learn how it performs in the real world.

What’s a recent example of the team improving a process or system?

There’s always an idea floating around somewhere here at Kogan. Whether it comes from a lunch conversation, someone sharing a cool article in our eng-tech channel, or a quick mention by the water cooler, ideas for improving things are always in the air. Right now, we’re evaluating git worktrees so developers can have multiple agents working concurrently, making better use of time for people who use agents in their daily development workflow. We also have a hack day coming up, where everyone has the opportunity to deliver quality-of-life and product improvements.

Michael Lisitsa

What does “good engineering” look like at Kogan.com?

We ship work that creates measurable impact for customers and internal stakeholders, not just completed tickets. Decisions are grounded in data and validated in production to ensure solutions perform as expected under real conditions and scale.

We treat code as a long term asset. That means actively reviewing legacy systems, reducing technical debt and modernising where it makes sense. Quality, maintainability and performance are considered upfront, not retrofitted later.

Engineers are expected to contribute across their team’s board and collaborate across teams when required. They are empowered to own the full life cycle of their features, growing their capabilities in front end, back end and infrastructure.

How do teams balance speed with building things sustainably?

Work is broken into small tickets so value reaches customers quickly and changes remain easy to review, test and roll back if needed.

We leverage AI to improve speed and lift code quality, but engineers remain accountable for every decision. It is a productivity tool, not a replacement for judgment.

We apply the Boy Scout Rule consistently. Code is left better than it was found, with small improvements made along the way to patterns, performance and observability instead of deferring cleanup. Several times a year, Ship Frenzy creates space to clear smaller tasks that may not surface through standard prioritisation.

Planning is done just in time. We avoid deep discovery work too early, reducing wasted effort and keeping teams focused on what is ready to be built.

Improving Frontend Regression Testing with Chromatic

After recently migrating our frontend to Remix, we took the opportunity to reassess how we approach frontend testing, particularly regression testing. While we already had unit test coverage, we identified a gap when it came to validating UI changes. This is where Chromatic became a part of our frontend testing strategy. This post outlines why we introduced Chromatic and how it fits into a Remix-based workflow.

Even when application functionality remains unchanged, subtle visual regressions can still be introduced. Changes to spacing, typography, layout, or component states can easily slip through without being caught by traditional tests.

What we needed was a way to automatically detect meaningful UI changes while still fitting into our existing development workflow. At the same time, it was important to avoid introducing a fragile or high-maintenance testing setup, one that adds overhead without delivering proportional benefit. Our implementation with Chromatic attempts to balance automation, reliability, and developer experience as a practical addition rather than an extra burden.

Why Chromatic?

Chromatic provides visual regression testing on top of Storybook. Instead of testing components purely through assertions, Chromatic renders components in a real browser environment and captures screenshots. These are then compared against a known baseline to highlight visual changes.

The key reasons we chose Chromatic were:

  • Automated visual diffs that are easy to review
  • Integration with Storybook, which we already use for component development
  • CI-friendly workflow that fits well into pull requests

Chromatic offers two closely related features for reviewing UI changes: UI Review and Visual Tests. While they overlap in functionality, they serve different purposes and are designed for different levels of enforcement. UI Review is enabled by default, whereas Visual Tests are an optional feature.

UI Review generates snapshots that highlight differences against a baseline, with a focus on collaboration and feedback. A key characteristic of UI Review is its flexibility: anyone, including the author, can approve the changes. Approval is also persistent: once a UI Review is approved, further commits to the same branch do not require re-approval.

Visual Tests also generate snapshots and compare them against an approved baseline, but they treat those comparisons as authoritative. Visual Tests can be configured to run across multiple browsers or viewports. Unlike UI Review, any subsequent changes committed to a branch will require re-approval. Approval permissions can also be restricted to specific roles. Visual Tests incur additional cost, and testing across multiple browsers or viewports increases the number of snapshots generated.

Storybook as the Foundation

Chromatic works best when components are well-represented in Storybook. As part of our migration to Remix, we invested time in creating stories for our components, which included:

  • Defining common UI states
  • Mocking Remix loaders and actions where needed

Chromatic simply builds on top of this foundation by continuously validating those stories.

Automatic API Mocking with OpenAPI

To keep our Storybook components realistic without manual effort, we built an automated pipeline that generates stories from our OpenAPI schema.

We use swagger auto schema to document what each API endpoint returns, including example responses. A script then parses this backend schema alongside the Remix route definitions to generate complete stories, including MSW handlers with example responses that exactly match the real API contracts. Handlers are created for the full route hierarchy needed for a component to render.

This means Storybook stories always stay in sync with the backend, and there is only a single source of truth to maintain: the OpenAPI schema.

We use the following to parse a component’s source code and find imports from the API client:

import { readFileSync } from 'node:fs'
// `parse` comes from our TypeScript parser (e.g. @typescript-eslint/typescript-estree),
// and `operations` maps API client method names to OpenAPI operations.

const inspectModule = (filename: string) => {
  const code = readFileSync(`./app/${filename}`, 'utf-8')
  const ast = parse(code, { jsx: true })
  // Collect the OpenAPI operations for every symbol imported from the API client
  const requiredMocks = ast.body
    .filter(
      (node) =>
        node.type === 'ImportDeclaration' &&
        ['~/api/Api'].includes(node.source.value)
    )
    .map((node) =>
      (node as unknown as { specifiers: [{ imported: { name: string } }] }).specifiers.map(
        (specifier) => operations[specifier.imported.name]
      )
    )
    .reduce((acc, val) => [...acc, ...val], [])
    .filter(Boolean)

  return requiredMocks
}

We then use the following to build the example responses for the mocks:

const buildExample = (schema: OpenAPIV3.SchemaObject): Example | undefined => {
  // Recursively build an example from the examples given in the schema.
  if (typeof schema.example !== 'undefined') return schema.example as Example
  if (schema.type === 'array') {
    const example = buildExample(schema.items as OpenAPIV3.SchemaObject)
    if (typeof example === 'object' && example !== null && !Object.keys(example).length) return []
    return example ? [example] : []
  }
  if (schema.type === 'object') {
    return Object.entries(schema.properties ?? {}).reduce((result, [key, value]) => {
      return { ...result, [key]: buildExample(value as OpenAPIV3.SchemaObject) }
    }, {} as Example)
  }
  return undefined
}

How Chromatic Fits into the Workflow

Once set up, the Chromatic workflow is straightforward:

  1. A developer opens a pull request
  2. CI builds the Storybook and uploads it to Chromatic
  3. Chromatic runs visual comparisons against the baseline
  4. Any detected UI changes are surfaced directly in the PR
  5. From there, the before-and-after snapshots, with the changes highlighted, can be viewed in Storybook
  6. The changes then need to be approved in Chromatic, which makes visual review explicit and intentional

Lessons Learned So Far

  • Good Storybook coverage is essential. Chromatic relies on developers creating and maintaining stories for newly developed components, and Storybook itself needs ongoing care to remain a useful and accurate representation of the UI.
  • Entire pages can be snapshot tested as a single component, but doing so requires a fair amount of mocking (which we have automated)
  • Baseline discipline is important and accepting visual changes should be a deliberate action.
  • Chromatic is most effective when treated as part of a broader testing strategy, rather than a silver bullet.
  • Finally, how restrictive or unobtrusive Chromatic feels in day-to-day development depends largely on how it is configured. Decisions such as whether to enable Visual Tests, and how approvals are gated all influence both the cost of the tool and its impact on developer workflow.

Final Thoughts

As our Remix application continues to evolve, Chromatic gives us confidence that UI changes are intentional and understood, without relying solely on manual checks.

Migrating to Remix was a natural point to rethink how we test our frontend. By adding Chromatic to our toolchain, we’re reducing an important gap in regression testing, one that traditional tests struggle to cover effectively. Visual regression testing doesn’t remove the need for thoughtful development or careful review, but it does make those processes more reliable and scalable.

Patterns & Best Practices in Event-Driven Systems

Designing Robust, Scalable, Maintainable Event Architectures

Event-driven architecture (EDA) gives teams the ability to build decoupled, scalable systems that evolve independently. In the previous article, we introduced the idea using a restaurant analogy: instead of shouting instructions across the kitchen, teams place “dockets” on the rail and stations take what they need.

We’ll continue that analogy lightly in this post—sprinkling it here and there—while focusing on the engineering patterns that make event-driven systems work in practice.

Core Patterns in Event-Driven Architecture

Pattern 1: Event Notification

An event notification is a tiny message that simply declares “something happened.”

It doesn’t contain all the details—just enough for downstream systems to react. Think of it like a kitchen bell dinging: the bell doesn’t contain the meal, it’s just a signal. The cook still needs to check the ticket rail (the database) for the details of order 12345.

Example:

{
  "eventName": "OrderCreated",
  "orderId": 12345,
  "createdAt": "2025-11-26T01:00:00Z"
}
  • Why it’s useful
    • Extremely lightweight
    • Easy to publish, easy to fan out
    • Consumers decide how much extra data they need
  • Trade-offs
    • Consumers must fetch details themselves
    • More cross-service calls → more coupling
    • Higher latency when many consumers query upstream systems

Use this pattern when the event is a simple trigger—like a bell, not a full meal.
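To make the trade-off concrete, here is a minimal Python sketch of a consumer reacting to a notification. The `ORDER_DETAILS` lookup stands in for the upstream database or API call; all names and data here are illustrative, not from a real system.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the upstream order store; in production this
# would be a database query or a cross-service API call.
ORDER_DETAILS = {
    12345: {"items": [{"sku": "ABC123", "qty": 2}], "status": "created"},
}

@dataclass
class OrderCreated:
    """Thin event notification: just enough to react, no payload."""
    order_id: int
    created_at: str

def handle_order_created(event: OrderCreated) -> dict:
    # The notification is only a signal; the consumer must fetch the full
    # details itself. This extra call is the coupling cost of the pattern.
    details = ORDER_DETAILS[event.order_id]
    return {"order_id": event.order_id, **details}
```

The extra fetch is exactly what the trade-offs above describe: every consumer that needs detail adds a call back to the upstream system.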

Pattern 2: Event-Carried State Transfer (ECST)

In Event-Carried State Transfer, the event carries all required data so consumers don’t need to make additional calls.

It’s the equivalent of the chef not only ringing the bell but also placing the complete plated dish on the pass. No one needs to ask questions—everything needed is right there.

{
  "eventName": "OrderPacked",
  "orderId": 12345,
  "items": [
    { "sku": "ABC123", "qty": 2 }
  ],
  "warehouseId": 19,
  "totalWeightGrams": 1850
}
  • Why it’s powerful
    • Zero need for back-calls → full decoupling
    • Highly resilient—consumers can process events even if upstream is down
    • Faster pipelines, fewer moving parts
  • Trade-offs
    • Larger event payloads → more bandwidth/storage
    • More careful schema management
    • Potential latency/throughput impact in high-volume streams

You’re essentially pre-plating the data, which costs more effort upfront, but saves everyone time downstream.
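As a sketch of the contrast with Pattern 1, a consumer of the `OrderPacked` event above can do its work entirely from the payload, with no back-call. The flat-fee-plus-per-kilogram shipping rate is invented purely for illustration.

```python
def handle_order_packed(event: dict) -> dict:
    """Consume an ECST event using only the payload: no upstream calls.

    The rate (500 cents flat + 100 cents/kg) is a hypothetical example.
    """
    weight_kg = event["totalWeightGrams"] / 1000
    shipping_cents = 500 + round(weight_kg * 100)
    return {"orderId": event["orderId"], "shippingCents": shipping_cents}
```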

Pattern 3: Event Sourcing

Instead of storing only the current state, Event Sourcing stores every change as an immutable event.

State is rebuilt by replaying the events.

Just like a kitchen’s order history tells the complete story of what happened throughout service, event sourcing gives you a full timeline of every change.

Example (C# Aggregate Rehydration)

var events = eventStore.LoadStream("Order-12345");
var order = OrderAggregate.Rehydrate(events);
  • Why it’s valuable
    • Perfect audit trail
    • Time-travel debugging
    • Ability to replay events for recovery or analytics
  • Trade-offs
    • Higher cognitive load for newcomers
    • Requires rigorous versioning
    • Requires maintaining projections/read models
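The C# rehydration above can be sketched end to end in a few lines of Python. The event names and fields below are illustrative, not a real schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class OrderAggregate:
    """Current state is rebuilt purely by replaying the immutable event log."""
    order_id: Optional[int] = None
    items: List[str] = field(default_factory=list)
    paid: bool = False

    def apply(self, event: dict) -> None:
        # Each event advances the state; the events themselves never change.
        kind = event["eventName"]
        if kind == "OrderCreated":
            self.order_id = event["orderId"]
        elif kind == "ItemAdded":
            self.items.append(event["sku"])
        elif kind == "OrderPaid":
            self.paid = True

    @classmethod
    def rehydrate(cls, events) -> "OrderAggregate":
        agg = cls()
        for event in events:
            agg.apply(event)
        return agg
```

Replaying the same stream always yields the same state, which is what makes recovery and time-travel debugging possible.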

CQRS Note

Event Sourcing often pairs with CQRS—splitting commands (writes) from queries (reads).

It’s like chefs cooking in the kitchen while waitstaff maintain menus, tables, and customer-facing views.

Each side does what it’s optimized for.

Pattern 4: Choreography (Decentralised Workflow)

With choreography, services react to each other’s events without a central coordinator.

It’s like a well-trained kitchen crew: when the grill station finishes cooking a steak, the garnish station knows it’s their turn—without anyone shouting instructions.

  • Benefits
    • Fully decoupled
    • Naturally scalable
    • Easy for new services to join by subscribing
  • Drawbacks
    • Harder to visualize the full workflow
    • Risk of event spaghetti
    • Difficult to enforce global ordering or handle cross-service failures

Great for simple flows where each station knows what to do next.
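A toy in-memory event bus makes the idea concrete: each "station" subscribes to the event that concerns it, and no central coordinator exists. This is a deliberately minimal sketch, not a production message broker.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-memory pub/sub to illustrate choreography."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._subscribers[event_name].append(handler)

    def publish(self, event_name, payload):
        for handler in self._subscribers[event_name]:
            handler(payload)

bus = EventBus()
plated = []

# The grill station finishing triggers the garnish station; nobody shouts
# instructions, each station just reacts to the previous station's event.
bus.subscribe("SteakCooked", lambda order_id: bus.publish("ReadyToGarnish", order_id))
bus.subscribe("ReadyToGarnish", lambda order_id: plated.append(order_id))
```

The "event spaghetti" drawback is also visible here: the full workflow only emerges from reading every subscription.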

Pattern 5: Orchestration (Service Composer / Workflow Engine)

Orchestration introduces a conductor—a central service that coordinates each step of the workflow.

Think of it like a head chef calling out the steps during a complex dish:

“Start the sauce.”

“Grill the chicken.”

“Plate it.”

The orchestration engine takes responsibility for the ordering and coordination.

public class DispatchOrchestrator 
{
    public async Task Handle(OrderPaid evt)
        => await Send(new ReserveStock(evt.OrderId));

    public async Task Handle(StockReserved evt)
        => await Send(new BookShipment(evt.OrderId));

    public async Task Handle(ShipmentBooked evt)
        => await Send(new MarkOrderReady(evt.OrderId));
}
  • When orchestration is ideal
    • Multi-step workflows
    • Processes requiring retries and compensation
    • Compliance requirements → clear traceability

Choreography scales. Orchestration brings order to complexity. Many systems end up using both.

Best Practices for Event-Driven Systems

Idempotency Everywhere

Events may be delivered more than once.

Consumers must behave safely even if they “see the same order twice.”

Just like the kitchen must avoid making the same dish twice if the order docket is accidentally duplicated.

if (db.HasProcessed(evt.Id)) return;
Process(evt);
db.MarkProcessed(evt.Id);

In high-throughput, distributed systems, rely on unique constraints on the event ID (or a combination key) in the MarkProcessed step. This guarantees atomicity and prevents race conditions if two consumers attempt to process the event simultaneously.
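A minimal Python sketch of the unique-constraint approach, using SQLite for illustration: the primary key on the event ID makes the check-and-mark atomic, and the handler runs inside the same transaction, so a handler failure rolls the mark back.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

def process_once(event_id: str, handler) -> bool:
    """Run handler only if this event ID has never been recorded.

    The PRIMARY KEY makes the insert fail on a duplicate delivery, so two
    consumers racing on the same event cannot both process it.
    """
    try:
        with db:  # commit on success, roll back (including the mark) on error
            db.execute("INSERT INTO processed_events VALUES (?)", (event_id,))
            handler()
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: safely ignored
```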

Durable, Replayable Streams

Use platforms that retain events reliably:

  • Kafka
  • AWS EventBridge + SQS
  • Pulsar
  • EventStoreDB

Replay is the equivalent of reviewing the order history after service to understand what happened.

Explicit Event Versioning

Events evolve as the business evolves.

Always version your events.

{
  "eventName": "OrderCreated",
  "version": 3,
  "orderId": 12345
}

This is like updating the recipe book—you need to know which version was used.
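Versioned events also need somewhere to be upgraded. A common approach is an "upcaster" that migrates older versions to the latest schema as events are read; the specific field changes below (`customer` renamed to `customerId`, a `currency` default added) are hypothetical.

```python
def upcast(event: dict) -> dict:
    """Upgrade older event versions to the latest schema (v3 here).

    Each step handles one version bump, so old events flow through every
    migration they are missing. Field names are illustrative.
    """
    event = dict(event)  # never mutate the stored event
    if event.get("version", 1) == 1:
        # v1 -> v2: rename 'customer' to 'customerId' (hypothetical change)
        event["customerId"] = event.pop("customer", None)
        event["version"] = 2
    if event["version"] == 2:
        # v2 -> v3: add a currency field with a safe default
        event.setdefault("currency", "AUD")
        event["version"] = 3
    return event
```

Consumers then only ever handle the latest shape, while the event store keeps every version it was originally written with.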

Event Contract Management (Schema Evolution)

Managing the schema itself is a real operational challenge.

Common solutions

  • Schema Registry (Confluent, AWS Glue)
  • Avro / Protobuf with compatibility modes
  • Automated consumer-driven contract tests

Just as a restaurant must keep recipes and menus consistent across teams, event schemas must stay compatible across services.

Domain-Driven Event Naming

Good events describe meaningful business events—not technical state changes.

✔ OrderPaid

✔ ShipmentDispatched

✔ StockShortageDetected

These read like “kitchen tickets”—instantly meaningful across teams.

Correlation IDs

Attach a correlation ID that follows the event across the system.

It’s your equivalent of an order number in a busy restaurant—the thing that ties together all actions associated with a single request.

x-correlation-id: d387f799e001-4a12-a3f1

Why Correlation IDs are Essential

In a decoupled EDA, the logical flow of a single business request is spread across multiple services, message queues, and logs. Without a correlation ID, this flow is almost impossible to trace.

  1. Distributed Debugging: If a customer reports a failure on order 12345, you can search your centralised logging system (like Splunk or ELK) using the correlation ID and instantly retrieve every log line, from every service, that contributed to that order's fulfillment.
  2. Request Tracing: They are the backbone of Application Performance Monitoring (APM) tools, which visualize the end-to-end path, latency, and dependencies of a request across your entire system.
  3. Cross-System Auditing: They provide the non-repudiable link between an incoming API call and the final persistent action (e.g., database write or shipment creation), fulfilling compliance needs.

A system without correlation IDs is a black box. They are the single most important tool for turning a distributed system into something observable and debuggable.
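In Python, a `contextvars.ContextVar` is one lightweight way to carry the correlation ID through a request without threading it through every function signature. This is a sketch of the idea, not a full logging setup:

```python
import contextvars
import uuid

# Carries the correlation ID across function calls (and async tasks)
# without passing it as an explicit argument everywhere.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

def log(service: str, message: str) -> str:
    # Every log line carries the same ID, so one search reconstructs the flow.
    return f"[{correlation_id.get()}] {service}: {message}"

def handle_request(order_id: int):
    correlation_id.set(str(uuid.uuid4()))  # assigned once at the entry point
    return [
        log("api", f"received order {order_id}"),
        log("stock", "stock reserved"),
        log("shipping", "shipment booked"),
    ]
```

In a real system the same ID would be written into message headers (as in the `x-correlation-id` example above) so downstream services can keep propagating it.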

Conclusion

Event-driven architecture unlocks scalability, resilience, and autonomy across teams. By understanding patterns like event notification, ECST, event sourcing, choreography, and orchestration, you can match your workflow’s needs to the right design.

The kitchen analogy, used lightly throughout, highlights what makes EDA so powerful: each station works independently, yet the whole system flows smoothly.

Combined with strong practices—idempotency, schema governance, replay, and correlation—these patterns help systems evolve with confidence even under rapid growth.

Moving from Django DRF to Ninja API / Pydantic

As our project grows, we're always looking for ways to streamline development, improve performance, and enhance the developer experience. Recently, we've been exploring a shift from our traditional Django REST Framework (DRF) API patterns to a combination of Django Ninja API and Pydantic. This blog post will delve into our motivations for this change, the benefits we've observed, and some considerations for others contemplating a similar transition.

Why Consider a Change from Django DRF?

Django REST Framework has been a robust and widely adopted solution for building APIs with Django. It provides a comprehensive set of tools, including serializers, viewsets, and excellent browser-based API interfaces. However, as our needs evolved, we identified areas where a different approach could offer advantages:

  • Boilerplate Code: While DRF offers powerful abstractions, creating serializers, views, and viewsets can sometimes lead to a significant amount of boilerplate code, especially for simpler APIs.

  • Performance: For certain use cases, the overhead of DRF's serializer validation and rendering can impact performance, particularly in high-throughput scenarios.

  • Modern Python Features: We were keen to leverage modern Python features like type hints and data validation more extensively, which are core to Pydantic.

  • Developer Experience: A more concise and explicit way to define API endpoints and data structures could improve developer productivity and reduce potential errors.

Introducing Django Ninja API and Pydantic

Django Ninja API

Django Ninja is a web framework for building APIs with Django and Python 3.6+ type hints. It's heavily inspired by FastAPI and offers a number of compelling features:

  • Type Hinting for API Endpoints: You define your request and response models using Pydantic, and Ninja automatically validates and serializes the data based on these type hints.

  • Automatic OpenAPI (Swagger) Documentation: Just like FastAPI, Ninja generates interactive API documentation out of the box, making it easy to explore and test your API.

  • Fast Performance: Ninja is designed for speed, with minimal overhead and efficient request/response handling.

  • Simplified View Definitions: API endpoints are defined as simple Python functions, reducing the complexity often associated with DRF viewsets.

Pydantic

Pydantic is a data validation and settings management library using Python type hints. It's incredibly powerful for:

  • Data Validation: Automatically validates data against defined schemas, ensuring data integrity and catching errors early.

  • Serialization/Deserialization: Easily converts Python objects to and from JSON (or other formats) based on the defined models.

  • Runtime Type Checking: While Python's type hints are typically for static analysis, Pydantic brings runtime type checking to your data.


The Combo: Ninja API and Pydantic in Action

The synergy between Django Ninja API and Pydantic is where the real magic happens. Pydantic models define the structure and validation rules for both incoming request data and outgoing response data. Django Ninja then uses these Pydantic models to automatically:

  1. Validate Incoming Request Body: Any data sent to your API endpoint is automatically validated against the specified Pydantic model. If the data doesn't conform, a clear validation error is returned.

  2. Serialize Outgoing Response Data: When you return data from your API endpoint, Ninja uses the response Pydantic model to serialize it into the appropriate format (e.g., JSON).

  3. Generate OpenAPI Documentation: The Pydantic models directly contribute to the rich and accurate OpenAPI documentation, describing the expected request body and the structure of the responses.
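Under the hood, the validation and error reporting come straight from Pydantic. A small standalone example (plain Pydantic, no Django required) shows what happens to a conforming and a non-conforming payload; Ninja turns the `ValidationError` into an HTTP 422 response for you.

```python
from pydantic import BaseModel, ValidationError

class ProductIn(BaseModel):
    name: str
    price: float

# A conforming payload: the numeric string is coerced to a float.
product = ProductIn(name="Mug", price="9.95")

# A non-conforming payload produces a structured error report,
# which Ninja would return to the client as a 422 response.
try:
    ProductIn(name="Mug", price="not-a-number")
    errors = []
except ValidationError as exc:
    errors = exc.errors()  # each entry names the offending field
```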


Migrating from Django DRF to Django Ninja: A Developer's Guide

For years, Django REST Framework (DRF) has been the go-to for building robust APIs in Django. It's a powerful, feature-rich library with a massive community and a well-established pattern of Serializers, ViewSets, and Routers. But a new contender has emerged, inspired by the speed and simplicity of FastAPI: Django Ninja. If you're considering a switch, you're not alone. This guide will walk you through the key differences and how to make the move, leveraging the power of Pydantic.

Why Make the Switch?

DRF is a fantastic tool, but it can be verbose. The typical workflow often involves creating a Serializer class for data validation and serialization, a ViewSet for handling CRUD logic, and then a Router to generate the URLs. This can lead to a lot of boilerplate code, even for simple endpoints.

Django Ninja, on the other hand, is built on Pydantic and Python type hints. This modern approach offers several compelling benefits:

  • Less Boilerplate: You define your API endpoints as simple functions, using type hints for request and response data. Pydantic handles the heavy lifting of validation and serialization, drastically reducing the amount of code you need to write.

  • Automatic Documentation: Just like FastAPI, Django Ninja automatically generates interactive OpenAPI documentation (Swagger UI and ReDoc) from your type-hinted code. This means no more manual documentation or separate packages.

  • Intuitive & Explicit: The code is highly readable and explicit. Instead of relying on ViewSet magic, you define each endpoint with a simple decorator (@api.get, @api.post, etc.).

The Core Shift: From Serializers to Schemas

This is the most significant change when moving from DRF to Ninja.

In DRF, a Serializer handles both input validation and output serialization.

from rest_framework import serializers
from .models import Product

# DRF Serializer
class ProductSerializer(serializers.ModelSerializer):
    class Meta:
        model = Product
        fields = ['id', 'name', 'price']


# DRF ViewSet
from rest_framework import viewsets

class ProductViewSet(viewsets.ModelViewSet):
    queryset = Product.objects.all()
    serializer_class = ProductSerializer

In Django Ninja, you use Pydantic Schemas for validation and a separate ModelSchema for serializing Django models.

# Django Ninja Schemas
from ninja import Schema, ModelSchema
from .models import Product

# For request payload validation (input)
class ProductIn(Schema):
    name: str
    price: float

# For response data (output)
class ProductOut(ModelSchema):
    class Config:
        model = Product
        model_fields = ['id', 'name', 'price']

The ModelSchema is particularly powerful as it automatically creates a Pydantic schema based on your Django model, handling the conversion for you. This means you can often return a Django QuerySet or model instance directly, and Ninja will use the response schema you defined to handle the serialization.

A Practical Example: Migrating a CRUD Endpoint

Let's imagine a simple API for a Product model.

DRF Pattern (the old way):

Model (models.py):

from django.db import models

class Product(models.Model):
    name = models.CharField(max_length=255)
    price = models.DecimalField(max_digits=10, decimal_places=2)

Serializer (serializers.py):

from rest_framework import serializers
from .models import Product

class ProductSerializer(serializers.ModelSerializer):
    class Meta:
        model = Product
        fields = '__all__'

ViewSet (views.py):

from rest_framework import viewsets
from .models import Product
from .serializers import ProductSerializer

class ProductViewSet(viewsets.ModelViewSet):
    queryset = Product.objects.all()
    serializer_class = ProductSerializer

URLs (urls.py):

from django.urls import path, include
from rest_framework.routers import DefaultRouter
from .views import ProductViewSet

router = DefaultRouter()
router.register('products', ProductViewSet)

urlpatterns = [
    path('', include(router.urls)),
]

This setup automatically generates all your CRUD endpoints, which is a key advantage of DRF, but also abstracts away a lot of the implementation.


Django Ninja Pattern (the new way):

Model (models.py): Remains the same.

API Logic (api.py): This is where everything happens.

from ninja import NinjaAPI, ModelSchema, Schema
from typing import List
from .models import Product

api = NinjaAPI()

class ProductIn(Schema):
    name: str
    price: float

class ProductOut(ModelSchema):
    class Config:
        model = Product
        model_fields = ['id', 'name', 'price']

@api.post("/products", response=ProductOut)
def create_product(request, payload: ProductIn):
    product = Product.objects.create(**payload.dict())
    return product

@api.get("/products", response=List[ProductOut])
def list_products(request):
    return Product.objects.all()

@api.get("/products/{product_id}", response=ProductOut)
def get_product(request, product_id: int):
    return Product.objects.get(id=product_id)

@api.put("/products/{product_id}", response=ProductOut)
def update_product(request, product_id: int, payload: ProductIn):
    product = Product.objects.get(id=product_id)
    for attr, value in payload.dict(exclude_unset=True).items():
        setattr(product, attr, value)
    product.save()
    return product

@api.delete("/products/{product_id}")
def delete_product(request, product_id: int):
    product = Product.objects.get(id=product_id)
    product.delete()
    return {"success": True}

URLs (urls.py):

from django.urls import path
from .api import api

urlpatterns = [
    path("api/", api.urls),
]

This setup is more manual, but the logic is right there in the function. You can clearly see the input (payload: ProductIn) and the expected output (response=ProductOut), making the code self-documenting.


Benefits We've Experienced

Since adopting this combo, we've observed several significant improvements:

  • Reduced Boilerplate: We can define a complete API endpoint with input validation, output serialization, and automatic documentation in a much more concise way.

  • Simple API implementation: At a glance, an endpoint reads much like a FastAPI implementation. We want to keep things simple: define the endpoint, its expected query parameters and body, and the response it returns. For any single endpoint, nothing hides what it expects as input or what it sends back.

  • Less magic / hidden operations: DRF is very powerful, but like every framework it hides certain things that developers might not be aware of, often catching them off guard when tracing behaviour. Pydantic makes the data flow explicit.

  • Enhanced Type Safety: Leveraging type hints with Pydantic has made our API code more robust and less prone to common data-related errors, and it has made our codebase easier to understand and maintain. Type errors surface during static checks and schema validation rather than deep inside business logic, reducing the chance of runtime failures.

  • Better Developer Experience: The automatic OpenAPI documentation and the explicit nature of Pydantic models have made it easier for developers to build and consume our APIs.

  • Better code encapsulation: As a codebase grows, coupling your models to every layer becomes hard to maintain (think about modifying a model, or putting a cache layer in front of a large one). With the Ninja/Pydantic approach, the model does not have to be exposed to the API directly. You can also adopt a service-based pattern to separate the API implementation from the actual business logic, which keeps things much cleaner.
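
The type-safety point is easy to demonstrate outside of Django entirely. Here's a standalone sketch of the "schema in, schema out" idea using Pydantic directly; ProductIn mirrors the schema from api.py above:

```python
# Pydantic validates and coerces input before it ever reaches business logic.
from pydantic import BaseModel, ValidationError

class ProductIn(BaseModel):
    name: str
    price: float

# Valid payloads are parsed into typed objects (note the string being coerced).
payload = ProductIn(name="Widget", price="9.99")
assert payload.price == 9.99

# Invalid payloads fail loudly, with a structured error, at the boundary.
try:
    ProductIn(name="Widget")  # price missing
except ValidationError:
    print("rejected: price is required")
```

In a Ninja endpoint this validation happens automatically before your function body runs, so the handler only ever sees well-formed data.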

Considerations for Transition

While the benefits are clear for us, a transition isn't without its considerations:

  • Learning Curve: Developers familiar with DRF's serializer-heavy approach will need to adapt to Pydantic's data modeling.

  • Existing Codebase: Migrating an existing, large DRF codebase to Ninja/Pydantic requires a thoughtful strategy, potentially taking an iterative approach.

  • Ecosystem Maturity: While both Django Ninja and Pydantic are mature and widely used, DRF has a larger and more established ecosystem of plugins and community support.


Conclusion

The move from Django DRF API patterns to the Django Ninja API and Pydantic combo has been a positive step for our team. It has allowed us to build more performant, maintainable, and type-safe APIs with a better developer experience. While DRF remains a powerful tool, for our current and future needs, the elegance and efficiency of Ninja and Pydantic are proving to be a winning combination.

We encourage other Django developers to explore this powerful duo, especially if you're looking to modernize your API development workflow and leverage the full potential of Python type hints.

Four Years Strong: Celebrating Our Koganniversaries

In a talent landscape full of competitive opportunities, where change and turnover are part of the norm, Kogan.com stands out as a place where people choose to stay, grow, and advance their careers. Many of our team members have long tenures, with some contributing for as long as 12 or 15 years, reflecting the strong culture and opportunities here.

This month, we celebrated three team members reaching their four-year milestone. To mark the occasion, we spoke with them about what has made their journey so rewarding, what has kept them at Kogan.com, and what continues to inspire and excite them as part of our team.

Adam Slomoi

Adam is currently the Tech Lead of a squad. He joined Kogan as a Software Engineer and quickly progressed to Senior Software Engineer, and most recently to Tech Lead. In this role, he continues to sharpen his technical skills while growing his passion for people leadership.

How has your role or perspective on engineering evolved over the last four years?
I’ve been fortunate to work across different areas of the business and on various components of our system during my time at Kogan. One theme that has remained consistent throughout, and that I’ve gained a greater appreciation for, is the focus on business outcomes.

How did you feel stepping into your first leadership role, and what did you learn from the experience?
I felt well prepared before officially taking on a leadership role. We have a very collaborative team, and there have been many opportunities along the way to have a say and help set the direction for the team.

Sam O’Halloran

Sam started at Kogan in his first formal software engineering role straight out of university. He has honed his technical skills, contributed significantly to his team, and grown in both technical depth and breadth, applying best practices efficiently.

Which project are you most proud of, and what made it exciting or unique?
I’m most proud of the work on product variants. It wasn’t one big launch, just lots of smaller fixes that added up. I tightened grouping logic in pipelines and batch jobs, cleaned up SPS edge cases, and ensured changes flowed through to OpenSearch. On the UI side, I tweaked filters, added a simple horizontal selector for single-variant products, enabled a promo palette, fixed dropdown ordering, and added a VariantGroup sitemap for better SEO. It was satisfying because variants are messy in the real world, and these changes made choosing the right option faster and clearer for customers, and saner for our internal teams.

Can you share a moment where your work made a noticeable impact for the team or the product?
I led a performance pass on our Product Listing Page. I cut work above the fold, switched tiles to responsive images, added light skeletons, virtualised the brand filter with React Virtuoso, lazy-rendered heavier lists, and fixed scroll and CLS issues. The page now loads faster and filtering feels smoother, which reduced the performance issues we were tracking on that screen.

I also flagged follow-ups for our Product List endpoint to make it simpler, faster, and more reliable.

Yanxu Zheng

Yanxu is a highly skilled Senior Software Engineer who has developed deep technical expertise. He was recently promoted to a tech lead role and is now responsible for leading a large squad.

What’s one of the most interesting technical challenges you’ve solved during your time here, and how did you approach it?
Our keyword-based search sometimes failed to grasp user intent, leading to irrelevant results. I tackled this by leading an experiment to test whether a hybrid search model combining keyword and semantic matching could perform better. To avoid any risk to live services, I ran the experiment on a parallel cluster that mirrored production data. We then conducted an A/B test comparing the new hybrid search against the existing keyword search.

The results were clear: while the new model showed a modest improvement in relevance for some queries, it produced no statistically significant lift in user conversion. Based on the data, we decided not to roll out the feature, as the additional infrastructure cost wasn’t justified by the lack of business impact. My key takeaway was that a technical improvement is only valuable if it moves a core business metric. This experience reinforced my approach of always tying engineering efforts to measurable business outcomes.

Is there a problem you tackled that taught you a new skill or changed how you think about engineering?
I once fixed major performance issues in our OpenSearch cluster by challenging the official best practices. The standard advice was to use 10–50GB shards, but our search latency was poor. I ran experiments and proved that for our specific workload, smaller 5–10GB shards were far more efficient. Switching to smaller shards cut our query latency by over 20%. The experience taught me to always validate standard guidelines with real-world data, as optimal solutions are context-dependent.

Building Your Own AI Agent

The DEBI (Data Engineering and Business Intelligence) team recently attended the DataEngBytes 2025 conference, where the hot topic for the year was, unsurprisingly, AI agents. My favorite talk, by Geoffrey Huntley, presented a powerful and surprisingly simple idea: It’s not that hard to build an agent; it’s a few hundred lines of mainly boilerplate code running in a loop with LLM tokens. That’s all it is!

Kogan DEBI team at the DataEngBytes conference 2025

The speaker’s main point was that things were developing extremely rapidly in the AI space, but rather than worrying about how AI might take engineering jobs in the near future, we should become AI producers, leveraging agentic AI to automate things, from data pipelines to our own job functions. Understanding this is, in his words, "perhaps some of the best personal development you can do this year."

This idea is both liberating and empowering. It transforms the conversation from one of anxiety about job security to one of excitement about a new, fundamental skill. Let's look under the hood at how these agents work and understand the simple primitives that allow us to become producers of automation, not just consumers.

The Fundamentals: The Shift from a Tool to a System

Before you write any code, you need to understand the new paradigm. We’re moving beyond just using Generative AI (Gen AI) as a tool and are now using it to build a complete system: an AI Agent.

Generative AI (Gen AI): The Creator: This is the broad category of AI models that are designed to create new content. LLMs are the most common form of this. They are reactive; you give them a prompt, and they generate a response—be it text, code, or an image. Gen AI is the creative engine.

Agentic AI: The Doer: This is a type of AI system that is designed to act with autonomy. You give it a high-level goal, and it uses its "brain" (a Gen AI model) to reason, plan, and execute actions to achieve that goal. This is the proactive part of AI. The speaker referred to these as "digital squirrels" because they are biased toward action and tool-calling, focusing on incrementally achieving a goal rather than spending a long time on a single thought.

The key insight is that the most powerful Agentic AI systems use a highly-agentic Gen AI model (like Claude Sonnet or Kimi K2) as their core decision-maker, and then wire in other, more precise Gen AI models (the "Oracles") as tools for specific tasks, like research or summarization.

The Agent's Heartbeat: The Inferencing Loop

An agent's core function is an elegant, continuous loop. It's the same loop that powers every AI chat application, but with one critical addition: the ability to execute tools.

  1. User Input: The agent takes a prompt from the user.
  2. Inference: It sends the prompt, along with the entire conversation history, to the LLM.
  3. Response Analysis: It receives a response. If the response is a direct answer, it's printed to the user.
  4. Tool Execution: If the response is a "tool use" signal, the agent interrupts the conversation, executes the specified tool (a local function), and then sends the result back to the LLM to continue the conversation.

This simple, self-correcting loop is the engine that drives all agentic behavior.
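
The four steps above can be sketched in a few dozen lines. Everything here (`fake_llm`, the tool registry, the message format) is an illustrative stand-in; a real agent would call an actual model API that returns tool-use signals:

```python
# Minimal agent loop with a stubbed LLM. The stub returns either a direct
# answer or a "tool use" signal, mimicking a real model's behaviour.

def read_file(path: str) -> str:
    """Example tool: return file contents (stubbed here)."""
    return f"<contents of {path}>"

TOOLS = {"read_file": read_file}

def fake_llm(history):
    """Stand-in for a model call, deciding based on the last message."""
    last = history[-1]["content"]
    if last.startswith("tool_result:"):
        return {"type": "answer", "text": f"Summary of {last}"}
    if "read" in last:
        return {"type": "tool_use", "tool": "read_file", "args": {"path": "app.py"}}
    return {"type": "answer", "text": "Hello!"}

def agent_loop(user_input: str) -> str:
    history = [{"role": "user", "content": user_input}]
    while True:
        response = fake_llm(history)          # 2. inference on full history
        if response["type"] == "answer":      # 3. direct answer: we're done
            return response["text"]
        tool = TOOLS[response["tool"]]        # 4. tool-use signal: run the
        result = tool(**response["args"])     #    local function and feed
        history.append(                       #    the result back in
            {"role": "user", "content": f"tool_result:{result}"}
        )

print(agent_loop("please read app.py"))
```

Swap `fake_llm` for a real model call and `TOOLS` for real functions, and this is the entire skeleton of a working agent.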

The Building Blocks: Primitives of a Coding Agent

The power of an agent comes from its tools. A tool is simply a local function with a description, or "billboard," that you register with the LLM. The LLM's training nudges it to call these functions when it believes they are relevant to the user's request. The workshop demonstrates five fundamental tools for building a coding agent.

  • Read Tool: This tool reads the contents of a file into the context window. It's the first primitive, allowing the agent to analyze existing files.

  • List Tool: This tool lists files and directories, giving the agent awareness of its environment, much like an engineer running ls to get their bearings.

  • Bash Tool: This powerful tool allows the agent to execute shell commands, enabling it to run scripts, check processes, or interact with the system. It's the key to making the agent's work actionable.

  • Edit Tool: This tool allows the agent to modify files by finding and replacing strings or creating new files. When combined with the other tools, it completes the agent's ability to act on the codebase.

  • Search Tool: This tool uses an underlying command-line tool like ripgrep to search the codebase for specific patterns. It helps the agent quickly find relevant code without having to read every file.

Putting It All Together: The FizzBuzz Example

By combining these primitives, an agent can perform complex, multi-step tasks. In his talk, Geoffrey illustrated this by having an agent solve the classic programming problem of FizzBuzz. This is a classic programming exercise that requires a program to print numbers from 1 to 100, with a few simple exceptions: for multiples of three, print "Fizz" instead of the number; for multiples of five, print "Buzz"; and for numbers that are multiples of both three and five, print "FizzBuzz."
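
As a reference point, here is a Python equivalent of the script the agent is asked to produce (the talk's demo generated Node.js):

```python
# Classic FizzBuzz: multiples of 3 -> "Fizz", of 5 -> "Buzz", of both -> "FizzBuzz".
def fizzbuzz(n: int) -> list[str]:
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out

print("\n".join(fizzbuzz(15)))
```

Trivial for a human, but a useful benchmark for an agent because verifying the output requires actually running the script.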

By giving the agent the prompt, "Hey Claude, create fizzbuzz.js that I can run with Nodejs and has fizzbuzz in it and executes it," we are asking it to orchestrate a multi-step process. The agent will use the Edit Tool to create the file and then the Bash Tool to execute the script and verify the output. The speaker then took it a step further, asking the agent to amend the code to only run to a specific number. The agent successfully handled this by using the Read Tool to check the existing code, the Edit Tool to change it, and the Bash Tool again to verify the new output. This ability to continuously loop back on itself to correct and refine its work is the key to a true agentic system.

The Career Implications for Engineers

In the last six months, AI has become "incredibly real," and the ability to build these systems lets us become producers of automation, not just consumers of it. And here's the best part: the skills are completely transferable. The same principles used for a code-editing agent can be applied to automating data pipelines, CI/CD workflows, database management or even parts of our core job functions.

The final message from the talk was super clear: this technology is here, and it’s surprisingly accessible. As engineers, our value in the coming years is going to be defined by our ability to use and produce automation. The most successful Engineers aren't the ones who fear AI, but the ones who embrace it and learn to build these powerful tools. There’s nothing mystical about agents; they're just an elegant loop built on a few core principles. The next step is to start building one yourself.

Further Resources

https://ghuntley.com/agent/
https://ampcode.com/how-to-build-an-agent

Order Dispatch Systems at Scale

How to load-balance like a seasoned waiter

Software systems often parallel the real world. Imagine running a busy restaurant, where customers line up to make orders whilst the kitchen prepares the meals. In the software world, your users are the customers, and your backend services are the kitchen. With more people online than ever before, that line might start to grow out the front door. The ability to scale is no longer optional, it is essential.

Know Your Options

Vertical Scaling - Scale Up
Expanding your restaurant by adding more tables or a larger kitchen. In software terms, this means scaling up your infrastructure: more powerful CPUs, larger memory, increased throughput. This is a relatively quick fix, but comes with diminishing returns and limits on how big everything can get.

Horizontal Scaling - Scale Out
Opening new restaurant locations to serve more customers simultaneously and distribute existing flows. In software terms, this means adding more API servers, more worker nodes or creating many database replicas. This approach is more flexible and scalable than vertical scaling in the long term.

A Steppingstone - Queues

Just like how customers queue for their order, we create a pull-based task queue for our order management system using AWS Simple Queue Service (SQS). Tasks get queued into the SQS, and a consumer service will continuously poll this queue to process the tasks.

This gives the queue consumer a lot of control over polling frequency, which works well for systems that cannot handle high throughput or that require non-concurrent access, like the SAP ERP (more on that later). SQS also provides built-in dead-letter queues, retry policies, an at-least-once delivery guarantee, and automatic scaling.

Vertical scaling involves sizing up the compute power of the consumer (CPU, RAM etc). Horizontal scaling involves spinning up more consumers of the SQS.
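
To make the polling model concrete, here is a minimal sketch with an in-memory queue standing in for SQS (with boto3 you would call `receive_message` and `delete_message` instead; all names here are illustrative):

```python
# Pull-based consumer: the consumer decides when and how much to fetch.
from collections import deque

class InMemoryQueue:
    """Stand-in for an SQS queue."""
    def __init__(self):
        self._messages = deque()

    def send(self, body: str):
        self._messages.append(body)

    def receive(self, max_messages: int = 10) -> list[str]:
        """Return up to max_messages, removing them from the queue."""
        batch = []
        while self._messages and len(batch) < max_messages:
            batch.append(self._messages.popleft())
        return batch

def process_order(order: str) -> str:
    return f"dispatched {order}"

def poll_once(queue: InMemoryQueue) -> list[str]:
    """One polling cycle: fetch a batch, process each task."""
    return [process_order(m) for m in queue.receive()]

q = InMemoryQueue()
q.send("order-1001")
q.send("order-1002")
print(poll_once(q))
```

The key property is visible here: the consumer sets the pace. Scaling out simply means running more copies of `poll_once` against the same queue, which is where the visibility-timeout concerns below come in.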

However, queues have limitations:

  • Latency between order arrival and processing. 

  • Inefficient polling process that checks constantly even when the queue is empty.

  • Limited fan-out: each message is designed to be consumed by one service, making it difficult for different components to react to the same event.

  • Concurrency issues: the message visibility timeout must be tuned so a message isn’t picked up by another consumer instance when scaling out.

The Gold Standard - Events

Scalability starts at the architectural level; enter event-driven programming. Instead of queuing up, customers scan a QR code and their order is sent instantly to the kitchen, the waiter and the pay desk all at once. No delay, no queues.

We recreate this by having an event publisher send messages to an event bus, which notifies the event subscribers (warehouse, emailing service and SAP) simultaneously, allowing them to react to the same order independently. Adding the vertical and horizontal scaling options mentioned above creates a powerful system for processing and dispatching orders, as the separate components can be scaled independently. This also lends itself well to a microservices architecture.

This model mitigates a lot of the previous approach’s deficiencies:

  • Lower latency, as it is push based not pull based.

  • Subscribers can be scaled individually.

  • Event bus is built to handle concurrency.

There are two ways to implement this in AWS: EventBridge and SNS (Simple Notification Service). We chose EventBridge for its ability to handle more complex workflows and its native integrations with third-party SaaS applications like Zendesk.

Unlike the SNS approach, where messages must be published to a specific topic and the number of topics risks growing too large, EventBridge receives events from many sources at once. With advanced filtering capabilities, it can inspect the full event payload and route events to the appropriate consumers. Event archiving and replays are also supported for improved debugging.

Here is the basic implementation:

  • Publish events to the EventBridge from your application

  • Define Event Rules that filter events based on their payloads

  • Configure Targets for each rule - they can have multiple targets

Our target will be SQS, as it allows our preexisting .NET services to plug into the new event-driven system without major modifications. However, serverless Lambda functions are on the table if we can remove the dependency on SAP, more on this later.
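
The routing behaviour those three steps describe can be sketched in miniature. This is a simplified stand-in: real EventBridge patterns support far richer matching than exact values, and the queue names here are invented:

```python
# EventBridge-style routing in miniature: a rule inspects the event payload
# and fans the event out to every configured target.
import json

def matches(rule_pattern: dict, event: dict) -> bool:
    """True if every field in the pattern lists an allowed value of the event."""
    return all(event.get(field) in allowed
               for field, allowed in rule_pattern.items())

# One rule, two targets (stand-ins for SQS queues).
rule = {"detail-type": ["OrderPlaced"], "source": ["storefront"]}
targets = {"warehouse-queue": [], "email-queue": []}

def publish(event: dict):
    """Deliver a matching event to all targets of the rule."""
    if matches(rule, event):
        for queue in targets.values():  # fan out: every target gets a copy
            queue.append(json.dumps(event))

publish({"source": "storefront", "detail-type": "OrderPlaced",
         "detail": {"id": 1001}})
publish({"source": "backoffice", "detail-type": "StockAdjusted",
         "detail": {"sku": "A1"}})  # no rule matches: dropped

print(len(targets["warehouse-queue"]), len(targets["email-queue"]))
```

This captures the two properties that made EventBridge attractive for us: content-based filtering on the payload, and multiple independent targets per rule.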

While powerful, event-driven architecture is not without its drawbacks:

Debugging and tracing: Events are asynchronous and loosely coupled, making it difficult to trace cause and effect. Comprehensive logging and distributed tracing need to be set up.

Eventual consistency: System components may be temporarily out of sync, making behaviour harder to understand. Logic must also be built to handle stale data gracefully.

Event schema evolution: Changing the payload structure can break downstream services. How the payload is consumed must be clearly documented, alongside a versioning strategy.

Despite these challenges, with the right tooling and implementation, event-driven architectures can be made highly observable, testable and resilient. 

A Slow Chef in the Kitchen

Sometimes the bottleneck isn’t your system but its external dependencies. In our case, SAP is that slow chef. Its Data Interface (DI) API is single-threaded and does not support batch processing; throw too many requests at it and it will choke, no matter how fast the other components are. Identifying such bottlenecks is crucial, lest all the other scaling efforts go to waste.

Luckily, in our case we can upgrade SAP from the DI API to a modern alternative called the Service Layer. It is designed with scalability in mind:

  • Uses HTTP and OData protocols

  • Can parallel process

  • Automatic load-balancing

  • Does not require a local installation like the DI API does

These properties make it much easier to develop web and mobile applications which are more accessible than the SAP Windows client. The service layer’s more stateless nature lends nicely to the aforementioned event-driven architecture, bringing SAP in line with the rest of our scalable system.

Conclusion

Just like a restaurant, software systems should be designed with future scalability in mind. Start with simple abstractions like task queues and evolve toward fully decoupled event-driven systems. Horizontal scaling is often more flexible than vertical scaling in the long run. When faced with external bottlenecks, tackle them head-on. Architecture is the business: with intentional design, your restaurant won’t just keep up, it will thrive.

Empowering Data Through Self-Service: Behind the Scenes of Our Data Platform

At Kogan.com, our data needs have grown alongside the business. As more teams relied on insights to move quickly, it became clear our request-based BI model couldn’t scale. We needed a platform that empowered teams to answer their own questions, trust the numbers, and move independently. That journey led us to build a self-service platform grounded in governance, transparency, and scalability—powered by dbt, Looker, and Acryl (DataHub).

Rethinking Our BI Model

We originally relied on Tableau. It served us well but had limitations: duplicated logic, inconsistent metrics, and limited collaboration with dbt. Tableau workbooks weren’t version-controlled, which made maintaining consistency difficult. To bridge modeling and reporting, we often created extra presentation tables in dbt, adding complexity. We needed a platform that integrated tightly with dbt and supported governed exploration.

A New Architecture: Modular, Transparent, Scalable

We redesigned the platform around a clean, modular flow: Raw Sources → BigQuery → dbt → Looker → Acryl (DataHub). Our data transformations are built in dbt, where we follow a layered modeling structure. While we use stg_ (staging) and int_ (intermediate) models primarily for data cleaning and standardization, the marts_ models are the ones that power our analysis and reporting. These models contain our fact and dimension tables, fully aligned with business logic and ready for consumption in Looker. We’ve integrated CI/CD pipelines using GitHub Actions, and every change is tested before deployment. This includes dbt tests, schema validations, and model documentation to ensure confidence at every layer.

Why Looker Was the Right Fit for Self-Service

Looker offered a structured, governed approach that aligned with our dbt-first architecture. LookML let us centralize business logic, version it with Git, and deploy changes through CI/CD. With support for multiple environments (UAT and Production), we can test safely before releasing to users. The Explore interface gives business users guided access to curated datasets—no SQL required. Users can drill down, apply filters, and explore confidently. This was a big shift from the Tableau model, which often required analyst support. Looker also includes row-level security, role-based access, and an AI assistant that supports natural language queries and chart generation—lowering the barrier for non-technical users. We’ve also developed internal dashboard standards—consistent layouts, filters, and naming conventions—to ensure usability and reduce support needs.

Bridging the Migration

We’re currently in the process of migrating our Tableau reports into Looker. While Looker is more efficient to build with, the migration isn’t just about re-creating dashboards—we’re using it as an opportunity to improve them. For each report we migrate, we review and sometimes refactor the associated dbt models to ensure the logic is clean, reusable, and well-documented. We also take time to redesign visual layouts to be more intuitive and self-service friendly—adding better filters, descriptive labels, and drill-down paths wherever we can. It’s not just a tech migration—it’s a platform and user experience upgrade.

Data Discovery and Observability with Acryl

Alongside dbt and Looker, Acryl (DataHub) has become the foundation of our metadata ecosystem. Acryl helps both technical and non-technical users understand what data exists, where it comes from, and who owns it. It provides searchable documentation, field descriptions, ownership metadata, and lineage tracing across dbt, BigQuery, and Looker. We also rely on Acryl’s observability features for monitoring anomalies and surfacing potential data quality issues. While it's not a test framework like dbt or a freshness tracker, Acryl helps us detect behavioral anomalies, unexpected changes, and broken relationships before they impact end users. Acryl's AI-powered documentation suggestions have also saved us time when onboarding new models or enhancing existing ones, especially for adding descriptions and tags at scale.

Lessons We’ve Learned

If there’s one takeaway, it’s that data tools need infrastructure—both technical and human. You can’t just launch Looker and expect adoption. You need a warehouse that reflects the business, models that users can trust, documentation that’s visible, and governance that feels supportive, not restrictive. We also learned that governance isn’t about locking things down—it’s about making things clear. When users understand what data means, how it’s calculated, and who owns it, they feel empowered, not limited.

Final Thoughts

We didn’t build a self-service platform just to save time—we built it to build trust. By aligning tools like dbt, Looker, and Acryl into a unified ecosystem, we’ve created something bigger than a data stack. We’ve created a culture where teams are empowered to explore, ask better questions, and make faster decisions—without sacrificing governance or quality. This transformation didn’t happen in a vacuum. It was made possible by the incredible efforts of my team—engineering, analytics, and enablement working hand-in-hand. The commitment to transparency, maintainability, and user empowerment is what brought this platform to life. We’re still learning. But we’re proud of how far we’ve come—and even more excited about where we’re going.