
Deep Dive: The Microservices Migration

How a BBS emulator grew into a distributed system with gRPC services, NATS messaging, gateway routing, and real backends instead of localStorage.


The bug that made me rethink everything was embarrassingly simple: two people posted to the same bulletin board, and neither one could see the other’s messages. I stared at it for a while before I understood what was happening—each browser had its own localStorage, its own little universe of data, completely isolated from everyone else. Which is, of course, exactly how localStorage is supposed to work. Just not how a BBS is supposed to work.

That’s when I stopped pretending the emulator was a client-side toy. It needed to be a distributed system, which meant the state had to live somewhere authoritative, somewhere that wasn’t the browser. I’m still not entirely sure I picked the right architecture—you’ll see what I mean—but at least messages show up for everyone now.


Why I Went with Server Ownership

I spent a few days thinking about two approaches, and honestly neither one felt great.

The first was client-side replication: keep everything in localStorage but sync it across browsers via WebSockets. The appeal was obvious—no server infrastructure, no database, just peer-to-peer updates. But the more I thought about it, the messier it got. Every client owns the full dataset, so what happens when two people edit the same thing at the same time? You need conflict resolution, which means you need vector clocks or CRDTs or operational transforms, and suddenly your “simple” BBS emulator has a distributed consensus problem. I’ve been down that road before and it doesn’t end well. (I once spent three weeks debugging a collaborative editor that used operational transforms. The code worked perfectly until it didn’t, and when it didn’t, good luck figuring out why.)

The second approach was server ownership: put a database behind everything, build real APIs, and make the server the single source of truth. More infrastructure, more complexity, more things that can break at 3am. But also: no merge conflicts, no divergent state, no “your version says A but my version says B” conversations.

I went with server ownership because it maps to what a BBS actually is—a central system that everyone dials into. The browser becomes a terminal, not a peer. If you think about it that way, the architecture kind of designs itself. Whether that’s the right framing, I’m honestly not sure, but it’s the one I went with.

The cost is real infrastructure—Postgres, Redis (well, Dragonfly), NATS, three separate services. But the benefit is that when someone asks “who owns this message,” the answer is always “the messaging service,” and there’s never any ambiguity about it.


The Shape of the Stack

                                 ┌─────────────────┐
                                 │   PostgreSQL    │
                                 │   (Supabase)    │
                                 └────────┬────────┘

┌──────────┐     ┌─────────────┐    ┌─────┴────────┐
│  Browser │────►│   Gateway   │───►│   Services   │
│ Terminal │ WS  │   (Axum)    │gRPC│   (Tonic)    │
└──────────┘     └──────┬──────┘    └──────────────┘
                        │                  │
                 ┌──────┴──────┐    ┌──────┴──────┐
                 │  Dragonfly  │    │    NATS     │
                 │   (cache)   │    │  (events)   │
                 └─────────────┘    └─────────────┘

I have an Axum gateway that handles WebSocket connections from browsers and translates their REST requests into gRPC calls. Behind that sit three Rust services, each on its own port, each responsible for one slice of the system. The users service on 50051 handles profiles and online presence—the “who’s online” directory you’d expect from any BBS. The messaging service on 50052 does BBS boards, Usenet-style newsgroups, and private email, which are different enough that I almost split them into separate services but similar enough that I didn’t. (I’m still not sure that was the right call. When I inevitably add threading to email, I might regret lumping them together.) The market service on 50053 handles stock quotes, portfolios, and file downloads—basically anything that involves external data or user assets.
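
To make the split concrete, here's a minimal sketch of what a service entry point might look like, assuming the tonic-generated UsersServiceServer from the backend-proto crate and a hypothetical UsersServiceImpl::new constructor; the real main.rs does more wiring (cache, NATS, logging), and the module path and DATABASE_URL variable are guesses.

// Minimal sketch of a service entry point (not the real main.rs).
// The generated-module path and the UsersServiceImpl::new constructor are assumptions.
use backend_proto::users::users_service_server::UsersServiceServer;
use tonic::transport::Server;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // e.g. BIND_ADDR=0.0.0.0:50051 from docker-compose
    let addr: std::net::SocketAddr = std::env::var("BIND_ADDR")
        .unwrap_or_else(|_| "0.0.0.0:50051".to_string())
        .parse()?;

    // DATABASE_URL is an assumed variable name here
    let pool = sqlx::PgPool::connect(&std::env::var("DATABASE_URL")?).await?;
    let svc = UsersServiceImpl::new(pool);

    Server::builder()
        .add_service(UsersServiceServer::new(svc))
        .serve(addr)
        .await?;
    Ok(())
}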

The idea is that each service owns its domain completely. The gateway calls UsersService::GetProfile without knowing or caring how profiles are stored. It could be Postgres, it could be a flat file, it could be a guy in a basement with a Rolodex—the gateway doesn’t know and doesn’t need to.


The Users Service

The users service manages profiles and online presence. Here’s the structure I ended up with:

// server/services/users-service/src/service.rs
pub struct UsersServiceImpl {
    pool: PgPool,
    cache: Option<Cache>,
    /// Fallback in-memory store when cache unavailable
    online_users: Arc<RwLock<HashSet<String>>>,
}

That online_users field is interesting—it’s a fallback for when Redis isn’t available. I wanted the service to degrade gracefully instead of just failing when the cache goes down, because cache servers do go down, usually at the worst possible moment. (My Dragonfly instance has been rock solid so far, but I’ve been burned by Redis outages before and I don’t trust anything anymore.)
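
To show the shape of that degradation (not the actual implementation), here's a sketch of an online-status update that prefers the cache and falls back to the in-memory set. Cache::add_online is a hypothetical helper name, and I'm assuming the RwLock is tokio's async one:

// Sketch of graceful degradation for presence; Cache::add_online is a
// hypothetical helper, and RwLock is assumed to be tokio::sync::RwLock.
impl UsersServiceImpl {
    async fn mark_online(&self, handle: &str) {
        if let Some(cache) = &self.cache {
            // Preferred path: shared presence set in Dragonfly
            if cache.add_online(handle).await.is_ok() {
                return;
            }
            // Cache call failed mid-flight: fall through to the in-memory fallback
        }
        // No cache configured (or it just failed): process-local presence only
        self.online_users.write().await.insert(handle.to_string());
    }
}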

For profile lookups, I try the cache first and fall back to the database:

async fn get_profile(
    &self,
    request: Request<GetProfileRequest>,
) -> Result<Response<Profile>, Status> {
    let handle = request.into_inner().handle;

    // Try cache first
    if let Some(profile) = self.get_cached_profile(&handle).await {
        return Ok(Response::new(profile));
    }

    // Cache miss - fetch from database
    let row = sqlx::query(
        r#"
        SELECT id, account_id, handle, display_name, location, bio, interests,
               created_at, last_seen_at, is_sysop
        FROM bbs_profiles
        WHERE handle = $1
        "#,
    )
    .bind(&handle)
    .fetch_optional(&self.pool)
    .await
    .map_err(BackendError::from)?
    .ok_or_else(|| BackendError::NotFound(format!("Profile not found: {}", handle)))?;

    // ... convert row to Profile, cache it, return
}

I set the cache TTL to 5 minutes for profiles. That’s long enough that I’m not hammering the database every time someone views a profile, but short enough that when someone updates their bio, the change propagates reasonably quickly. I honestly picked 5 minutes somewhat arbitrarily—it felt right, but I haven’t done any real load testing to see if it’s optimal. If you’re reading this hoping for rigorous benchmarking, I don’t have any.
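
For reference, the cache helpers are roughly this shape: a sketch assuming hypothetical get and set_with_ttl methods on the Cache wrapper, plus serde derives on the generated Profile type (prost can be configured to emit those). None of these names are necessarily the real ones.

// Sketch of the profile cache path. Cache::get / Cache::set_with_ttl are
// hypothetical method names; profiles are stored as JSON under "profile:{handle}".
const PROFILE_TTL_SECS: u64 = 300; // the 5-minute TTL discussed above

impl UsersServiceImpl {
    async fn get_cached_profile(&self, handle: &str) -> Option<Profile> {
        let cache = self.cache.as_ref()?;
        // Result<Option<String>, _>: a cache error is treated the same as a miss
        let raw = cache.get(&format!("profile:{handle}")).await.ok()??;
        serde_json::from_str(&raw).ok()
    }

    async fn cache_profile(&self, profile: &Profile) {
        if let Some(cache) = &self.cache {
            if let Ok(json) = serde_json::to_string(profile) {
                // Best-effort write: a failed cache set never fails the request
                let key = format!("profile:{}", profile.handle);
                let _ = cache.set_with_ttl(&key, &json, PROFILE_TTL_SECS).await;
            }
        }
    }
}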


The Messaging Service

The messaging service handles three things: BBS boards, Usenet-style newsgroups, and private email. They’re similar enough that I didn’t want three separate services, but different enough that they each get their own proto definitions:

// server/crates/backend-proto/proto/messaging.proto
service MessagingService {
    // BBS
    rpc ListBoards(ListBoardsRequest) returns (ListBoardsResponse);
    rpc ListBoardMessages(ListBoardMessagesRequest) returns (ListMessagesResponse);
    rpc CreateMessage(CreateMessageRequest) returns (BbsMessage);
    rpc DeleteMessage(DeleteMessageRequest) returns (DeleteMessageResponse);

    // Usenet
    rpc ListGroups(ListGroupsRequest) returns (ListGroupsResponse);
    rpc ListGroupArticles(ListGroupArticlesRequest) returns (ListArticlesResponse);
    rpc CreateArticle(CreateArticleRequest) returns (UsenetArticle);

    // Email
    rpc ListMailbox(ListMailboxRequest) returns (ListMailboxResponse);
    rpc SendEmail(SendEmailRequest) returns (EmailMessage);
}

When someone posts a message, there’s a bit of ceremony involved—I need to verify they’re authenticated, check if the board is read-only, and then invalidate the cache so the board listing reflects the new message count:

pub async fn create_message(
    pool: &PgPool,
    cache: Option<&Cache>,
    request: Request<CreateMessageRequest>,
) -> Result<Response<BbsMessage>, Status> {
    let user_id = auth::require_user_id(&request)?;
    let req = request.into_inner();

    // Check if board is read-only
    let read_only: bool = sqlx::query_scalar("SELECT read_only FROM bbs_boards WHERE id = $1")
        .bind(req.board_id)
        .fetch_optional(pool)
        .await
        .map_err(BackendError::from)?
        .ok_or_else(|| BackendError::NotFound(format!("Board not found: {}", req.board_id)))?;

    if read_only {
        return Err(BackendError::Forbidden("Board is read-only".to_string()).into());
    }

    // Insert message...
    // Invalidate boards cache (message counts changed)
    invalidate_boards_cache(cache).await;

    // Return the created message
}

That cache invalidation is doing more work than it looks like. The board listings include message counts, so whenever someone posts or deletes a message, the cached board list becomes stale. I could get clever and just update the count in the cache, but invalidation is simpler and message posting isn’t exactly a high-frequency operation. I’m erring on the side of “correct and simple” rather than “fast and complicated.” We’ll see if that holds up.
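
The invalidation helper itself is tiny. Something like this sketch, where the cache key and Cache::delete method are assumptions rather than the real names:

// Sketch of the board-list invalidation; "bbs:boards" and Cache::delete are
// assumed names. A failed delete just means slightly stale counts until the TTL expires.
async fn invalidate_boards_cache(cache: Option<&Cache>) {
    if let Some(cache) = cache {
        let _ = cache.delete("bbs:boards").await;
    }
}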


The Gateway Layer

The gateway’s job is to translate REST to gRPC. I wanted the frontend to stay blissfully ignorant of the backend’s opinions about protocol buffers and RPC semantics—it just talks HTTP like a normal web app.

A typical handler looks like this:

// server/src/handlers/backend_bbs.rs
pub async fn create_message(
    State(state): State<MessagingClient>,
    Path(board_id): Path<i32>,
    headers: HeaderMap,
    Json(body): Json<CreateMessageBody>,
) -> Response {
    // Extract user_id from JWT
    let user_id = match extract_user_id(&headers, &state.jwt_config) {
        Ok(id) => id,
        Err((status, msg)) => return (status, msg).into_response(),
    };

    let mut client = state.client.clone();

    let mut request = tonic::Request::new(CreateMessageRequest {
        board_id,
        parent_id: body.parent_id.unwrap_or(0),
        subject: body.subject,
        body: body.body,
    });
    add_user_metadata(&mut request, &user_id);

    let result = client.create_message(request).await;

    match result {
        Ok(response) => {
            let m = response.into_inner();
            (StatusCode::CREATED, Json(SchemaBbsMessage::from(m))).into_response()
        }
        Err(e) => {
            let status = match e.code() {
                tonic::Code::NotFound => StatusCode::NOT_FOUND,
                tonic::Code::PermissionDenied => StatusCode::FORBIDDEN,
                tonic::Code::Unauthenticated => StatusCode::UNAUTHORIZED,
                _ => StatusCode::INTERNAL_SERVER_ERROR,
            };
            (status, e.message().to_string()).into_response()
        }
    }
}

I do JWT validation in the gateway, not in the services. If the token is invalid, the request never reaches gRPC at all, which keeps the services focused on business logic rather than auth ceremony. Whether this is the right split, I’m honestly still figuring out—there’s an argument for pushing auth validation all the way down to the services for defense in depth, but that means every service needs to understand JWTs, and I didn’t want that coupling.
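
For concreteness, here's roughly what those two helpers could look like, sketched with the jsonwebtoken crate and a minimal Claims struct. The JwtConfig.secret field, the "sub" claim, and the "x-user-id" metadata key are all assumptions, not necessarily what the real code uses.

// Sketch of the gateway auth helpers; field names, claim names, and the
// metadata key are assumed rather than taken from the real codebase.
use axum::http::{HeaderMap, StatusCode};
use jsonwebtoken::{decode, DecodingKey, Validation};

#[derive(serde::Deserialize)]
struct Claims {
    sub: String, // user id
    exp: usize,
}

fn extract_user_id(
    headers: &HeaderMap,
    jwt_config: &JwtConfig,
) -> Result<String, (StatusCode, String)> {
    let token = headers
        .get("authorization")
        .and_then(|v| v.to_str().ok())
        .and_then(|v| v.strip_prefix("Bearer "))
        .ok_or_else(|| (StatusCode::UNAUTHORIZED, "missing bearer token".to_string()))?;

    let data = decode::<Claims>(
        token,
        &DecodingKey::from_secret(jwt_config.secret.as_bytes()),
        &Validation::default(),
    )
    .map_err(|e| (StatusCode::UNAUTHORIZED, e.to_string()))?;

    Ok(data.claims.sub)
}

fn add_user_metadata<T>(request: &mut tonic::Request<T>, user_id: &str) {
    // Services read this key via auth::require_user_id on the other side.
    if let Ok(value) = user_id.parse() {
        request.metadata_mut().insert("x-user-id", value);
    }
}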

The error mapping from gRPC codes to HTTP status is explicit and deliberate: NotFound becomes 404, PermissionDenied becomes 403, Unauthenticated becomes 401, and everything else is a 500. I thought about being more granular, but most of the time a 500 with a message is good enough for debugging.


NATS for Real-Time Events

This is the part I’m least confident about, honestly. The idea is that NATS handles event distribution between services—when something changes, the service publishes an event, and anyone who cares can subscribe.

// server/crates/backend-common/src/nats.rs
pub mod topics {
    pub const BBS_MESSAGE_CREATED: &str = "backend.bbs.message.created";
    pub const BBS_MESSAGE_DELETED: &str = "backend.bbs.message.deleted";
    pub const USENET_ARTICLE_CREATED: &str = "backend.usenet.article.created";
    pub const EMAIL_MESSAGE_SENT: &str = "backend.email.message.sent";
    pub const USER_ONLINE: &str = "backend.users.online";
    pub const STOCK_QUOTE_UPDATED: &str = "backend.stocks.quote.updated";
}

The pattern works like this: services publish events when state changes, and the gateway subscribes to relay them to connected WebSocket clients. So if you’re watching board 3 and someone posts a message, you don’t have to poll—the messaging service publishes backend.bbs.message.created, the gateway receives it, and pushes an update to your browser.
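
Sketched with the async-nats client, both halves could look something like this. broadcast_to_watchers is a hypothetical fan-out helper, and I'm again assuming serde derives on the generated message type:

// Sketch of both halves of the event flow, assuming the async-nats client.
// Publish side (messaging service), called after the INSERT succeeds:
async fn publish_message_created(
    nats: &async_nats::Client,
    message: &BbsMessage,
) -> Result<(), async_nats::Error> {
    let payload = serde_json::to_vec(message)?;
    nats.publish(topics::BBS_MESSAGE_CREATED.to_string(), payload.into())
        .await?;
    Ok(())
}

// Subscribe side (gateway): relay each event to the WebSocket sessions that care.
async fn relay_bbs_events(nats: async_nats::Client) -> Result<(), async_nats::Error> {
    use futures::StreamExt;

    let mut sub = nats.subscribe(topics::BBS_MESSAGE_CREATED.to_string()).await?;
    while let Some(msg) = sub.next().await {
        // broadcast_to_watchers is a hypothetical helper that pushes the payload
        // to every connected client watching the relevant board.
        broadcast_to_watchers(&msg.payload).await;
    }
    Ok(())
}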

What I like about this is the decoupling. The messaging service doesn’t need to know which clients are watching which boards. It just publishes to NATS and moves on. The gateway handles the fan-out. In theory this should scale nicely because adding more gateway instances doesn’t change anything about the services—they’re still just publishing events into the void.

What I’m less sure about is whether this is overengineered for my actual scale. Right now I have, what, maybe a dozen concurrent users on a good day? I could probably poll and be fine. But the NATS setup wasn’t that hard, and it feels like the right foundation if this ever grows. (Famous last words. I’ve said “this will scale when we need it to” about many things that later turned out not to scale at all.)


Docker Compose Deployment

The whole stack runs in containers, which makes local development surprisingly pleasant—docker compose up and you’re running the same thing that runs in production, more or less.

# docker-compose.yml (simplified)
services:
  users-service:
    build:
      context: .
      dockerfile: Dockerfile
      target: users-service
    environment:
      - BIND_ADDR=0.0.0.0:50051
      - NATS_URL=nats://nats.devlab.ca:4222
      - REDIS_URL=redis://dragonfly:6379
    ports:
      - "50051:50051"

  messaging-service:
    build:
      context: .
      dockerfile: Dockerfile
      target: messaging-service
    environment:
      - BIND_ADDR=0.0.0.0:50052
      - NATS_URL=nats://nats.devlab.ca:4222
      - REDIS_URL=redis://dragonfly:6379
    ports:
      - "50052:50052"

  websocket:
    environment:
      - USERS_SERVICE_URL=http://users-service:50051
      - MESSAGING_SERVICE_URL=http://messaging-service:50052
      - MARKET_SERVICE_URL=http://market-service:50053
    depends_on:
      - users-service
      - messaging-service
      - market-service

  dragonfly:
    image: docker.dragonflydb.io/dragonflydb/dragonfly:latest
    ports:
      - "6379:6379"

I’m using Dragonfly instead of Redis. Dragonfly is a Redis-compatible cache that apparently uses less memory and handles concurrent access better for certain workloads. I’ll be honest, I switched mostly because someone recommended it and it was a drop-in replacement—the services don’t know the difference since they just talk the Redis protocol. Whether it’s actually better for my specific use case, I couldn’t tell you. It works, and that’s enough for now.


The Proto Definitions

I define gRPC types in Protocol Buffers, and the tooling compiles them at build time:

// server/crates/backend-proto/proto/users.proto
syntax = "proto3";
package backend.users;

service UsersService {
    rpc ListProfiles(ListProfilesRequest) returns (ListProfilesResponse);
    rpc GetProfile(GetProfileRequest) returns (Profile);
    rpc UpdateProfile(UpdateProfileRequest) returns (Profile);
    rpc GetOnlineUsers(GetOnlineUsersRequest) returns (GetOnlineUsersResponse);
    rpc SetOnlineStatus(SetOnlineStatusRequest) returns (SetOnlineStatusResponse);
}

message Profile {
    string id = 1;
    string account_id = 2;
    string handle = 3;
    string display_name = 4;
    string location = 5;
    string bio = 6;
    repeated string interests = 7;
    string created_at = 8;
    string last_seen_at = 9;
    bool is_sysop = 10;
}

The backend-proto crate generates Rust types from these definitions, which means type-safe RPC without hand-written serialization code. If the proto changes but I forget to update a handler somewhere, the compiler yells at me, and that’s exactly what I want. I’ve worked on systems where API contracts were enforced by “please remember to update both sides,” and it never works. (I once shipped a breaking change to production because I updated the client but forgot the server. Took us three hours to figure out why half the requests were failing.)
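
The build script that does the generating is short. Here's a minimal sketch with tonic-build; the real one may add serde derives or other options, and the market proto filename is a guess.

// server/crates/backend-proto/build.rs (sketch; options and file list may differ)
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Regenerates the Rust types whenever a .proto definition changes.
    tonic_build::configure().compile(
        &["proto/users.proto", "proto/messaging.proto", "proto/market.proto"],
        &["proto"],
    )?;
    Ok(())
}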


What This Architecture Actually Buys Me

So what was the point of all this? I had a bunch of constraints, some explicit and some I only realized partway through.

The main one was that shared state had to live on the server—no more localStorage universes where everyone has their own copy of reality. But I also wanted the browser to keep its REST ergonomics, because I didn’t want to rewrite the frontend to speak gRPC directly. And I didn’t want services knowing about each other, because that kind of coupling always bites you later when you try to change something. Oh, and local development needed to be a single command, because if it takes more than docker compose up to run the thing, I’ll never actually run it.

That combination of constraints pushed me toward this gateway-with-event-bus pattern. It’s not original—it’s basically the standard microservices playbook—but knowing the constraints helped me avoid overthinking it.

The interesting part isn’t really the technology choices. gRPC and NATS are well-understood; I’m not doing anything clever with them. What matters is the ownership boundaries. When a message doesn’t appear, I know to look at the messaging service. When presence is wrong, I check the users service. Each service owns one domain and does it completely, which makes debugging a lot more tractable than poking through a monolith where everything touches everything.

That clarity does cost some wiring. There’s a lot of boilerplate in the gateway handlers, and adding a new endpoint means touching three places instead of one. But the alternative—a monolith where profiles and messages and stocks all live in the same codebase, sharing the same database connection pool, stepping on each other’s toes—would be harder to debug and harder to evolve. I’ve built systems like that before. They’re fine until they’re not, and by the time they’re not, you’re stuck.

Whether this was the right call for a hobby project with a dozen users, I honestly don’t know. It’s probably overengineered. But it was fun to build, and now when two people post to the same board, they actually see each other’s messages. That’s worth something.


See also: Journey Day 10: Microservices & Market Data — when the services went live.

See also: Deep Dive: Passkey Authentication — for the auth side of this story.