Summary
A startup team struggled with cron job sprawl. Scheduled tasks were spread across multiple environments and handled through a mix of custom scripts, serverless timers, and brittle infrastructure. Engineering owned all of it. Even simple requests from marketing required code changes. By shifting to event-based scheduling, the team removed operational bottlenecks, improved reliability, and gave business users more control without compromising stability or visibility.
The problem: cron everywhere, ownership nowhere
Recurring tasks powered a range of workflows: weekly digests, monthly campaigns, and daily data exports. But the scheduling model behind them had become unmanageable.
- Cron jobs were scattered across five systems
- Some ran in Lambdas, others in ECS, a few on leftover EC2 instances
- Timezones, retries, and failures were handled inconsistently
- Schedules were undocumented or buried in Notion
- Marketing had to ask engineers to make even minor changes
Jobs were often forgotten until something broke, and then the failure became a distraction for everyone.
Previous fixes didn’t reduce the load
The team tried several improvements:
- Built an internal dashboard to track jobs
- Migrated some logic to serverless functions
- Gave business teams access to low-code tooling
- Documented job ownership and schedules
These solutions added structure but not simplicity. Scheduling still lived outside the core system. Most changes still required a deployment or a manual update. Visibility was limited, and reliability remained unpredictable.
The shift: scheduling as an event pattern
Instead of treating scheduling as an external system, the team made it part of their event model.
Each task became part of a larger event flow:
- Jobs emitted a reschedule event after execution
- Delays were expressed in natural syntax, like +7d or next Monday
- Event metadata tracked timing, delivery, and fallback behaviour
- Failures emitted follow-up events for retry or escalation
- Dashboards showed what was scheduled, what had run, and what had failed
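To make the delay syntax concrete, here is a minimal sketch of how expressions such as +7d or next Monday could be resolved into a run time. The grammar, the unit letters, and the name resolve_delay are assumptions for illustration; the team's actual parser is not described in the source.

```python
import re
from datetime import datetime, timedelta

# Hypothetical grammar: "+<n><unit>" for relative delays, "next <weekday>" for named days.
_UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}
_WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def resolve_delay(expr: str, now: datetime) -> datetime:
    """Turn a delay expression into an absolute run time relative to `now`."""
    text = expr.strip().lower()

    relative = re.fullmatch(r"\+(\d+)([mhdw])", text)
    if relative:
        amount, unit = int(relative.group(1)), relative.group(2)
        return now + timedelta(**{_UNITS[unit]: amount})

    named = re.fullmatch(r"next (\w+)", text)
    if named and named.group(1) in _WEEKDAYS:
        target = _WEEKDAYS.index(named.group(1))
        # Days until the next occurrence; if today is the target day, jump a full week.
        days_ahead = (target - now.weekday() - 1) % 7 + 1
        # Assumption: named days resolve to the start of that day.
        midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
        return midnight + timedelta(days=days_ahead)

    raise ValueError(f"unrecognised delay expression: {expr!r}")
```

Passing `now` explicitly, ideally as a timezone-aware datetime, keeps the resolution testable and makes the timezone choice visible to the caller instead of leaving it implicit.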
They no longer needed cron daemons or scheduling services. Timing was declarative. Execution was event-driven.
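The reschedule pattern can be sketched with a small in-memory queue. This is an illustration under stated assumptions, not the team's implementation: Event, EventQueue, send_weekly_digest, and the interval_days metadata field are all hypothetical names, and a production system would use a durable event bus with delayed delivery instead of a heap in memory.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable


@dataclass(order=True)
class Event:
    run_at: datetime                      # only this field is used for ordering
    name: str = field(compare=False)
    metadata: dict = field(default_factory=dict, compare=False)


class EventQueue:
    """In-memory stand-in for an event bus that supports delayed delivery."""

    def __init__(self) -> None:
        self._heap: list[Event] = []

    def emit(self, event: Event) -> None:
        heapq.heappush(self._heap, event)

    def pending(self) -> list[Event]:
        """Everything still scheduled, earliest first."""
        return sorted(self._heap)

    def run_due(self, now: datetime,
                handlers: dict[str, Callable[[Event, "EventQueue"], None]]) -> list[str]:
        """Deliver every event whose run_at has passed; return the names delivered."""
        delivered = []
        while self._heap and self._heap[0].run_at <= now:
            event = heapq.heappop(self._heap)
            handlers[event.name](event, self)
            delivered.append(event.name)
        return delivered


def send_weekly_digest(event: Event, queue: EventQueue) -> None:
    # ... the real work (building and sending the digest) would happen here ...
    # Afterwards the job emits its own reschedule event: the next run is just another event.
    interval = timedelta(days=event.metadata["interval_days"])
    queue.emit(Event(event.run_at + interval, event.name, event.metadata))
```

Because the next run exists only as a pending event, changing a schedule means changing event metadata, and `pending()` gives the kind of view a dashboard of scheduled work would need.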
Business teams took back control
Marketing and operations teams could now:
- Define and manage recurring workflows
- Launch time-based campaigns without asking engineering
- Adjust timing through simple inputs instead of a deployment
- Run experiments without needing backend support
At the same time, engineers benefited from:
- Fewer interrupt-driven support tasks
- Consistent retry and fallback handling
- A single source of truth for scheduled workflows
- Easier monitoring and debugging
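One way to get consistent retry and fallback handling is to route every failure through a single function that decides on the follow-up event. The sketch below assumes a policy of three attempts with exponential backoff before escalation; those numbers, the attempt metadata field, and the ".escalated" naming are illustrative assumptions, not details from the case study.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

MAX_ATTEMPTS = 3                      # assumed policy
BASE_BACKOFF = timedelta(minutes=5)   # assumed policy


@dataclass
class Event:
    name: str
    run_at: datetime
    metadata: dict = field(default_factory=dict)


def follow_up_for_failure(failed: Event, now: datetime) -> Event:
    """Map a failed event to its follow-up: a delayed retry, or an escalation."""
    attempt = failed.metadata.get("attempt", 1)
    if attempt < MAX_ATTEMPTS:
        # Exponential backoff: 5 minutes after the first failure, 10 after the second.
        delay = BASE_BACKOFF * 2 ** (attempt - 1)
        return Event(failed.name, now + delay, {**failed.metadata, "attempt": attempt + 1})
    # Out of retries: emit an escalation event so a person can take over.
    return Event(f"{failed.name}.escalated", now, {**failed.metadata, "attempt": attempt})
```

Keeping the policy in one place means every job retries and escalates the same way, whichever team defined it.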
What changed
- The team removed five cron systems in less than a week
- Marketing took ownership of their campaign scheduling
- Scheduled task failures became easier to detect and fix
- Engineers no longer needed to be involved in routine scheduling
- Observability and reliability improved across environments
Why it matters
Scheduling often starts simple but becomes fragile as systems grow. Cron is easy to set up and hard to scale.
This team treated scheduled tasks like any other part of their system. By using events to coordinate future work, they built something that was easier to reason about, safer to change, and more reliable under load. Time became just another input, not a special case hidden in infrastructure.