Job Description
Platform
Gelato is an AI-first, cloud-native platform transforming local, on-demand production and delivery — with zero physical assets. We power:
Global production orchestration: connecting e-commerce customers to 130+ partners across 32 countries. From AI-guided plate design to smart routing and logistics, we deliver to over 5 billion people in under 72 hours: faster, greener, smarter.
GelatoConnect: our AI-driven print production OS that helps producers master digital printing — from inventory to shipping — by automating every step: procurement, prepress, printing, packaging, and dispatch. It’s machine-agnostic, real-time, and built for uptime, precision, and profitability.
AI at the Core
We are not just early adopters of AI — we’re AI-native builders. Across engineering, product, and operations, we embrace AI to accelerate innovation:
AI-first development using LLMs for code generation, testing, documentation, and design exploration.
Autonomous AI agents driving platform resilience, observability, cost control, and internal workflow automation.
AI-native product features built into GelatoConnect and the core platform — improving routing, personalization, and production optimization.
A culture of continuous experimentation with RAG, agent orchestration, prompt engineering, and model fine-tuning.
Responsibilities
Ensure 24/7 platform operations, including on-call support and incident response.
Build and evolve AI agents that automate scaling, recovery, and diagnostics through control-plane patterns and event-driven logic.
Manage multi-cloud infrastructure, including serverless, through declarative, automated workflows.
Collaborate with the Delivery Platform team to improve DevEx with intelligent internal tooling.
Drive reliability, observability, and cost-efficiency across Gelato’s global infrastructure.
Empower engineers with self-service, secure, and scalable delivery pipelines.
Qualifications
Deep understanding of Linux fundamentals (processes, I/O, filesystems, networking, memory management, system internals).
Deep experience with AWS (primary) and GCP (bonus), including cost optimization, EKS, networking, S3, VPC peering, IAM, and related services.
Strong understanding of Kubernetes internals, including networking, security, and cluster operations.
Hands-on experience with GitOps workflows (e.g. ArgoCD) and Infrastructure as Code using Terraform.
Proficiency in Python, Go, or another programming language, with a willingness to learn Go if not already familiar.
Comfortable coding with LLMs and using AI development tools such as Cursor and Claude.
Experience with, or strong interest in, building AI agents and RAG workflows, with a commitment to learning where needed.