[Remote] Founding Sr. Operations Support Specialist

Remote · Peru · Full-time

Note: This is a remote position open to candidates in the USA. InSiteVerse is a US-based startup developing a FinTech mobile app that brings hedge-fund-grade trading intelligence to everyday investors. The Founding Sr. Operations Support Specialist will ensure the reliability and operational excellence of the platform, lead incident management, and establish a strong observability culture.

Responsibilities

  • Define and manage SLIs, SLOs, and error budgets for critical user journeys
  • Ensure high system availability, low latency, and minimal error rates
  • Proactively identify risks and implement strategies to prevent SLO breaches
  • Partner with engineering to balance reliability against feature velocity
  • Act as Incident Commander for high-severity (P0/P1) incidents
  • Lead real-time war rooms, ensuring rapid issue resolution
  • Own the full incident lifecycle: detection → response → recovery → RCA → prevention
  • Establish and enforce incident response frameworks, SLAs, and escalation policies
  • Drive blameless postmortems and continuous improvement
  • Monitor and analyze observability dashboards across cloud, analytics, and application layers to identify infrastructure issues, detect application downtime, and uncover system anomalies impacting reliability
  • Build dashboards and alerts for real-time system visibility
  • Correlate signals across infrastructure, application, and AI systems
  • Analyze trends from tickets, logs, and telemetry to detect systemic issues
  • Monitor AI-specific signals (model drift, inference latency, failures, anomalies)
  • Oversee intake of customer tickets, alerts, and operational signals
  • Define and manage priority classification (P0–P3) and response expectations
  • Resolve customer-impacting issues or coordinate with internal teams
  • Drive collaboration across AI, Backend, Frontend, Mobile, DevOps, QA, and Product
  • Define and optimize ticket workflows and escalation paths
  • Lead communication during incidents with both technical and non-technical stakeholders
  • Own the release calendar and operational readiness checks
  • Ensure monitoring, rollback plans, and risk assessments are in place
  • Validate system performance post-deployment
  • Build automated runbooks and self-healing systems
  • Reduce manual intervention through scripting and tooling
  • Improve system resilience using failover, scaling, and redundancy mechanisms

Skills

  • 10+ years in Production Support / SRE / Technical Operations
  • Strong understanding of SLO, SLI, SLA, and error budgets
  • Proven experience in incident management and troubleshooting distributed systems
  • Hands-on experience with cloud platforms (AWS & GCP)
  • Strong debugging and root cause analysis skills
  • Experience supporting mobile applications (iOS/Android)
  • Understanding of DevOps and SRE practices
  • Exposure to AI/ML systems and model behavior monitoring
  • Experience with log management and tracing systems
  • Monitoring & Observability: Azure Monitor, Prometheus, Grafana
  • Incident Management: PagerDuty, Opsgenie (or similar)
  • Scripting/Automation: Python, PowerShell, Bash
  • Logging: ELK Stack, Azure Monitor Logs, Splunk, or Datadog
  • Tracing: OpenTelemetry, Jaeger, Zipkin, or Azure Application Insights
  • Familiarity with low-latency or financial systems

Company Overview

  • We are a pre-seed startup based in Cincinnati, Ohio, with our registered office in North Canton, Ohio. The company was founded in 2025, is headquartered in Canton, Ohio, USA, and has a workforce of 2-10 employees. Our website is https://insiteverse.com.