Technical Lead (DevOps & Infrastructure Focus) - Vice President
Job Req Id: 25923401
Location(s): Mississauga, Ontario, Canada
Job Type: Hybrid
Posted: Jan. 08, 2026
Discover your future at Citi
Working at Citi is far more than just a job. A career with us means joining a team of more than 230,000 dedicated people from around the globe. At Citi, you’ll have the opportunity to grow your career, give back to your community and make a real impact.
Job Overview
We are seeking a highly skilled and experienced individual to fill a unique hybrid role that combines senior-level DevOps and Infrastructure Engineering with the responsibilities of a Working Scrum Master. This position is for a hands-on engineer who actively contributes to the design, implementation, and maintenance of our infrastructure and automation, while simultaneously facilitating the agile development process for their technical team. The ideal candidate will be a strong technical leader, a passionate advocate for agile practices, and a driver of continuous improvement within a complex engineering environment. This role is for someone who thrives on both coding and coaching, with an additional understanding of the infrastructure needs and operational considerations for Artificial Intelligence and Machine Learning initiatives.
Responsibilities
Hands-on DevOps & Infrastructure Engineering
Design & Implementation: Lead the design, implementation, and ongoing management of secure, scalable, and resilient infrastructure components.
Secret & Certificate Management: Administer and maintain secret and certificate management solutions using HashiCorp Vault, including policy definition and integration.
Database Management: Perform hands-on administration and optimization of database systems (PostgreSQL, Oracle, MongoDB), including performance tuning, backup, and recovery strategies.
Workflow Orchestration: Deploy, monitor, and troubleshoot data orchestration workflows using Apache Airflow, and develop/optimize DAGs.
Messaging Systems: Implement and manage messaging queues such as Kafka and IBM MQ, including cluster setup and configuration.
API Integrations: Develop, maintain, and troubleshoot RESTful API and SOAP integrations critical for system connectivity.
Build Automation: Implement and optimize build and deployment processes using Gradle.
Container Orchestration: Design, implement, and manage container orchestration platforms with Kubernetes and Helm, including integration with CyberArk and HashiCorp Vault for secrets management. Create, debug, and troubleshoot Kubernetes Pods, Jobs, and Deployments using YAML.
Storage Management: Configure and manage persistent storage solutions including PVC, SONiC NAS, and S3, with an awareness of storage requirements for AI/ML workloads.
Networking & Load Balancing: Set up and maintain load balancing solutions (e.g., Nginx, HAProxy, AWS ELB/ALB, Kubernetes Ingress controllers) for high availability and performance.
Monitoring & Logging: Implement, configure, and utilize comprehensive monitoring and logging solutions (Prometheus, Grafana, ELK Stack) to ensure system health and proactively identify issues, including those relevant to AI/ML applications.
Automation & Scripting: Develop robust automation scripts and tools using Python, Bash, Go, or similar languages to streamline operations and enhance efficiency.
Incident Response: Participate actively in on-call rotations, responding to and resolving critical incidents with hands-on troubleshooting.
Documentation: Create and maintain technical documentation, architecture diagrams, and runbooks for infrastructure components and processes.
Working Scrum Master & Agile Facilitation
Agile Facilitation: Facilitate all Scrum ceremonies (Sprint Planning, Daily Scrum, Sprint Review, Sprint Retrospective) for the DevOps/Infrastructure engineering team.
Technical Coaching: Coach the team on advanced engineering practices, self-organization, cross-functionality, and continuous improvement in the context of infrastructure development, including support for AI/ML initiatives.
Impediment Resolution: Proactively identify and resolve technical impediments and process bottlenecks within the team and across organizational boundaries, paying special attention to unique challenges posed by AI/ML infrastructure.
Backlog Refinement: Collaborate closely with stakeholders (e.g., product owners, technical leads) to ensure a well-defined and prioritized backlog for infrastructure work, technical debt, operational improvements, and AI/ML platform needs.
Process Improvement: Drive continuous improvement in the team's agile and DevOps practices, helping them adapt and optimize their workflow for maximum efficiency and quality.
Team Shielding: Protect the team from external distractions, allowing focused time for hands-on engineering work.
Required Skills and Experience
Hands-on DevOps & Infrastructure Engineering Expertise
Secret & Certificate Management: Proven hands-on experience with HashiCorp Vault (installation, configuration, policy management, integrations).
Database Administration: Strong hands-on experience with at least two of PostgreSQL, Oracle, or MongoDB (installation, tuning, replication, backup/restore).
Workflow Orchestration: Hands-on experience deploying, managing, and developing DAGs for Apache Airflow.
Messaging Systems: Solid hands-on experience with Kafka and/or IBM MQ (cluster setup, topic management, producer/consumer configuration).
Container Orchestration: In-depth hands-on experience with Kubernetes and Helm, including YAML configuration, troubleshooting Pods/Jobs/Deployments, and integrations with secrets management (CyberArk, HashiCorp Vault).
Storage Management: Practical experience with Kubernetes PVCs, Persistent Volumes, S3, and/or enterprise NAS solutions (e.g., SONiC NAS).
Monitoring & Logging: Strong hands-on experience with Prometheus, Grafana, and the ELK Stack (setup, dashboard creation, query optimization, alert configuration).
Scripting & Automation: High proficiency in Python, Bash, or Go for automation, tooling development, and system administration.
Cloud Platforms: Extensive hands-on experience with at least one major cloud provider (AWS, Azure, GCP).
Infrastructure as Code (IaC): Proficiency with IaC tools such as Terraform or Ansible.
CI/CD: Experience designing, implementing, and maintaining CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions).
API Integration: Experience with RESTful API and SOAP web services.
Build Tools: Proficiency with Gradle for build automation.
AI/ML Awareness & Support
AI/ML Infrastructure Concepts: Understanding of the specific infrastructure requirements for deploying, managing, and scaling Artificial Intelligence and Machine Learning workloads (e.g., GPU resources, specialized storage, MLOps pipelines).
Data for AI/ML: Awareness of data management strategies and data governance principles relevant to AI/ML models and training datasets.
Monitoring AI/ML Systems: Familiarity with metrics and monitoring approaches for the performance and health of AI/ML applications and their underlying infrastructure.
Agile & Leadership Skills
Working Scrum Master Experience: Proven experience acting as a Scrum Master within a technical team where you also performed significant hands-on engineering.
Agile & Scrum Mastery: In-depth knowledge and practical application of Agile principles and the Scrum framework.
Facilitation & Coaching: Excellent facilitation, coaching, and mentoring skills within a technical context.
Communication: Strong verbal and written communication skills, able to bridge technical and process discussions.
Technical Leadership: Ability to guide technical discussions, influence architectural decisions, and drive best practices.
Preferred Qualifications
Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
Certified ScrumMaster (CSM) or Professional Scrum Master (PSM) certification.
Relevant cloud certifications (e.g., AWS Certified DevOps Engineer, Azure DevOps Engineer Expert, GCP Professional Cloud DevOps Engineer).
Experience with site reliability engineering (SRE) principles and practices.
Familiarity with Agile scaling frameworks (e.g., SAFe, LeSS).
Exposure to MLOps platforms or tools (e.g., Kubeflow, MLflow).
------------------------------------------------------
Job Family Group: Technology
Job Family: Applications Development
Time Type: Full time
Primary Location Full Time Salary Range: $120,800.00 - $170,800.00
Most Relevant Skills: Please see the requirements listed above.
Other Relevant Skills: For complementary skills, please see above and/or contact the recruiter.
------------------------------------------------------
Automated Processing and AI
We use automated processing, including artificial intelligence, for our legitimate business interests (or our reasonable and appropriate business purposes) to identify and align the candidate's skills and abilities with a specific job opening. Additionally, if you so choose, or consent, we can match your skills and abilities to other suitable roles at Citi.
Importantly, all our hiring processes and decisions, including determining your suitability for a role, are conducted, checked, and decided by individuals. Our automated processing and AI do not involve relying on automatic or autonomous decision-making. Please refer to any Jurisdictional Considerations, with specific provisions for your country (where relevant) for further details.
------------------------------------------------------
Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.
If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.
View Citi’s EEO Policy Statement and the Know Your Rights poster.