As a Staff Infrastructure Engineer on the Infrastructure Reliability team, you will be a critical part of our efforts to ensure the scalability, availability, and performance of our global network serving millions of players across the world.
This role demands deep network architectural knowledge in the Service Provider space and strong experience operating global-scale infrastructure. You should have strong coding skills, a passion for automation, and a focus on reliability engineering to deliver robust and maintainable systems.
You will work on network design, traffic analysis and engineering, CI/CD pipeline maintenance, and tooling that enhances observability and streamlines troubleshooting for core infrastructure services.
Your role will include:
- Designing, deploying, and operating the global network: Plan, build, and maintain both new and existing infrastructure to deliver the best possible experience for our players and internal customers.
- Coding and automation: Write clean, efficient, and reusable code to automate operational tasks, improve system reliability, and enable rapid scaling.
- Developing customer-centric tooling: Build tools to simplify and streamline the consumption of cloud resources for internal teams, empowering them to innovate faster.
- Observability and troubleshooting: Enhance monitoring and logging systems to quickly detect, debug, and resolve issues across our infrastructure.
- Mentorship and continuous learning: Guide and mentor junior and senior engineers in systems, cloud, and network engineering, fostering a culture of growth and continuous learning.
- Time zone collaboration: Partner closely with engineers across various time zones to maximize coverage, responsiveness, and global reach.
Responsibilities:
- Solve complex challenges independently, diagnosing and resolving production issues across globally distributed systems.
- Advance our monitoring and observability platforms, driving innovation that keeps our infrastructure visible, actionable, and resilient.
- Troubleshoot live incidents (on-call rotation) and design resilient solutions to maintain uptime and meet SLAs, continually evolving our infrastructure to improve reliability and adaptability.
- Expand and optimize our network footprint, enhancing the scalability, reliability, and efficiency of our network.
- Elevate your team by sharing knowledge, mentoring peers, and fostering a culture of continuous learning and growth.
Required Qualifications:
- 5+ years of experience as a senior contributor at a service provider, focused on design and operations for large-scale global networks.
- Expertise in protocols such as BGP, IS-IS, label signalling (RSVP-TE, Segment Routing, LDP), MPLS VPNs (both layer 2 and layer 3), and multicast signalling.
- Experienced in operating large-scale web services with strong expertise in OSI layer 4-7 technologies and global load balancing strategies.
- QoS experience across multiple vendor hardware implementations.
- Troubleshooting and Incident Response: Skilled at troubleshooting live incidents, with a proactive approach to minimizing downtime and service impact.
- Familiarity with Root Cause Analysis (RCA) processes to identify, document, and drive long-term solutions to recurring issues.
- Automation and Scripting: Proficiency in scripting and programming languages like Python and Golang to drive automation, manage deployments, and create tooling.
- Cloud Connectivity: Expertise in AWS connectivity solutions and foundational services (e.g., S3, EC2, EBS). Experience in container management and orchestration with Docker and Kubernetes.
- Adaptability: Ability to quickly adopt and adapt to new technologies, frameworks, and cloud-native tools to solve complex problems.
- Team Leadership: Proven experience in guiding delivery goals across teams, advocating for best practices, and driving alignment on cross-team projects and initiatives.
- Excellent Communication Skills: Demonstrates clear, concise, and proactive communication, ensuring effective collaboration, timely information-sharing, and alignment across diverse teams and stakeholders.
It's our policy to provide equal employment opportunity for all applicants and employees of Bee Talent Solutions. The Company makes reasonable accommodations for handicapped and disabled employees and does not unlawfully discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity or expression, national origin, age, handicap, veteran status, marital status, criminal history, or any other category protected by applicable federal and state law. We consider for employment all qualified applicants, including those with criminal histories, in a manner consistent with applicable federal, state and local law, including, but not limited to, the California Fair Chance Act, the City of Los Angeles Fair Chance Initiative for Hiring Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, the San Francisco Fair Chance Ordinance, and the Washington Fair Chance Act.
Per the Los Angeles County Fair Chance Ordinance, the following core duties may create a basis for disqualifying candidates with relevant criminal histories:
- Safeguarding confidential and sensitive data while employed by us and while on assignment at a customer of ours
- Communication with others, including employees and third parties such as vendors, customers (including their employees), and/or players, including minors
- Accessing our or our customer's assets, secure digital systems, and networks
- Ensuring a safe interactive environment for players, employees, and temporary workers
These duties are directly related to essential operations, safety, trust, and compliance obligations within our organization and within the organization of any customer to whom you may be assigned while employed by us. Please note that job duties may evolve based on business needs and additional responsibilities may be assigned as necessary to maintain operational efficiency and security.