Site Reliability Engineer

August 8, 2022

Type of engagement

This is an external staff position. You will have a temporary employment contract with E-Search, in service of Microsoft.

E-Search helps Microsoft find the best externally employed candidates (Serbian citizens). We are looking for a Site Reliability Engineer (SRE) to join the Microsoft Azure Data SQL Team and help make the Azure Data SQL product portfolio the most reliable across all cloud providers.

Job description

Our client Microsoft has been a leading company in computing for decades. They are a global operation, relied on by governments, utilities, schools, and co-operatives to deliver the things they need to work, every day. To make this work for their customers, they need a continual effort to keep that delivery reliable. To drive reliability, they need you: someone who already is, or is interested in becoming, an SRE.

SREs are people who take engineering-based approaches to solving operations problems: they like infrastructure, they like seeing how big, complicated things work, and most importantly, they gain great satisfaction from making them better. SREs build, monitor, and maintain the systems and infrastructure that ensure our customers can quickly access their data and run workloads whenever and wherever they need to. SREs identify service problems and areas for improvement, and they follow up by fixing those problems.

Do you love to be in the operational thick of things? Do you have experience with DevOps and Live Site, a keen eye for detail, and a drive to deliver 99.999% availability? The Azure Data SQL Team is looking for an SRE to create and administer live site infrastructure. This role will work on our monitoring and alerting infrastructure and live site tools to support an excellent live site practice.

We would like to talk to you if you:

Are interested in distributed systems and working with high-scale services.
Like to work in a fast-moving environment and aren't afraid to change things to make them better.
Enjoy new technological challenges and solving hard problems.

Your responsibilities will include some or all of the below:

Technical Knowledge and Domain-Specific Expertise

Develops a foundational understanding of distributed systems design, interactions between cloud technology layers and components, basic dependencies at scale, and the code that defines infrastructures. Can contribute to the code base that defines components or features of systems or cloud technologies to improve the reliability and operability of supported products, with direction from other engineers.
Develops an understanding of the code, features, and operations of specific products at scale as required to contribute to incremental improvements in product availability, reliability, efficiency, observability, and/or performance; participates in on-boarding, code/design reviews, and regular meetings with the engineering teams that develop and/or manage those products.

Contributions to Development and Design

Develops and tests changes to optimize code and improve the observability, reliability and operability of a defined range of platform, system, or product components or features with direction from other engineers.
Supports ongoing engagements with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations, and incident responses throughout product development and operations cycles; draws insights from engagements with product engineering teams and basic analyses of telemetry data to propose potential improvements to code and designs for a defined set of product components or features with guidance from other engineers.

Driving Operational Excellence

Implements configuration and data changes across a predefined range of product components or features with guidance from other engineers to develop an understanding of how configurations, binaries, and data can be managed using code, tooling, and automation.
Develops an understanding of how to manage changes safely and reliably in production by using existing tools and automation to enable product engineering teams to implement changes across a defined range of components or features, with direction from other engineers.
Uses existing tools to troubleshoot problems or flaws affecting the availability, reliability, performance, and/or efficiency of components or features with guidance from other engineers. Suggests potential solutions to resolve and prevent recurring issues and brings them to the attention of other engineers or team leads.
Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting basic issues, and deploying appropriate fixes to resolve root cause(s); alerts product teams or owners to major customer-impacting issues and escalates the resolution of complex issues and/or those affecting multiple components or features to other engineers as needed. Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings.
Develops an understanding of key learnings, insights, and best practices that can be applied to improve system, platform, and/or product development and operations by participating in code/design reviews, incident drills and debriefs, and regular meetings, as well as interactions with more experienced Site Reliability Engineers (SREs) and members of product engineering teams.

Qualifications

Familiarity with scripting languages
Working knowledge of one or more general-purpose programming languages, including but not limited to C#, JavaScript, and PowerShell
Working knowledge of one or more query languages, including but not limited to T-SQL
Strong verbal and written communication skills with excellent interpersonal communication and collaboration skills
Degree in Computer Science, System Administration, Networking, Mathematics, or Engineering in general, or an equivalent industry internship or industry engineering experience.

Preferred Qualifications:

2+ years of technical experience in software engineering, network engineering, or systems administration, or
2+ years of experience in a relevant SRE area, cloud operations, or microservice architecture.
Deep knowledge of industry trends as well as advances in large-scale distributed systems and cloud technologies
Passion for data-driven decision making
Experience in a cloud stack and leveraging cloud architecture, applying site reliability principles and/or demonstrating sensitivity to operational concerns.
Demonstrated ability to debug, fix, and optimize code.
Excellent written and verbal communication skills a plus.
