DevOps Automation Engineer (MLOps)
We are looking for a passionate DevOps engineer with experience in Machine Learning Operations (MLOps) to help us discover and implement state-of-the-art automation processes and build workflows for our artificial intelligence platform. This role is critical to our team at Area99 because it enables rapid prototyping, increases the accuracy of our deep learning models and enables continuous training of our online models.
As an engineer on our MLOps team, you are a key stakeholder and wear several hats while developing our ML workflows on Azure, AWS and on-premises. Your ultimate goal is to enable Area99 to deliver world-class practices in data engineering and machine learning modelling.
- Maintain and expand our private cloud of nodes and GPUs, manage K8s deployments and maintain our Azure and AWS clouds across multiple regions
- Audit, review and maintain security standards of our cloud services, Docker containers, Kubeflow and K8s clusters
- Implement industry-leading continuous delivery patterns and work with our software architect to extend our automated build and deployment workflows, achieving successful continuous delivery solutions for Area99 customers.
- Contribute to the design and development of the Area99 hybrid cloud to bridge our private and public clouds
- Advocate industry best practices and drive process improvement to maximize the capacity and increase the velocity of the software team.
- Identify and evaluate new tools, practices and techniques to automate our processes and systems and move them towards an IaaS paradigm.
- Instrument machine learning monitoring tools and perform problem analysis, troubleshooting and service recovery tasks.
- Good understanding of machine learning concepts and familiarity with tools like Jupyter Notebook, Kubeflow and TensorFlow, with hands-on experience deploying machine learning models.
- Practical understanding of distributed systems, data stores, data modeling, indexing and associated trade-offs
- Strong experience in Shell and Python scripting and building ML workflows using Docker and Kubernetes.
- Strong system administration skills (Linux/Unix) and in-depth technical knowledge of containerization, server provisioning and monitoring techniques.
- Working technical knowledge of network and internet protocols and standards, including firewalls, core networking, VPN, load balancing and routing concepts, tunneling, remote access, port forwarding, Active Directory, HTTP, TCP, UDP, Telnet, etc.
- Experience with various virtualization technologies and multi-tenant, private and hybrid cloud environments.
At Area99 we believe that people are key to any successful AI transformation, which is why we select our team carefully. Your success in this job depends on your attention to detail, your technical ability and, most importantly, your personality and your ability to communicate effectively with your team and other stakeholders.