AI Inference Engineer

Bosnia and Herzegovina, North Macedonia, Hungary, Croatia, Spain, Czech Republic, Serbia

Be part of the team creating the software foundation for next-generation AI compute platforms. In this role, you’ll work across the full stack — from low-level kernels and hardware-optimized operators to large-scale ML deployment frameworks — in close collaboration with compiler developers, ML scientists, and hardware specialists. This position offers the chance to contribute to state-of-the-art AI infrastructure, fine-tune software for custom hardware, and deepen your expertise in system software and machine learning. 

How You’ll Contribute:

  • Build and optimize inference pipelines for large-scale model serving (LLMs and beyond)  
  • Work with frameworks like PyTorch, TensorRT, and vLLM to deploy models efficiently  
  • Implement and optimize ML models using techniques such as quantization (INT8/FP8), kernel fusion, and efficient batching 
  • Optimize and implement core ML operators (e.g., GEMMs, convolutions, activations)
  • Investigate and resolve issues through system-level debugging and performance analysis 
  • Define and apply practices for testing, deployment, and scaling AI systems
     

Required skills: 

  • BSc/MSc in Computer Science, Engineering, Mathematics, or related discipline 
  • Strong programming skills in C/C++ or Python in Linux environments using common development tools 
  • Solid knowledge of computer architecture, system software, and data structures
  • Hands-on experience implementing algorithms in high-level languages (C/C++/Python) 
  • Exposure to specialized hardware (GPUs, FPGAs, DSPs, AI accelerators) and frameworks such as OpenCL or CUDA 
  • Experience designing or working with high-performance software systems 
  • Solid knowledge of ML fundamentals 
  • Motivated team player with a strong sense of responsibility

 

You are a great fit if you have experience in at least one of the following areas: 

  • Model serving frameworks (e.g., Triton Inference Server, DeepSpeed Inference, vLLM) 
  • ML runtimes (e.g., ONNX Runtime, TVM, IREE, XLA) 
  • Deploying ML workloads (LLMs, VLMs, NLP, etc.) across distributed systems 
  • Implementing and optimizing ML operators and kernels with a focus on vectorization and efficient execution (e.g., activation, pooling, quantization)
  • Hardware-aware optimizations and performance tuning 
  • 2+ years of experience developing software targeting AI hardware 


Contributions to open-source projects (e.g., LLVM/MLIR, PyTorch, TensorFlow, ONNX Runtime, xDSL, IREE) are a big plus.
