Reinforcement learning (RL) is an effective method for solving stochastic sequential decision-making problems. Such problems are common in supply chain operations; however, most RL algorithms are tailored to game-based benchmarks. Here, we propose a deep RL method tailored to supply chain problems. The proposed algorithm deploys a derivative-free approach to balance exploration and exploitation of the neural policy's parameter space, providing a means to avoid low-quality local optima. Furthermore, the method accommodates risk-sensitive formulations, learning a policy that optimizes, for example, the conditional value-at-risk (CVaR). The capabilities of our algorithm are tested on a multi-echelon supply chain problem and several combinatorial optimization problems. The results empirically demonstrate the method's improved sample efficiency relative to the benchmark algorithm, proximal policy optimization (PPO), and its superior performance compared to shrinking-horizon mixed-integer formulations. Additionally, its risk-sensitive policy can offer protection from low-probability, high-severity scenarios. Finally, we provide a sensitivity analysis to give technical intuition for the method.
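To make the two central ideas concrete, the following is a minimal Python sketch, not the paper's implementation: a derivative-free search samples candidate parameter vectors around an incumbent policy, scores each by the CVaR of its rollout returns rather than the mean, and retains the best candidate. The `evaluate` interface, the toy objective, and all hyperparameter names are illustrative assumptions.

```python
import numpy as np

def cvar(returns, alpha=0.1):
    """Conditional value-at-risk: mean of the worst alpha-fraction of returns."""
    k = max(1, int(np.ceil(alpha * len(returns))))
    worst = np.sort(returns)[:k]  # lowest returns = most severe outcomes
    return worst.mean()

def derivative_free_search(evaluate, theta0, sigma=0.1, pop=32, iters=50,
                           alpha=0.1, episodes=20, seed=0):
    """Random-perturbation search over policy parameters, scored by CVaR.

    `evaluate(theta)` is a hypothetical interface returning the total return
    of one stochastic rollout under the policy parameterized by `theta`.
    """
    rng = np.random.default_rng(seed)
    theta, best_score = np.asarray(theta0, dtype=float), -np.inf
    for _ in range(iters):
        # Explore: Gaussian perturbations of the incumbent parameter vector.
        candidates = theta + sigma * rng.standard_normal((pop, theta.size))
        for cand in candidates:
            returns = np.array([evaluate(cand) for _ in range(episodes)])
            score = cvar(returns, alpha)  # risk-sensitive objective
            if score > best_score:        # exploit: keep the best candidate
                best_score, theta = score, cand
    return theta, best_score

if __name__ == "__main__":
    # Toy stand-in for a supply chain rollout: a noisy quadratic return.
    def evaluate(theta):
        return -np.sum((theta - 1.0) ** 2) + np.random.normal(0.0, 0.5)

    theta, score = derivative_free_search(evaluate, theta0=np.zeros(4))
    print("best parameters:", theta, "CVaR score:", score)
```

Scoring candidates by CVaR rather than the expected return is what shifts the learned policy toward protecting against the low-probability, high-severity scenarios mentioned above.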