convex Q-learning