We consider the treatment allocation problem via a multi-armed bandit approach. In contrast to classical treatment allocation problems, where a data set of some size n is assumed to be given and used to estimate the treatment effects of a medical drug or an economic policy, we take the stance that the data arrives gradually. The policy maker thus faces a tradeoff between exploring which (out of potentially many) treatments is best and exploiting the knowledge gathered so far, so that few suboptimal treatments are assigned. In practice, the policy maker observes the characteristics (covariates) of an individual prior to assigning a treatment, and we show how to take these into account. Furthermore, data may arrive in batches or with delay, and we show how to adjust our treatment policies to these settings. Finally, we show how information from previous studies can be taken into account when assigning a treatment. For each setting, we prove upper bounds on the regret of our policy relative to the infeasible oracle policy that knows the best treatment in advance and always assigns it. The rates entering these upper bounds are minimax optimal, and we also prove upper bounds on the number of times the proposed policy assigns suboptimal treatments. The latter is important for ethical reasons, as policies that assign too many bad treatments are not practically viable.
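
As a point of reference for the exploration-exploitation tradeoff and the regret guarantees described above, a generic stochastic-bandit formulation (our illustrative notation, not necessarily that of the paper) writes $\mu_i$ for the expected outcome of treatment $i$, $\mu^\ast = \max_i \mu_i$ for the expected outcome of the best treatment, and $\pi_t$ for the treatment assigned in round $t$, so that the regret after $T$ rounds is
% Generic regret definition for illustration only; the paper's exact
% definition (e.g. conditioning on covariates, batches, or delays) may differ.
\begin{equation*}
  R_T \;=\; \sum_{t=1}^{T} \bigl(\mu^\ast - \mu_{\pi_t}\bigr),
\end{equation*}
i.e. the cumulative expected shortfall from always assigning the best treatment, which is exactly the quantity an oracle knowing the best treatment in advance would drive to zero.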