On-line Learning with Limited Feedback
COLT 09 Workshop, June 18, Montreal, Canada
Overview | Goals | Call for Participation | Important Dates | Structure of the Workshop | Program | Invited Speakers | Organizing Committee | References | Sponsors
News
- Tentative program available
- Acceptance notification postponed to May 22
- DEADLINE EXTENSION: May 2
- The workshop is sponsored by PASCAL2 Network and the Alberta Ingenuity Center for Machine Learning
- Important Dates available
- Delayed feedback. The utility of an action (i.e., the prediction) is returned only after a certain amount of time. This is the case of reinforcement learning and on-line control problems where the final outcome of an action may be available only when a goal is finally achieved.
- Partial feedback. The feedback is limited to that on the learner's prediction so that no information is available on what would other possible predictions bring. Multi-armed bandits, when only the utility of the pulled arm is returned to the learner, is the classic example for this.
- Indirect feedback. Neither the true outcome, nor the utility of the prediction is observed. Only an indirect feedback loosely related to the prediction is returned.
Although some aspects of on-line learning with limited feedback have
been already thoroughly analyzed (e.g., multi-armed bandit problems [2-4]), many
problems are still open. For instance, bandits with large action spaces and side
information, learning with delayed reward, on-line optimization, etc., are
of primary concern in many recent works on on-line learning.
Furthermore, on-line learning with limited feedback has strong connections with a
number of other fields of Machine Learning such as active learning,
semi-supervised learning, and multi-class classification.
The goal of the workshop is to provide researchers with the possibility to present their current research on these topics and to encourage the discussion about the main open issues and the possible connections between the different sub-fields.
In particular, we expect the
workshop to shed light on a number of theoretical issues, such as:
- how does the performance of learning algorithms scale in either large (e.g., infinity number of arms, either numerable or continuum, or in metric or measurable spaces) or changing action spaces?
- how does the performance of learning algorithms scale depending on the smoothness of the function to be optimized (Lipschitz, linear, convex, non convex)?
- what are the connections between the MDP reinforcement learning paradigm and the on-line learning problem with delayed feedback?
- how to define complexity measures for on-line learning with limited feedback?
- is it possible to define a unified view on the problem of learning with delayed, partial, and indirect feedback?
The organizing committee would like to invite the submission of extended abstracts (three to four pages in the conference format plus appendix if needed) describing research on (but not restricted to) the following topics:
- adversarial/stochastic bandits
- bandits with side information (contextual bandits, associative RL)
- bandits with large and/or changing action spaces
- on-line learning with delayed feedback
- on-line learning in MDPs and beyond
- partial monitoring prediction
- on-line optimization (Lipschitz, linear, convex, non-convex)
- on-line learning in games
- applications
Submissions should be sent via email to Alessandro Lazaric at alessandro.lazaric@inria.fr and should be in Postscript, or PDF format.
**EXTENSIONS** Deadline for submission: 2nd May
Notification of acceptance: 15th May
Workshop: 18th June
We plan to have four invited talks (40 minutes each). Among the accepted papers,
six to eight contributed talks (20 minutes each), a poster session, and a final
panel discussion (30 minutes). If the papers are of sufficient quantity and
quality, we will seek to publish them as an edited book or journal special
issue.
| Session 1 | |
| 9.00-9.10 | Welcome [talk] |
| 9.10-9.50 | Invited talk: Strategies for prediction under imperfect monitoring [talk] Gabor Lugosi Abstract: We propose simple randomized strategies for sequential decision (or prediction) under imperfect monitoring, that is, when the decision maker (forecaster) does not have access to the past outcomes but rather to a feedback signal. The proposed strategies are consistent in the sense that they achieve, asymptotically, the best-possible average reward among all fixed actions. It was Rustichini who first proved the existence of such consistent predictors. The forecasters presented in this talk offer the first constructive proof of consistency. Moreover, the proposed algorithms are computationally efficient. We also establish upper bounds for the rates of convergence. In the case of deterministic feedback signals, these rates are optimal up to logarithmic terms. (Joint work with Shie Mannor and Gilles Stoltz) |
| 9.50-10.30 | Invited talk: Trading regret rate for computational efficiency in online learning
with limited feedback [talk] Shai Shalev-Shwartz Abstract: We study low regret algorithms for online learning with limited feedback, where there is an additional constraint on the computational power of the learner. Focusing on multi-armed bandit with side information, we demonstrate cases in which there is a tradeoff between the regret rate and the computational efficiency of the online learning algorithm. In particular, for the class of linear hypotheses we show that the EXP4 prediction strategy achieves the optimal regret but is not efficient. In contrast, we propose much more efficient strategies, still with a vanishing regret, but a worse regret rate. |
| 10.30-11.00 | Break |
| Session 2 | |
| 11.00-11.30 | Piecewise-Stationary Bandit Problems with Side Information [pdf] [talk] J. Yu, S. Mannor |
| 11.30-12.00 | Multi-Armed Bandits with Betting [pdf] [talk] A. Niculescu-Mizil |
| 12.00-12.30 | Forced-Exploration Based Algortihms for Playing in Stochastic Linear Bandits [pdf] [talk] Y. Abbasi-Yadkori, A. Antos, C. Szepesvari |
| 12.30-14.00 | Lunch |
| Session 3 | |
| 14.00-14.30 | The Offset Tree for Learning with Partial Labels [pdf] [talk] A. Beygelzimer, J. Langford |
| 14.30-15.00 | One-pass approximate k-means optimization [pdf] [talk] N. Ailon, R. Jaiswal, C. Monteleoni |
| 15.00-15.30 | Monte-Carlo Sampling for Regret Minimization in Extensive Games [pdf] [talk] M. Lanctot, K. Waugh, M. Bowling |
| 15.30-16.00 | Break |
| Session 4 | |
| 16.00-16.40 | Invited talk: Strategic Considerations in Bandit Settings: The Price of Truthfulness in Online Ad Auctions [talk] Sham Kakade Abstract: Research at the intersection of machine learning and economics has flourished in recent years due to the realization that many technological systems (such as the internet) are better understood and managed when they are viewed as economic systems rather than just merely technological ones. In particular, there is much recent work on understanding learning algorithms and models in settings where the strategic considerations of the participants must be taken into account (e.g. spam detection). This talk will examine the this broader issue in the setting of online ad-auctions (in particular ``pay-per-click" auctions, where advertisers are charged only those rounds when their ad is clicked on). Designing such an auction faces the classic explore/exploit dilemma: while gathering information about the ``click-through-rates'' of advertisers, the mechanism may loose revenue; however, this gleaned information may prove valuable in the future for a more profitable allocation. In this sense, such mechanisms are prime candidates to be designed using multi-armed bandit techniques. However, a naive application of multi-armed bandit algorithms would not take into account the strategic considerations of the players --- players might manipulate their bids (which determine the auction's revenue) in a way as to maximize their own utility. Hence, we consider the natural restriction that the auction be truthful. This work sharply characterizes what regret is achievable, under this truthful restriction. Interestingly, we show that this restriction imposes statistical limits on the achievable regret --- that it is Theta(T^2/3), while for traditional bandit algorithms (without this truthful restriction) the achievable regret is O(T^1/2) (where T is the number of rounds). We term the extra T^1/6 factor, the `price of truthfulness'. |
| 16.40-17.20 | Invited talk: Bandit Algorithms for Online Linear Optimization [talk] Nicolo' Cesa-Bianchi Abstract: In the online linear optimization problem a forecaster chooses, at each time instance, a vector x from a certain given subset S of the D-dimensional Euclidean space and suffers a time-dependent loss that is linear in x. The goal of the forecaster is to achieve that, in the long run, the accumulated loss is not much larger than that of the best possible vector in S. In this talk we consider the "bandit" setting of the linear optimization problem, in which the forecaster has only access to the losses of the chosen vectors. We survey some recent algorithms that solve this problem. For the special case in which S is a subset of the d-dimensional Boolean hypercube, we describe a new forecaster whose performance, for a variety of concrete choices of S, is better than all previously known bounds, and not improvable in general. We also point out that computationally efficient implementations for various interesting choices of S exist. Joint work with Gabor Lugosi (Barcelona). |
| 17.20-17.30 | Final discussion |
| Nicolo' Cesa-Bianchi | (Universita' degli Studi di Milano) |
| Sham Kakade | (Toyota Technological Institute) |
| Gabor Lugosi | (Pompeu Fabra University) |
| Shai Shalev-Shwartz | (Toyota Technological Institute) |
| Jean-Yves Audibert | (Certis-Universite' Paris Est-Ecole des Ponts ParisTech) |
| Peter Auer | (University of Leoben) |
| Alessandro Lazaric | (INRIA - Team SequeL) - (primary contact) |
| Remi Munos | (INRIA - Team SequeL) |
| Daniil Ryabko | (INRIA - Team SequeL) |
| Csaba Szepesvari | (University of Alberta) |
Sebastien Bubeck and Odalric Maillard (INRIA - Team SequeL) will also help in the organization of the workshop.
Please email us with any questions here.
[1] A. Banos, "On pseudo-games", Annals of Mathematical Statistics, 39:1932-1945, 1968
[2] P. Auer, "Using confidence bounds for exploitation-exploration trade-offs", Journal of Machine Learning Research, 3:397-422, 2002.
[3] P. Auer, N. Cesa-Bianchi, P. Fischer, "Finite-time analysis of the Multiarmed Bandit Problem", Machine Learning, 47:235-256, 2002.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, R. Schapire, "The nonstochastic multi-armed bandit problem", SIAM Journal on Computing, 32:48-77, 2002.
[5] N. Cesa-Bianchi, G. Lugosi, "Prediction, Learning, and Games", Cambridge University Press, 2006.
[6] V. Dani, T. P. Hayes, "The Price of Bandit Information for Online Optimization", Advances in Neural Information Processing Systems (NIPS'07), 2007.
[7] J. Langford, T. Zhang, "The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits", Advances in Neural Information Processing Systems (NIPS'07), 2007.
[8] S. M. Kakade, S. Shalev-Shwartz, A. Tewari, "Efficient bandit algorithms for online multiclass prediction", Proceedings of the 25th international conference on Machine learning (ICML'08), 2008.
[9] P. Auer, T. Jaksch, R. Ortner, "Near-optimal Regret Bounds for Reinforcement Learning", Advances in Neural Information Processing Systems (NIPS'08), 2008.
[10] S. Kakade, S. Shalev-Shwartz, "Mind the duality gap: Logarithmic regret algorithms for online optimization", Advances in Neural Information Processing Systems (NIPS'08), 2008.
[11] G. J. Gordon, A. Greenwald, C. Marks, "No-regret learning in convex games", Proceedings of the 25th international conference on Machine learning (ICML'08), 2008.
[12] S. Bubeck, R. Munos, G. Stoltz, C. Szepesvari, "Online Optimization in X-Armed Bandits", Advances in Neural Information Processing Systems (NIPS'08), 2008.
[13] F. Radlinski, R. Kleinberg, T. Joachims, "Learning diverse rankings with multi-armed bandits", Proceedings of the 25th international conference on Machine learning (ICML'08), 2008.
[14] J. Abernethy, E. Hazan, A. Rakhlin, "Competing in the Dark: An efficient Algorithm for bandit linear optimization", Annual Conference on Learning Theory (COLT'08), 2008.
[15] P. L. Bartlett, V. Dani, T. P. Hayes, S. M. Kakade, A. Rakhlin, A. Tewari, "High probability regret bounds for bandit online linear optimization", Annual Conference on Learning Theory (COLT'08), 2008.
[16] P. Auer, N. Cesa-Bianchi, Z. Hussain, L. Newnham, J. Shawe-Taylor, NIPS'08 Workshop on "On-line Trading of Exploration and Exploitation", URL: http://www.homepages.ucl.ac.uk/~ucabzhu/OTEE.htm.
[17] PASCAL2 Network, "Partial or Delayed Feedback Thematic Programme", URL: http://pascallin2.ecs.soton.ac.uk/Thematic/TP13.


