Skip to content

Personal tools
You are here: Home On-line Learning with Limited Feedback
Document Actions

On-line Learning with Limited Feedback

Untitled


On-line Learning with Limited Feedback

COLT 09 Workshop, June 18, Montreal, Canada


News

Overview

The main focus of the workshop is the problem of on-line learning when only limited feedback is available to the learner. In on-line learning, at each time step the learner has to predict the outcome corresponding to the next input based on the feedbacks obtained so far. Unlike the usual supervised problem, in which after each prediction the learner is revealed sufficient information to evaluate the goodness of all predictions he could have made, in many cases only limited feedback may be available to the learner. Depending on the nature of the limitation on the feedback, different classes of problems can be identified:
  1. Delayed feedback. The utility of an action (i.e., the prediction) is returned only after a certain amount of time. This is the case of reinforcement learning and on-line control problems where the final outcome of an action may be available only when a goal is finally achieved.

  2. Partial feedback. The feedback is limited to that on the learner's prediction so that no information is available on what would other possible predictions bring. Multi-armed bandits, when only the utility of the pulled arm is returned to the learner, is the classic example for this.

  3. Indirect feedback. Neither the true outcome, nor the utility of the prediction is observed. Only an indirect feedback loosely related to the prediction is returned.

The increasing interest in on-line learning with limited feedback is also motivated by a number of applications, such as recommender systems, web advertisement systems, in which the user's feedback is limited to accepting/ignoring the proposed item, and the true label (i.e., the item the user would prefer the most) is never revealed to the learner.

Goals

Although some aspects of on-line learning with limited feedback have been already thoroughly analyzed (e.g., multi-armed bandit problems [2-4]), many problems are still open. For instance, bandits with large action spaces and side information, learning with delayed reward, on-line optimization, etc., are of primary concern in many recent works on on-line learning. Furthermore, on-line learning with limited feedback has strong connections with a number of other fields of Machine Learning such as active learning, semi-supervised learning, and multi-class classification.
The goal of the workshop is to provide researchers with the possibility to present their current research on these topics and to encourage the discussion about the main open issues and the possible connections between the different sub-fields.
In particular, we expect the workshop to shed light on a number of theoretical issues, such as:

  • how does the performance of learning algorithms scale in either large (e.g., infinity number of arms, either numerable or continuum, or in metric or measurable spaces) or changing action spaces?
  • how does the performance of learning algorithms scale depending on the smoothness of the function to be optimized (Lipschitz, linear, convex, non convex)?
  • what are the connections between the MDP reinforcement learning paradigm and the on-line learning problem with delayed feedback?
  • how to define complexity measures for on-line learning with limited feedback?
  • is it possible to define a unified view on the problem of learning with delayed, partial, and indirect feedback?

Call for Participation

The organizing committee would like to invite the submission of extended abstracts (three to four pages in the conference format plus appendix if needed) describing research on (but not restricted to) the following topics:

  • adversarial/stochastic bandits
  • bandits with side information (contextual bandits, associative RL)
  • bandits with large and/or changing action spaces
  • on-line learning with delayed feedback
  • on-line learning in MDPs and beyond
  • partial monitoring prediction
  • on-line optimization (Lipschitz, linear, convex, non-convex)
  • on-line learning in games
  • applications
We also welcome work-in-progress contributions, as well as papers discussing potential research directions.

Submissions should be sent via email to Alessandro Lazaric at alessandro.lazaric@inria.fr and should be in Postscript, or PDF format.


Important Dates

**EXTENSIONS** Deadline for submission: 2nd May
Notification of acceptance: 15th May
Workshop: 18th June

Structure of the Workshop

We plan to have four invited talks (40 minutes each). Among the accepted papers, six to eight contributed talks (20 minutes each), a poster session, and a final panel discussion (30 minutes). If the papers are of sufficient quantity and quality, we will seek to publish them as an edited book or journal special issue.

Program

Session 1
9.00-9.10 Welcome [talk]
9.10-9.50 Invited talk: Strategies for prediction under imperfect monitoring [talk]
Gabor Lugosi
Abstract: We propose simple randomized strategies for sequential decision (or prediction) under imperfect monitoring, that is, when the decision maker (forecaster) does not have access to the past outcomes but rather to a feedback signal. The proposed strategies are consistent in the sense that they achieve, asymptotically, the best-possible average reward among all fixed actions. It was Rustichini who first proved the existence of such consistent predictors. The forecasters presented in this talk offer the first constructive proof of consistency. Moreover, the proposed algorithms are computationally efficient. We also establish upper bounds for the rates of convergence. In the case of deterministic feedback signals, these rates are optimal up to logarithmic terms. (Joint work with Shie Mannor and Gilles Stoltz)
9.50-10.30 Invited talk: Trading regret rate for computational efficiency in online learning with limited feedback [talk]
Shai Shalev-Shwartz
Abstract: We study low regret algorithms for online learning with limited feedback, where there is an additional constraint on the computational power of the learner. Focusing on multi-armed bandit with side information, we demonstrate cases in which there is a tradeoff between the regret rate and the computational efficiency of the online learning algorithm. In particular, for the class of linear hypotheses we show that the EXP4 prediction strategy achieves the optimal regret but is not efficient. In contrast, we propose much more efficient strategies, still with a vanishing regret, but a worse regret rate.
10.30-11.00 Break
Session 2
11.00-11.30 Piecewise-Stationary Bandit Problems with Side Information [pdf] [talk]
J. Yu, S. Mannor
11.30-12.00 Multi-Armed Bandits with Betting [pdf] [talk]
A. Niculescu-Mizil
12.00-12.30 Forced-Exploration Based Algortihms for Playing in Stochastic Linear Bandits [pdf] [talk]
Y. Abbasi-Yadkori, A. Antos, C. Szepesvari
12.30-14.00 Lunch
Session 3
14.00-14.30 The Offset Tree for Learning with Partial Labels [pdf] [talk]
A. Beygelzimer, J. Langford
14.30-15.00 One-pass approximate k-means optimization [pdf] [talk]
N. Ailon, R. Jaiswal, C. Monteleoni
15.00-15.30 Monte-Carlo Sampling for Regret Minimization in Extensive Games [pdf] [talk]
M. Lanctot, K. Waugh, M. Bowling
15.30-16.00 Break
Session 4
16.00-16.40 Invited talk: Strategic Considerations in Bandit Settings: The Price of Truthfulness in Online Ad Auctions [talk]
Sham Kakade
Abstract: Research at the intersection of machine learning and economics has flourished in recent years due to the realization that many technological systems (such as the internet) are better understood and managed when they are viewed as economic systems rather than just merely technological ones. In particular, there is much recent work on understanding learning algorithms and models in settings where the strategic considerations of the participants must be taken into account (e.g. spam detection).
This talk will examine the this broader issue in the setting of online ad-auctions (in particular ``pay-per-click" auctions, where advertisers are charged only those rounds when their ad is clicked on). Designing such an auction faces the classic explore/exploit dilemma: while gathering information about the ``click-through-rates'' of advertisers, the mechanism may loose revenue; however, this gleaned information may prove valuable in the future for a more profitable allocation. In this sense, such mechanisms are prime candidates to be designed using multi-armed bandit techniques. However, a naive application of multi-armed bandit algorithms would not take into account the strategic considerations of the players --- players might manipulate their bids (which determine the auction's revenue) in a way as to maximize their own utility. Hence, we consider the natural restriction that the auction be truthful.
This work sharply characterizes what regret is achievable, under this truthful restriction. Interestingly, we show that this restriction imposes statistical limits on the achievable regret --- that it is Theta(T^2/3), while for traditional bandit algorithms (without this truthful restriction) the achievable regret is O(T^1/2) (where T is the number of rounds). We term the extra T^1/6 factor, the `price of truthfulness'.
16.40-17.20 Invited talk: Bandit Algorithms for Online Linear Optimization [talk]
Nicolo' Cesa-Bianchi
Abstract: In the online linear optimization problem a forecaster chooses, at each time instance, a vector x from a certain given subset S of the D-dimensional Euclidean space and suffers a time-dependent loss that is linear in x. The goal of the forecaster is to achieve that, in the long run, the accumulated loss is not much larger than that of the best possible vector in S. In this talk we consider the "bandit" setting of the linear optimization problem, in which the forecaster has only access to the losses of the chosen vectors. We survey some recent algorithms that solve this problem. For the special case in which S is a subset of the d-dimensional Boolean hypercube, we describe a new forecaster whose performance, for a variety of concrete choices of S, is better than all previously known bounds, and not improvable in general. We also point out that computationally efficient implementations for various interesting choices of S exist. Joint work with Gabor Lugosi (Barcelona).
17.20-17.30 Final discussion

Invited Speakers

Nicolo' Cesa-Bianchi (Universita' degli Studi di Milano)
Sham Kakade (Toyota Technological Institute)
Gabor Lugosi (Pompeu Fabra University)
Shai Shalev-Shwartz (Toyota Technological Institute)


Organizing Committee

Jean-Yves Audibert (Certis-Universite' Paris Est-Ecole des Ponts ParisTech)
Peter Auer (University of Leoben)
Alessandro Lazaric (INRIA - Team SequeL) - (primary contact)
Remi Munos (INRIA - Team SequeL)
Daniil Ryabko (INRIA - Team SequeL)
Csaba Szepesvari (University of Alberta)

Sebastien Bubeck and Odalric Maillard (INRIA - Team SequeL) will also help in the organization of the workshop.

Please email us with any questions here.

References

[1] A. Banos, "On pseudo-games", Annals of Mathematical Statistics, 39:1932-1945, 1968
[2] P. Auer, "Using confidence bounds for exploitation-exploration trade-offs", Journal of Machine Learning Research, 3:397-422, 2002.
[3] P. Auer, N. Cesa-Bianchi, P. Fischer, "Finite-time analysis of the Multiarmed Bandit Problem", Machine Learning, 47:235-256, 2002.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, R. Schapire, "The nonstochastic multi-armed bandit problem", SIAM Journal on Computing, 32:48-77, 2002.
[5] N. Cesa-Bianchi, G. Lugosi, "Prediction, Learning, and Games", Cambridge University Press, 2006.
[6] V. Dani, T. P. Hayes, "The Price of Bandit Information for Online Optimization", Advances in Neural Information Processing Systems (NIPS'07), 2007.
[7] J. Langford, T. Zhang, "The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits", Advances in Neural Information Processing Systems (NIPS'07), 2007.
[8] S. M. Kakade, S. Shalev-Shwartz, A. Tewari, "Efficient bandit algorithms for online multiclass prediction", Proceedings of the 25th international conference on Machine learning (ICML'08), 2008.
[9] P. Auer, T. Jaksch, R. Ortner, "Near-optimal Regret Bounds for Reinforcement Learning", Advances in Neural Information Processing Systems (NIPS'08), 2008.
[10] S. Kakade, S. Shalev-Shwartz, "Mind the duality gap: Logarithmic regret algorithms for online optimization", Advances in Neural Information Processing Systems (NIPS'08), 2008.
[11] G. J. Gordon, A. Greenwald, C. Marks, "No-regret learning in convex games", Proceedings of the 25th international conference on Machine learning (ICML'08), 2008.
[12] S. Bubeck, R. Munos, G. Stoltz, C. Szepesvari, "Online Optimization in X-Armed Bandits", Advances in Neural Information Processing Systems (NIPS'08), 2008.
[13] F. Radlinski, R. Kleinberg, T. Joachims, "Learning diverse rankings with multi-armed bandits", Proceedings of the 25th international conference on Machine learning (ICML'08), 2008.
[14] J. Abernethy, E. Hazan, A. Rakhlin, "Competing in the Dark: An efficient Algorithm for bandit linear optimization", Annual Conference on Learning Theory (COLT'08), 2008.
[15] P. L. Bartlett, V. Dani, T. P. Hayes, S. M. Kakade, A. Rakhlin, A. Tewari, "High probability regret bounds for bandit online linear optimization", Annual Conference on Learning Theory (COLT'08), 2008.
[16] P. Auer, N. Cesa-Bianchi, Z. Hussain, L. Newnham, J. Shawe-Taylor, NIPS'08 Workshop on "On-line Trading of Exploration and Exploitation", URL: http://www.homepages.ucl.ac.uk/~ucabzhu/OTEE.htm.
[17] PASCAL2 Network, "Partial or Delayed Feedback Thematic Programme", URL: http://pascallin2.ecs.soton.ac.uk/Thematic/TP13.


Sponsors

« February 2010 »
Su Mo Tu We Th Fr Sa
123456
78910111213
14151617181920
21222324252627
28