Bellevue downtown at night, from Wikipedia
One of the main highlights of ICML 2011 were the invited talks. Usually, I like one invited talk or so, but this time all of them were great. I've heard a number of people remark that the organizers did a great job of choosing the speakers. I should also note that the videos of all the talks, invited and not, will soon appear on TechTalks, and I recommend you watch the invited talks especially.
The first talk was by Chris Bishop, of the Microsoft Research (MSR) Cambridge, UK lab, who talked about applied machine learning coming of age. He addressed a variety of topics, perhaps a bit loosely tied together, but the real highlight of his talk was the first half-hour, where he addressed the technology behind Kinect. Apparently, much of the technology behind it was developed at MSR (this should be good for research funding at Microsoft, as Kinect was a huge commercial success), and he went into some depth about both the machine learning and hardware challenges in building Kinect. For example, machine learning was used for the body tracking system in Kinect, but with some interesting twists, e.g. 1) the body tracking problem isn't treated as a tracking problem at all: each individual frame is independently classified, 2) lots of the training data is synthetic. On the hardware end, I learned that Kinect could track you just fine in the dark. To find out more, you should watch the entire talk.
Kinect sensor, image from Wikipedia
The second invited talk was by a mathematical biologist, Martin Nowak, who talked about the evolutionary dynamics of competition and cooperation. He talked about all sorts of interesting things. What stuck with me the most were some results he presented on the repeated prisoner's dilemma. In the famous classic prisoner's dilemma, each of two prisoners must independently decide whether to cooperate or defect (see figure below for a possible payoff matrix). The dominant strategy for both prisoners is to defect, but if both prisoners cooperate they are better off than both defecting, hence the dilemma. In the real world, we often play variants of this game, but over and over again. The repeated prisoners dilemma setting captures this aspect of repeated play, and a while ago, Robert Axelrod held a famous repeated prisoners dilemma tournament where he showed tit-for-tat was a really good strategy: basically, to cooperate at first and then, on each future round, to do what your opponent did on the previous one. I have a lot more to say on this topic, but I'll skip to Martin's talk. Martin presented some interesting results where he showed us that if noise is introduced into this repeated game (like in real life), and one runs an evolutionary tournament that starts with random strategies, first tit-for-tat will evolve. But then, tit-for-tat will be taken over by a more forgiving versions tit-for-tat, which will subsequently be dominated by completely cooperative strategies, which will be exploited by completely defecting strategies, which will then be taken over by tit-for-tats again. And this cycle repeats! Martin pessimistically suggested this this cycle is inevitable in the real world as well. Of course, his talk had a lot more, so if you want to know the details of this and more, I recommend you watch it!
possible prisoner's dilemma payoff matrix, image from "beyond intractability"
Then, Hatmut Neven, of Google talked about the machine learning behind "Google Goggles," the reverse-image search where you take a picture of an object, send it to Google as a query, and Google will hopefully return some relevant results. I hadn't thought much about this reverse-search problem, so a lot of the talk was interesting to me on a basic level. For example, if you send a picture of a fancy chair, it's useless for Google to tell you it's a chair, you're probably looking to know what type of chair it is so you can buy it -- this makes the problem very hard. But it doesn't mean Hatmut and his team haven't made a scary amount of progress: Hatmut says that Google can, given a picture of an arbitrary person who has about a dozen photos of them in Google's databases (i.e. online), correctly classify them in the top ten results. Google hasn't activated this feature for privacy reasons, but if (or when?) Google does, this means that you could pretty much see someone on the street, take a picture of them, and find out who they are (and they don't even have to be a famous actress). Hatmut thinks that in 10 years, we'll be able to carry a camera with us, which will, without any special querying, know more about the world around us than we do -- it will know what species of trees are around us, the origin of the name of the street we are on, the history of the author of the book we're reading, etc. Exciting stuff. Finally, Hatmut talked about how quantum computing might help with all this, and he gave a much more positive endorsement of D-Wave than I'd read elsewhere.
D-Wave's 128-bit chip, image from Wikipedia
The last talk was by David Ferrucci about the IBM Watson system that beat the top two humans on Jeopardy (see my post on Watson). He talked about the immense challenge of building Watson, the collaboration among many different types of researchers, the scientific problems involved, and the various problems his team had to overcome. The task of building Watson was so monumental that many (non computer scientist) people couldn't get their heads around it. David described an incident when he went on a radio show to talk about Watson and was asked by the radio host "So, you're going to give the computer all the questions and answers, and when a question is asked, the computer will pull up the right answer?" David said, surely rolling his eyes, "No. We don't know any of the answers in advance, let alone the questions." Then the radio show host responded "I don't understand. How are you going to do it then?" "Exactly!" said David. I trust I'm not spoiling anything if I tell you machine learning was a big part of the solution.
Watson's avatar, image from Wikipedia
John Langford (who hosted my postdoc at Yahoo! Research) already blogged that neural networks were making a comeback, so to speak, at ICML 2011. In addition to the sheer number of "deep learning" papers, lots of people excitedly talked about what great things they could train their neural nets to do. I would add to John's post, after talking to Bill and Ilya, that another reason this area is becoming "hot" is that we finally have the computing power to train neural networks for real-world problems, something that was previously much more difficult.
I must admit that I don't know enough about neural networks, and the reason is that it's not easy to see how the fit into the larger world from a learning theory perspective. For example, we know that a single-layer neural network cannot learn even the XOR function, that recurrent neural nets can simulate Turing Machines, and that certain classes of neural nets are not properly PAC learnable. But from what I can see, the theory of neural nets is of a very different flavor than, say, the theory of decision trees. If someone who knows these connections better wrote up a survey, I'm sure I'm not the only one who would be grateful. (Even better, maybe such a survey exists and I'm not aware of it.)
Papers I Liked
Given that ICML has 4 or 5 parallel tracks, I didn't get to see even 20% of the talks, but I tried to catch what I could at the poster sessions. Here are a few non neural-networks (for the neural networks ones, see John's post) papers I liked, besides my own, of course.
"On the Necessity of Irrelevant Variables" by David Helmbold and Phil Long. The main gist of this paper is that irrelevant variables don't hurt a predictor nearly as much as relevant variables help. So if you're trying to do feature selection and aren't sure whether to add a feature or not, add it.
"Clustering Partially Observed Graphs via Covex Optimization" by Ali Jalali, Yudong Chen, Sujay Sanghavi, and Huan Xu. Here, they do correlation clustering with partial observations, which is NP-Hard, so instead of the usual approach of making an approximation algorithm, they give an algorithm which succeeds of the data satisfies certain properties. They're not the first to do this type of analysis, but I rather liked their type of data-dependent assumption.
"Optimal Distributed Online Prediction" by Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. This paper considers online prediction, but in a distributed model, where one can get speedups, possibly at some cost in the regret. They derive optimal regret bounds for convex losses and i.i.d. examples, and show that for many parameter settings, you can actually get the same asymptotic regret as without parallelizing.
"Doubly Robust Policy Evaluation and Learning" by Miroslav Dudik, John Langford, and Lihong Li. I'm a bit biased about this paper because I was at Yahoo! when they were woking on it. They basically show a technique for evaluating a new bandit policy on historical data, where you can get an unbiased evaluation as long as you have either a good model of rewards or a good model of the past policy. Hence, doubly robust.
ICML 2011 was great. I'm looking forward to ICML 2012! One extra bonus is that ICML 2012 will be co-located in Edinburgh with COLT 2012, the main learning theory conference.
Update (7/26/11): Videos of all the talks from ICML 2011 have now been posted here.
If you're interested in machine learning, you should follow John Langford's blog, hunch.net. Also, John will co-chair ICML 2012.