Thursday, October 27, 2011

ALT 2011

Earlier this month, I traveled to Finland, where I attended ALT 2011 and presented a paper on learning sparse parity functions in the presence of noise. 

a picture I took in Helsinki

COLT (Conference on Learning Theory) and ALT (Algorithmic Learning Theory) are the two main learning theory conferences.  Though COLT is both the larger and better known of the two, it seems that ALT is growing stronger with every passing year, and this year was no exception.  Not only was the program good, but S├ębastien Bubeck gave a timely tutorial on bandits, which deservedly got their own session this year. (I, unfortunately, arrived too late to see the tutorial, but I read the nice slides afterwards.)

I wouldn't be the first to note that the other thing that differentiates ALT from its big brother is that ALT seems to have a broader definition of learning theory, and the conference is therefore more diverse in its topics than a typical COLT.  This has clear benefits, but it also means that there are entire sessions where I am pretty lost.  Of course, as I've previously noted, not feeling like you have to following everything could also be a benefit when you're attending a conference.

I want to mention one paper I particularly enjoyed, especially because of the model, which illustrates the importance of unlabeled data in the agnostic setting:
"Learning a Classifier when the Labeling is Known" by Shai Ben-David and Shalev Ben-David.  This paper works in an agnostic learning model where the learner already knows the labeling function, but wants to find a good succinct representation (i.e. proper learning).  How good a particular representation is depends on the distribution the data is coming from, and this is what the algorithm needs to learn by seeing examples.  It turns out that this, more generous, model is subject to essentially the same sample complexity lower bound as the common agnostic PAC model.

Monday, August 22, 2011

Happenings on CSTheory Q&A

In addition to machine learning, one of my main research interests is theoretical computer science.  Following the success of MathOverflow and other Q&A sites, ("cstheory" for short) was born about a year ago, and I've been an active participant since its inception.  In the last year, we've worked hard to make cstheory a valuable resource and to promote it.  I'd say the efforts paid off -- the site is doing quite well, with over 5000 members.

Moreover, cstheory recently added a blog (Aaron Sterling and Joe Fitzsimons are editors), whose posts are on topics related to theoretical computer science and to the site itself.  In one of the first posts, Suresh Venkatasubramanian, a site moderator and one of the people most responsible for its success, wrote a nice summary of cstheory's first year.

I've also volunteered to occasionally blog about the learning theory side of things.  My first entry, on learning regular languages, was just posted.  It touches on many interesting issues learning theorists regularly encounter. Enjoy!

Saturday, August 13, 2011

Reyzin-Srivastava Trees

When I was a grad student at Yale, Ravi Kannan taught a great sequence of graduate theory courses on approximation algorithms, randomized algorithms, and streaming algorithms, and I was lucky enough to get to take all three.  The first one, approximation algorithms, Ravi taught during my first semester in grad school.  We were allowed to collaborate on assignments, and I discussed many of them with my friend Nikhil Srivastava.

I remember a problem on one of the first assignments was particularly tricky.  The question was about s-t cuts in a graph.  The hard part about this problem was that to solve it, it seemed we needed to efficiently keep track of all pairs minimum s-t cuts.  We started discussing the problem in the afternoon, and though it took us a while, by about 5am we had designed a data structure to do just that.  We realized you can maintain all the cuts in just one tree.  It was perfect.  We jokingly called the data structure the Reyzin-Srivastava tree.

I spent the remainder of the night writing up the assignment, describing this thing in detail, proud of our cleverness.  About a week later, we got the graded assignments back.  Mine had one note on it, next to that particular question of course.  It said "Next time, just reference Gomory-Hu trees when you need to use them."

I learned a couple lessons from this.  The first is that I should remember to do the required reading before starting the assignment (we used Vijay Vazirani's approximation algorithms textbook, which really is worth reading).  Had I done the reading, I would have learned about Gomory-Hu trees and realized the question asked us to trivially apply them to some problem, instead of reinventing them.   The second, and perhaps, more important lesson was that I had a shot at doing research, at inventing something new.  Yes, these trees had been invented decades before, and yes, we'd been exposed to more data structures than Gomory and Hu had been in 1961.  But somehow knowing that we figured out something on our own that wasn't trivial really helped our confidence as first year grad students, and as I've written before, confidence is quite important for research.

Nikhil and I went on to do some real research together, producing two joint papers the following year.  Actually, both of those papers also began as classwork, for Dana Angluin's machine learning class.  But that's a story better left for another post.

Friday, July 15, 2011

On ICML 2011

I've heard from some people that they liked my summary of NIPS 2010, so I'm posting my thoughts on ICML 2011, where I went a couple weeks ago.  ICML is one of the two main yearly conferences in machine learning, so there are always interesting new results to see, and this year saw a record of over 700 attendees.  Unfortunately, I didn't get to attend the tutorials or the workshops, so my post is about the main conference.  The conference took place in Bellevue, a city near Seattle, Washington.
Bellevue downtown at night, from Wikipedia

Invited Talks

One of the main highlights of ICML 2011 were the invited talks.  Usually, I like one invited talk or so, but this time all of them were great.   I've heard a number of people remark that the organizers did a great job of choosing the speakers.  I should also note that the videos of all the talks, invited and not, will soon appear on TechTalks, and I recommend you watch the invited talks especially.

The first talk was by Chris Bishop, of the Microsoft Research (MSR) Cambridge, UK lab, who talked about applied machine learning coming of age.  He addressed a variety of topics, perhaps a bit loosely tied together, but the real highlight of his talk was the first half-hour, where he addressed the technology behind Kinect.  Apparently, much of the technology behind it was developed at MSR (this should be good for research funding at Microsoft, as Kinect was a huge commercial success), and he went into some depth about both the machine learning and hardware challenges in building Kinect. For example, machine learning was used for the body tracking system in Kinect, but with some interesting twists, e.g. 1) the body tracking problem isn't treated as a tracking problem at all: each individual frame is independently classified, 2) lots of the training data is synthetic.  On the hardware end, I learned that Kinect could track you just fine in the dark.  To find out more, you should watch the entire talk.
Kinect sensor, image from Wikipedia

The second invited talk was by a mathematical biologist, Martin Nowak, who talked about the evolutionary dynamics of competition and cooperation.  He talked about all sorts of interesting things.  What stuck with me the most were some results he presented on the repeated prisoner's dilemma.  In the famous classic prisoner's dilemma, each of two prisoners must independently decide whether to cooperate or defect (see figure below for a possible payoff matrix).  The dominant strategy for both prisoners is to defect, but if both prisoners cooperate they are better off than both defecting, hence the dilemma.  In the real world, we often play variants of this game, but over and over again.  The repeated prisoners dilemma setting captures this aspect of repeated play, and a while ago, Robert Axelrod held a famous repeated prisoners dilemma tournament where he showed tit-for-tat was a really good strategy: basically, to cooperate at first and then, on each future round, to do what your opponent did on the previous one.  I have a lot more to say on this topic, but I'll skip to Martin's talk.  Martin presented some interesting results where he showed us that if noise is introduced into this repeated game (like in real life), and one runs an evolutionary tournament that starts with random strategies, first tit-for-tat will evolve.  But then, tit-for-tat will be taken over by a more forgiving versions tit-for-tat, which will subsequently be dominated by completely cooperative strategies, which will be exploited by completely defecting strategies, which will then be taken over by tit-for-tats again.  And this cycle repeats!  Martin pessimistically suggested this this cycle is inevitable in the real world as well.  Of course, his talk had a lot more, so if you want to know the details of this and more, I recommend you watch it!
possible prisoner's dilemma payoff matrix, image from "beyond intractability"

Then, Hatmut Neven, of Google talked about the machine learning behind "Google Goggles," the reverse-image search where you take a picture of an object, send it to Google as a query, and Google will hopefully return some relevant results.  I hadn't thought much about this reverse-search problem, so a lot of the talk was interesting to me on a basic level.  For example, if you send a picture of a fancy chair, it's useless for Google to tell you it's a chair, you're probably looking to know what type of chair it is so you can buy it -- this makes the problem very hard.  But it doesn't mean Hatmut and his team haven't made a scary amount of progress: Hatmut says that Google can, given a picture of an arbitrary person who has about a dozen photos of them in Google's databases (i.e. online), correctly classify them in the top ten results.  Google hasn't activated this feature for privacy reasons, but if (or when?) Google does, this means that you could pretty much see someone on the street, take a picture of them, and find out who they are (and they don't even have to be a famous actress).  Hatmut thinks that in 10 years, we'll be able to carry a camera with us, which will, without any special querying, know more about the world around us than we do -- it will know what species of trees are around us, the origin of the name of the street we are on, the history of the author of the book we're reading, etc. Exciting stuff.  Finally, Hatmut talked about how quantum computing might help with all this, and he gave a much more positive endorsement of D-Wave than I'd read elsewhere.
D-Wave's 128-bit chip, image from Wikipedia

The last talk was by David Ferrucci about the IBM Watson system that beat the top two humans on Jeopardy (see my post on Watson).  He talked about the immense challenge of building Watson, the collaboration among many different types of researchers, the scientific problems involved, and the various problems his team had to overcome.  The task of building Watson was so monumental that many (non computer scientist) people couldn't get their heads around it.  David described an incident when he went on a radio show to talk about Watson and was asked by the radio host "So, you're going to give the computer all the questions and answers, and when a question is asked, the computer will pull up the right answer?"  David said, surely rolling his eyes, "No. We don't know any of the answers in advance, let alone the questions."  Then the radio show host responded "I don't understand. How are you going to do it then?" "Exactly!" said David.  I trust I'm not spoiling anything if I tell you machine learning was a big part of the solution.
Watson's avatar, image from Wikipedia

Neural Networks 

John Langford (who hosted my postdoc at Yahoo! Research) already blogged that neural networks were making a comeback, so to speak, at ICML 2011.  In addition to the sheer number of "deep learning" papers, lots of people excitedly talked about what great things they could train their neural nets to do.  I would add to John's post, after talking to Bill and Ilya, that another reason this area is becoming "hot" is that we finally have the computing power to train neural networks for real-world problems, something that was previously much more difficult.

I must admit that I don't know enough about neural networks, and the reason is that it's not easy to see how the fit into the larger world from a learning theory perspective.  For example, we know that a single-layer neural network cannot learn even the XOR function, that recurrent neural nets can simulate Turing Machines, and that certain classes of neural nets are not properly PAC learnable.  But from what I can see, the theory of neural nets is of a very different flavor than, say, the theory of decision trees.  If someone who knows these connections better wrote up a survey, I'm sure I'm not the only one who would be grateful. (Even better, maybe such a survey exists and I'm not aware of it.)

Papers I Liked

Given that ICML has 4 or 5 parallel tracks, I didn't get to see even 20% of the talks, but I tried to catch what I could at the poster sessions.  Here are a few non neural-networks (for the neural networks ones, see John's post) papers I liked, besides my own, of course.

"On the Necessity of Irrelevant Variables" by David Helmbold and Phil Long.  The main gist of this paper is that irrelevant variables don't hurt a predictor nearly as much as relevant variables help.  So if you're trying to do feature selection and aren't sure whether to add a feature or not, add it.

"Clustering Partially Observed Graphs via Covex Optimization" by Ali Jalali, Yudong Chen, Sujay Sanghavi, and Huan Xu.  Here, they do correlation clustering with partial observations, which is NP-Hard, so instead of the usual approach of making an approximation algorithm, they give an algorithm which succeeds of the data satisfies certain properties.  They're not the first to do this type of analysis, but I rather liked their type of data-dependent assumption.

"Optimal Distributed Online Prediction" by Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao.  This paper considers online prediction, but in a distributed model, where one can get speedups, possibly at some cost in the regret.  They derive optimal regret bounds for convex losses and i.i.d. examples, and show that for many parameter settings,  you can actually get the same asymptotic regret as without parallelizing.

"Doubly Robust Policy Evaluation and Learning" by Miroslav Dudik, John Langford, and Lihong Li.  I'm a bit biased about this paper because I was at Yahoo! when they were woking on it. They basically show a technique for evaluating a new bandit policy on historical data, where you can get an unbiased evaluation as long as you have either a good model of rewards or a good model of the past policy. Hence, doubly robust.


ICML 2011 was great.  I'm looking forward to ICML 2012! One extra bonus is that ICML 2012 will be co-located in Edinburgh with COLT 2012, the main learning theory conference.

Update (7/26/11): Videos of all the talks from ICML 2011 have now been posted here.

If you're interested in machine learning, you should follow John Langford's blog,  Also, John will co-chair ICML 2012.

Thursday, May 19, 2011

Three Tweets

It is well-known that it's not always easy to explain scientific ideas to the public.  Scientists are often blamed for being bad communicators, but I don't think that's fair.  Most people simply aren't interested enough to read detailed explanations of science (or of anything for that matter).  And scientists are hesitant to give oversimplified explanations because, among other reasons, oversimplified explanations are by definition not correct.  The problem surely exists in all sorts of disciplines, but is probably exacerbated in the sciences/math, where one often needs years of postgraduate study to truly grasp what's going on.

Sometimes, though, it doesn't hurt to give simplified explanations of complicated phenomena.  Our universe is a cool and interesting place, and making some of what we've learned accessible to more people isn't a bad idea.  Maybe it will make more people excited about science and help with funding in the long run.  It's also fun to try to explain what you're doing to others, even if it's on a high level.

Unfortunately, there's still temptation for scientists to go into too much detail.  We can't help ourselves but to bore everyone around us.  This is where twitter comes to the rescue.  Its 140 character limit forces us to be concise, so if we're going to talk about science at all, we have to choose our words carefully.  So, when Sean Carroll, a physicist at CalTech, entertained a request to explain M-theory on twitter, and attempted to do it using only 3 tweets, he opened a floodgate of other scientists trying to explain the major ideas in their fields in just 3 tweets.  Check out the #3tweets hashtag, and you'll see all sorts of interesting things posted.

Sticking to three tweets strikes a balance between a blog post (which won't get a large readership) and just 1 tweet (in which one cannot explain anything).  And if you have followers who are reading your twitter stream, they won't be able to avoid reading some science.

I, too, got tempted and did my own three tweets on the Church-Turing thesis (to be read bottom-up):

If you have a twitter account (and if you don't, get one), try explaining something about what you do, whether it's science or not, in just three tweets (and don't forget to use the #3tweets hashtag).  It's harder than it seems.

Sean Carroll also blogged about this.

Wednesday, March 30, 2011

Paperless Problems

While listening to an academic talk at a conference, I like to read the corresponding paper being presented.  I follow along by taking notes right in the paper proceedings and marking things that will be of interest to me later.  Then on the flight home, I leisurely browse through the proceedings, looking at my notes.

Apparently, I am also one of only a few people to do this because conferences, almost uniformly, have been getting rid of paper proceedings.  Instead, we get the proceedings on CD or USB stick, or the proceedings are put online, and we are (in theory) supposed to follow along on our laptops or to print the papers interesting to us in advance of the talks.

Unfortunately, this never really worked for me.  My laptop 1) doesn't have enough battery life to last the entire conference* 2) does not allow me to take notes on the paper 3) is too distracting to be used effectively.  These problems are solvable in theory, but it seems in theory only. I could (and sometimes do) fight for the occasional outlet. Technically, there's software that will let me write on pdfs, but it never works properly, and I'm no good at "writing" with my mouse. And I could try to not get distracted on my laptop, but it's a losing battle.  

For all our technical sophistication, we can't yet beat the simplicity of paper.  It's easy to write on and even has great battery life and resolution! But the solution of printing interesting papers in advance is no good because I don't know which papers are interesting to me until I actually go to the talk -- I think of talks as advertisements for papers.  If a talk is good, I'll read the paper afterwards.  If a talk is boring, I probably won't, unless I need to directly for research.

So, desperate for a solution, I got a Kindle so that I could load the papers in advance and read them during the conference.  The latest Kindle even advertised the ability to mark up pdf documents.  However, I quickly discovered that this was no solution at all.  The Kindle interface is so awful, that I managed to do none of these things. The only thing one can realistically do with a Kindle is read novels, preloaded in Amazon's format.  Actually the Kindle did improve my conference travel experience in that I no longer have to lug around books to read for fun.  Unfortunately, my original problem remains.

Now, I do realize that printing costs money, uses paper, and produces big proceedings that are no fun to lug around.  For big conferences like NIPS, I agree that printing all the papers would probably be too much.  And I'm confident that technology will be good enough in a couple years that this will no longer be a problem, making this inconvenience only temporary.  Perhaps going paperless is even the right solution at this time, though I tend to think we abandoned paper too early.

But I don't understand what everyone does in the meantime.  I've never seen anyone follow a paper on a laptop screen, nor have I ever witnessed anyone actually printing papers in advance of talks.  So what should I do?  Should I get an iPad -- will it solve any of my problems?  Is there something everyone is doing that I'm missing?  I don't want to have to bring up the old paper/electronic proceedings debate at the next business meeting!**

* My new laptop's battery might just last long enough, but the other issues remain with using a laptop.
** Okay, I'm bluffing.

Tuesday, February 15, 2011

Elementary, My Dear Watson?

As many of you know, this week IBM's computer system, Watson, is competing on Jeopardy against its two strongest performers, Ken Jennings and Brad Rutter. At the time of this post, their first match has finished, and Watson is ahead by a large margin.  Watson's lead is so great that it would be quite surprising if Ken or Brad were to catch up.  Given I do machine learning research (though I'd more call the Jeopardy task AI than ML), I couldn't resist posting about this match.

Ever since witnessing humanity's line fall when Deep Blue defeated Kasparov over a decade ago, we humans have become accustomed to computers outperforming us at various tasks.  Computers were first built precisely to do computations quicker and more accurately than we could hope to.  And even though winning at chess takes a lot more than brute-force computing power (if a computer really tried to calculate all possible chess move sequences, it would take more than the current age of the universe for it to finish), chess seems like one of those activities that computers should be good at.  Actually, the best humans can still easily defeat the best computer systems at Go, but few of us will be surprised when, in a couple years, this ceases to be the case.

Jeopardy is another thing all-together.  One might at first ask what possible hope is there of beating a computer at trivia.  Watson can download pretty much the entire internet into its memory (and has), but the problem is what to do with all this information.  And equally difficult is understanding the Jeopardy questions (or "answers" as they call them) in the first place.  Yesterday, Watson had a lot of trouble with the "name the decade" category precisely because it didn't know the answer had to be a decade.  Or in naming the murderer of Snape and others, Watson couldn't properly rank Voldemort over Harry Potter because, even though it had the entire book in its memory, it had no idea of the concept of murder -- only of word associations, and Harry and murder appear rather frequently together (you-know-who's fault).**

So the difficult part for Watson is exactly the easy part for humans (and vice versa).  Watson can easily store the name of every general, country, and battle, but has a lot more difficulty trying to figure out what is being asked.  Sure there are keywords like "he" or "this date," but when there's any ambiguity, it's quite a challenge.  And even if Watson figures out the answer is a date, it still does massive lookups, searches for correlations, runs machine learning algorithms, etc.  At any point, something can go wrong because Watson doesn't "understand" the way we do.  (Interestingly enough, as Louis von Ahn observes on twitter, the final Jeopardy question about city airport names was pretty hard to answer even for us humans using a search engine, though it wasn't so hard for Ken or Brad, but no human would answer Toronto for "U.S. cities").  Ken and Brad need no help interpreting the questions, but to them, remembering the ridiculous amount of information is the hard part.

That being said, the IBM team has done a great job with Watson.  Watson's performance this far already heralds other incredibly useful applications, which are not so unlike the Star Trek TNG computer systems -- only they'll come much before the 24th century.  Don't get me wrong; there's still quite a bit to go.  Real-life speech doesn't come in the form of nicely-formatted Jeopardy riddles (typed to Watson, who still doesn't have speech recognition), but this demonstration shows we've passed a major hurdle.

Finally, I should say that even though I believe that these technological changes are ultimately for the better, I am rooting for Ken and Brad*.  Once Watson can beat us, there's no going back.  In 10 more years, your typical personal computer, in whatever phone/laptop/terminal/watch shape it is, will run circles around Watson.

Yet, whatever discomfort I have will probably soon go away, and I'll become accustomed to computers being better than we are at yet another thing.  I'll take comfort in my personal better-than-Watson computer.  I'll enjoy the efficiency this brings.  I'll celebrate of the new scientific and medical breakthroughs we'll make using the help of (and closer interaction with) ever more powerful computers.  I'll, for one, eventually welcome our trivia overlords.

Update (2/16/11): Watson won today, as expected. And Daniel Reeves has some interesting ideas on how to change the rules of the game.

*I also felt that the game was a bit rigged against Ken and Brad -- the computer can always buzz in faster, so Ken and Brad have to compete for who wins the more human questions.  Were the game 1 on 1, it would have been a lot closer.  And if it were two Watsons vs just Ken or Brad, I'm sure the humans would win.

**Showing, once more that humans are decidedly more human, as we can easily tell the difference between Harry and Voldemort.

The image is of IBM's Watson Avatar and taken from Wikipedia.  It is posted for commentary under "fair use."