Internet searches have become so much a part of our lives and our work that we rarely think about how they work: not just how a search engine finds information related to our search terms, but how it then ranks the information it finds. PageRank is an incredibly important aspect of searching. After all, what good is a search that assigns equal priority to every piece of information matching your search term?
Math editor Vickie Kearn spoke with Sep Kamvar, author of Numerical Algorithms for Personalized Search in Self-organizing Information Networks, about how personal information is used to provide the best results in any given search. Read on to learn why a New Yorker and a San Franciscan don’t get the same results when they search for “Giants”.
What is a self-organizing information network?
Some examples of self-organizing information networks are the World Wide Web, social networks like Facebook, and peer-to-peer networks like Gnutella. In the case of the web, the nodes in the network are individual web pages containing information, and the edges are links between pages. We call these networks self-organizing because there is no central organization of the network — each web page author chooses his or her own links.
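The structure Kamvar describes, pages as nodes and author-chosen links as edges, can be sketched as a simple adjacency list. The page names below are hypothetical, purely for illustration:

```python
# A tiny slice of the web as a self-organizing network: each author
# independently chooses the outgoing links on his or her own page.
web = {
    "alice.example": ["bob.example", "news.example"],
    "bob.example": ["news.example"],
    "news.example": [],  # a page with no outgoing links
}

# Nodes are the pages; edges are the links each author chose to include.
for page, outgoing in web.items():
    print(page, "links to", outgoing)
```

No central authority built this graph; it emerges from each author's local decisions, which is exactly what makes the network self-organizing.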
In the introduction to your book, you say that you address the issues of relevance by exploiting user preference information to perform personalized search. What user preference information can be used?
There is a wide range of preference information that can be used. A search engine may ask the user directly what they are interested in, or may use implicit context like a user’s location or search history.
What is personalized search?
Personalized search exploits the preferences of an individual to bias search toward that individual’s preferences. For example, a personalized Web search for the term “giants” would return the official site of the New York Giants to a football fan from New York, while the same query would return the official site of the San Francisco Giants to a baseball fan from San Francisco. Personalization takes advantage of the local context to return the right sports team.
What is reputation management?
Reputation management aggregates the preferences of all individuals to bias search toward the data sources that are deemed reputable by the group. In the example above, reputation takes advantage of the global context to return the official site of the corresponding team, rather than some random fan page. In the context of web search, reputation and ranking are closely related; Google’s PageRank algorithm is a classic example of a reputation algorithm on the web.
You state in your book that numerical algorithms, simulation and experimentation are all part of what makes reputation management and personalized search possible. How does this behind-the-scenes science make our lives easier and more enjoyable? How frustrating would searches be if these tools were not used?
They each play different roles. One of the biggest challenges in web search is scale: in personalized web search, it’s necessary to compute the principal eigenvector of a several-billion-by-several-billion matrix several hundred million times. Numerical algorithms that compute these eigenvectors efficiently are essential to making personalized search possible.
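As a rough illustration (not the production-scale methods Kamvar's book develops), the standard textbook way to find such a principal eigenvector is power iteration: repeatedly multiply a probability vector by the matrix until it stops changing. The 3x3 matrix below is a toy stand-in for the several-billion-row link matrix mentioned above:

```python
import numpy as np

def power_iteration(M, tol=1e-10, max_iter=1000):
    """Approximate the principal eigenvector of a column-stochastic matrix M."""
    n = M.shape[0]
    v = np.full(n, 1.0 / n)  # start from the uniform distribution
    for _ in range(max_iter):
        v_next = M @ v
        v_next /= v_next.sum()  # renormalize to keep a probability vector
        if np.abs(v_next - v).sum() < tol:
            break
        v = v_next
    return v_next

# A tiny column-stochastic matrix standing in for the web-scale link matrix.
M = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
print(power_iteration(M))  # the stationary distribution of this toy matrix
```

Each iteration costs only a matrix-vector multiply, which is why this family of methods scales to sparse matrices with billions of rows where direct eigensolvers are hopeless.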
Experimentation is important in any empirical science, and search is no different. Since search is not a closed-form problem to be solved but a space to be explored, it’s important to experiment with different ranking and relevance algorithms and see how users respond.
And simulation is crucial to doing research in peer-to-peer networks, where it’s difficult to deploy new algorithms on existing networks. Simulation allows scientists to experiment with different algorithms and threat models at low cost.
When a person first starts to use the internet, is there a lag time when the personalized search becomes optimal? Is it a case where the more information you have about yourself (meaning the sites you visit) the better your personalized search will become?
While we don’t have a concrete sense of what optimal means here, certainly the more preference information a search engine has, the more relevant the results it can return.
Do the algorithms in your book help to filter out spam?
In general, personalization counteracts spam because spammers benefit greatly from having a single set of results for everybody. With unpersonalized search, if a spammer is able to get their page into the first position for a query, it will appear in the first position for everybody, and the reward for the spammer is large. With personalized search, if a spammer manages to get their page into the first position for themselves, there is no guarantee that it will be in the first position for anybody else. So the amount of effort required to spam is much greater.
We hear about PageRank and know that it has something to do with search engines, but what does it actually do? How important is it to finding what we are looking for when we do an internet search?
PageRank is a ranking algorithm that scores the importance of every page on the web based on the number of pages that link to it, weighted by the importance of those linking pages. PageRank and anchor text analysis (using the text of links to describe the pages they point to) are the two central innovations in link analysis of the last decade. Search would be much worse without them.
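The circular definition — a page is important if important pages link to it — resolves into a simple iteration. The sketch below is a toy version of the idea, not Google's implementation; the graph and the teleport-vector style of personalization (one standard way to bias the scores, echoing the "Giants" example earlier) are illustrative assumptions:

```python
import numpy as np

def pagerank(links, d=0.85, personalization=None, iters=100):
    """Toy PageRank. links[i] lists the pages that page i links to.

    d is the damping factor; the personalization vector biases the
    random surfer's teleport step toward preferred pages, which is
    one standard way to personalize PageRank scores.
    """
    n = len(links)
    p = np.full(n, 1.0 / n) if personalization is None else np.asarray(personalization, float)
    p = p / p.sum()
    r = p.copy()
    for _ in range(iters):
        r_next = (1 - d) * p  # teleport mass
        for i, outs in enumerate(links):
            if outs:
                for j in outs:
                    r_next[j] += d * r[i] / len(outs)  # share score over out-links
            else:
                r_next += d * r[i] * p  # dangling page: redistribute via teleport
        r = r_next
    return r

# Toy graph: pages 0 and 1 both link to page 2; page 2 links back to page 0.
links = [[2], [2], [0]]
print(pagerank(links))  # page 2 scores highest: two pages link to it
```

Passing, say, `personalization=[0.8, 0.1, 0.1]` would tilt the teleport step toward page 0 and raise the scores of pages near it, which is the basic mechanism behind the personalized rankings discussed above.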
What are some of the challenges computer scientists face in developing these algorithms?
One of the biggest challenges is scale. Any algorithm developed for web search needs to scale to several billion websites and more than a billion users. Many techniques that work for more reasonably sized data sets are intractable at this scale. Another challenge is that every algorithm in search affects every other algorithm: any change you make to your ranking algorithm will affect your relevance algorithm, and even changes to your crawling strategy will affect the others. And finally, a key challenge is evaluation. How do you know that the search results you get are right? Because of the nature of the problem, most evaluation techniques are very human-intensive, and with personalization they become even more so.