Content Tagged with privacy + search

Using data to fight webspam



This post is the latest in an ongoing series about how we harness the data we collect to improve our products and services for our users. - Ed.

As the head of the webspam team at Google, I'm in charge of making sure your search results are as relevant and informative as possible. Webspam, in case you've never heard of it, is the junk you see when websites successfully cheat their way into higher positions in search results or otherwise violate search engine quality guidelines. If you've never seen webspam, here's a good example of what you might see if you click on a spam link in the search results.



You can see how unhelpful such a page would be. This example has almost no original content and is filled with irrelevant links and information that is of little use to a user. We work hard to ensure you rarely see search results like this. Imagine how annoyed you would be if you clicked on a link from a Google search result and ended up on a page like this.

Searchers don't often see blatant, outright spam like this in search results today. But webspam was much more of an issue before Google became popular and before we were able to build effective anti-spam methods. In general, webspam can be a real annoyance, such as when a search on your own name returns links to porn pages as results. But for many searches, where getting relevant information is more critical, spam is a serious problem. For example, a search for prostate cancer that's full of spam instead of relevant links greatly diminishes the value of a search engine as a helpful tool.

Data from search logs is one tool we use to fight webspam and return cleaner and more relevant results. Logs data such as IP address and cookie information make it possible to create and use metrics that measure the different aspects of our search quality (such as index size and coverage, results "freshness," and spam).

Whenever we create a new metric, it's essential to be able to go over our logs data and compute new spam metrics using previous queries or results. We use our search logs to go "back in time" and see how well Google did on queries from months before. When we create a metric that measures a new type of spam more accurately, we not only start tracking our spam success going forward, but we also use logs data to see how we were doing on that type of spam in previous months and years.

The IP and cookie information is important for helping us apply this method only to searches that are from legitimate users as opposed to those that were generated by bots and other false searches. For example, if a bot sends the same queries to Google over and over again, those queries should really be discarded before we measure how much spam our users see. All of this--log data, IP addresses, and cookie information--makes your search results cleaner and more relevant.
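The filtering step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not a description of Google's actual pipeline: the log record fields (`ip`, `cookie`, `query`, `saw_spam`), the repeat threshold, and both function names are hypothetical.

```python
from collections import Counter

def filter_bot_queries(log, max_repeats=3):
    """Drop entries from any (ip, cookie) client that sent the same
    query more than max_repeats times (hypothetical threshold)."""
    counts = Counter((e["ip"], e["cookie"], e["query"]) for e in log)
    return [
        e for e in log
        if counts[(e["ip"], e["cookie"], e["query"])] <= max_repeats
    ]

def spam_rate(log):
    """Fraction of logged searches whose results included spam."""
    if not log:
        return 0.0
    return sum(1 for e in log if e["saw_spam"]) / len(log)

# Made-up log: one client repeats the same spammy query eight times.
log = (
    [{"ip": "10.0.0.1", "cookie": "a", "query": "cheap pills", "saw_spam": True}] * 8
    + [
        {"ip": "10.0.0.2", "cookie": "b", "query": "prostate cancer", "saw_spam": False},
        {"ip": "10.0.0.3", "cookie": "c", "query": "local weather", "saw_spam": False},
        {"ip": "10.0.0.4", "cookie": "d", "query": "cheap pills", "saw_spam": True},
        {"ip": "10.0.0.5", "cookie": "e", "query": "firefox extensions", "saw_spam": False},
    ]
)

raw_rate = spam_rate(log)                        # bot traffic included
clean_rate = spam_rate(filter_bot_queries(log))  # bot traffic discarded
```

In this toy data the repeated bot queries inflate the measured spam rate, which is why they would need to be discarded before measuring what real users see.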

If you think webspam is a solved problem, think again. Last year Google faced a rash of webspam on Chinese domains in our index. Some spammers were purchasing large numbers of cheap .cn domains and stuffing them with misspellings and porn phrases. Savvy users may remember reading a few blogs about it, but most regular users never even noticed. The reason that a typical searcher didn't notice the odd results is that Google identified the .cn spam and responded with a fast-tracked engineering project to counteract that type of spam attack. Without our logs data to help identify the speed and scope of the problem, many more Google users might have been affected by this attack.

In an ideal world, the vast majority of our users wouldn't even need to know that Google has a webspam team. If we do our job well, you may see low-quality results from time to time, but you won't have to face sneaky JavaScript redirects, unwanted porn, gibberish-stuffed pages or other types of webspam. Our logs data helps ensure that Google detects and has a chance to counteract new spam trends before they lower the quality of your search experience.

Update: Enlarged image.

Google: Official Google Blog

Why data matters



We often use this space to discuss how we treat user data and protect privacy. With the post below, we're beginning an occasional series that discusses how we harness the data we collect to improve our products and services for our users. We think it's appropriate to start with a post describing how data has been critical to the advancement of search technology. - Ed.

Better data makes for better science. The history of information retrieval illustrates this principle well.

Work in this area began in the early days of computing, with simple document retrieval based on matching queries with words and phrases in text files. Driven by the availability of new data sources, algorithms evolved and became more sophisticated. The arrival of the web presented new challenges for search, and now it is common to use information from web links and many other indicators as signals of relevance.

Today's web search algorithms are trained to a large degree by the "wisdom of the crowds" drawn from the logs of billions of previous search queries. This brief overview of the history of search illustrates why using data is integral to making Google web search valuable to our users.

A brief history of search

Nowadays search is a hot topic, especially with the widespread use of the web, but the history of document search dates back to the 1950s. Search engines existed in those ancient times, but their primary use was to search a static collection of documents. In the early 60s, the research community gathered new data by digitizing abstracts of articles, enabling rapid progress in the field in the 60s and 70s. But by the late 80s, progress in this area had slowed down considerably.

In order to stimulate research in information retrieval, the National Institute of Standards and Technology (NIST) launched the Text Retrieval Conference (TREC) in 1992. TREC introduced new data in the form of full-text documents and used human judges to classify whether or not particular documents were relevant to a set of queries. They released a sample of this data to researchers, who used it to train and improve their systems to find the documents relevant to a new set of queries and compare their results to TREC's human judgments and other researchers' algorithms.
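Comparing a system's output to human relevance judgments can be sketched with two standard information-retrieval measures, precision and recall. This is only an illustration: the document IDs are made up, the function name is hypothetical, and it is not a description of TREC's actual scoring procedure.

```python
def precision_recall(retrieved, relevant):
    """Score retrieved documents against human relevance judgments.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments for one query, and one system's results.
judged_relevant = {"doc1", "doc3", "doc7", "doc9"}
system_output = ["doc1", "doc2", "doc3", "doc4"]

precision, recall = precision_recall(system_output, judged_relevant)
```

Because every system is scored against the same judgments, the numbers from different research groups can be compared directly.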

The TREC data revitalized research on information retrieval. Having a standard, widely available, and carefully constructed set of data laid the groundwork for further innovation in this field. The yearly TREC conference fostered collaboration, innovation, and a measured dose of competition (and bragging rights) that led to better information retrieval.

New ideas spread rapidly, and the algorithms improved. But with each new improvement, it became harder and harder to improve on last year's techniques, and progress eventually slowed down again.

And then came the web. In its beginning stages, researchers used industry-standard algorithms based on the TREC research to find documents on the web. But the need for better search was apparent--now not just for researchers, but also for everyday users--and the web gave us lots of new data in the form of links that offered the possibility of new advances.

There were developments on two fronts. On the commercial side, a few companies started offering web search engines, but no one was quite sure what business models would work.

On the academic side, the National Science Foundation started a "Digital Library Project" which made grants to several universities. Two Stanford grad students in computer science named Larry Page and Sergey Brin worked on this project. Their insight was to recognize that existing search algorithms could be dramatically improved by using the special linking structure of web documents. Thus PageRank was born.

How Google uses data

PageRank offered a significant improvement on existing algorithms by ranking the relevance of a web page not by keywords alone but also by the quality and quantity of the sites that linked to it. If I have six links pointing to me from sites such as the Wall Street Journal, New York Times, and the House of Representatives, that carries more weight than 20 links from my old college buddies who happen to have web pages.
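The idea that a link from a well-linked site counts for more than a link from an obscure one can be sketched with the textbook power-iteration form of PageRank. This is a simplified illustration, not Google's production algorithm: the damping value of 0.85 is a commonly used setting assumed here, and the page names and tiny link graph are made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank, with ranks summing to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# "me" gets one link from a well-linked site; "friend" gets two links
# from pages that nobody links to.
links = {
    "reader1": ["news"], "reader2": ["news"], "reader3": ["news"],
    "news": ["me"],
    "buddy1": ["friend"], "buddy2": ["friend"],
}
rank = pagerank(links)
```

In this graph "me" ends up ranked above "friend" despite having fewer inbound links, because the single link comes from a page that is itself heavily linked.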

Larry and Sergey initially tried to license their algorithm to some of the newly formed web search engines, but none were interested. Since they couldn't sell their algorithm, they decided to start a search engine themselves. The rest of the story is well-known.

Over the years, Google has continued to invest in making search better. Our information retrieval experts have added more than 200 additional signals to the algorithms that determine the relevance of websites to a user's query.

So where did those other 200 signals come from? What's the next stage of search, and what do we need to do to find even more relevant information online?

We're constantly experimenting with our algorithm, tuning and tweaking on a weekly basis to come up with more relevant and useful results for our users.

But in order to come up with new ranking techniques and evaluate if users find them useful, we have to store and analyze search logs. (Watch our videos to see exactly what data we store in our logs.) What results do people click on? How does their behavior change when we change aspects of our algorithm? Using data in the logs, we can compare how well we're doing now at finding useful information for you to how we did a year ago. If we don't keep a history, we have no good way to evaluate our progress and make improvements.

To choose a simple example: the Google spell checker is based on our analysis of user searches compiled from our logs -- not a dictionary. Similarly, we've had a lot of success in using query data to improve our information about geographic locations, enabling us to provide better local search.
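To make the log-based approach concrete, here is a toy sketch of a spelling suggester that has no dictionary at all and simply prefers whichever nearby spelling users typed most often. It follows the well-known single-edit-distance approach and is an assumption for illustration only, not Google's actual spell checker; the sample queries and function names are hypothetical.

```python
from collections import Counter
import string

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def suggest(word, query_counts):
    """Return the most frequently logged spelling within one edit of word,
    or word itself if nothing similar appears in the logs."""
    candidates = {w for w in edits1(word) | {word} if w in query_counts}
    if not candidates:
        return word
    return max(candidates, key=lambda w: query_counts[w])

# Made-up query log: the common spelling outnumbers the rare variant.
logged_queries = [
    "britney", "britney", "britney", "brittney",
    "firefox", "firefox", "privacy",
]
query_counts = Counter(logged_queries)

suggestion = suggest("brittney", query_counts)
```

The suggestion comes entirely from what other searchers typed, so names and new words that no dictionary contains can still be corrected.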

Storing and analyzing logs of user searches is how Google's algorithm learns to give you more useful results. Just as data availability has driven progress of search in the past, the data in our search logs will certainly be a critical component of future breakthroughs.

Google: Official Google Blog

CustomizeGoogle: Improve Your Google Experience -- Firefox Extension

CustomizeGoogle is a Firefox extension that enhances Google search results by adding extra information (like links to Yahoo, Ask.com, MSN etc) and removing unwanted information (like ads and spam).

Firefox: del.icio.us/tag/firefox

TrackMeNot

"TrackMeNot is a lightweight browser extension that helps protect web searchers from surveillance and data-profiling by search engines."

Firefox: del.icio.us/tag/firefox

TrackMeNot

TrackMeNot is a lightweight browser extension that helps protect web searchers from surveillance and data-profiling by search engines. It does so not by means of concealment or encryption (i.e. covering one's tracks), but instead, paradoxically, by the

Firefox: del.icio.us/tag/firefox

Privacy: Keep Your Browsing Private with 10 Firefox Extensions

Linux.com has put together a good overview of Firefox extensions that keep your browsing, searching, and emailing secure and private. A few of these

Firefox: del.icio.us/tag/firefox

TrackMeNot Firefox Extension

protect web searchers from surveillance and data-profiling by search engines

Firefox: del.icio.us/tag/firefox

Mozdev.org: Sherlock & OpenSearch Search Engine Plugins

Here are the Search Engine Plugins / Search Providers that match your query. Click once on the plugin name to install. The new engine will appear in the search bar shortly.

Firefox: del.icio.us/tag/firefox

WASTE

An anonymous, secure, and encrypted collaboration tool that lets users share ideas through a chat interface and share data through its download system

opensource: del.icio.us/tag/opensource

[from bushwald] Google's back-up plan: The enterprise

'Asked at a press conference Tuesday if Google is thinking about an alternative business plan for 5 to 10 years out, Chairman and CEO Eric Schmidt didn't miss a beat: "We are, and that's why we have Google Enterprise," he said.'

User:jeyrb: del.icio.us/network/jey

How To Clear Your Google Search History from Browser and Toolbar

One of the main reasons people clear their Google search history from their web browser and Google Toolbar is to maintain their privacy. If your computer is shared with a few people, sometimes it's just not nice to let them "accidentally"

Firefox: del.icio.us/tag/firefox
