Content Tagged with privacy + Google

lucidpg - Google Code

This extension provides Firefox users with some functionality related to the OpenPGP standard defined in RFC 4880 among other places

Firefox: del.icio.us/tag/firefox

Integrate encryption into Google Calendar with Firefox extensions

Today's Web applications provide many benefits for online storage, access, and collaboration. Although some applications offer encryption of user data, most do not. This article provides tools and code needed to add basic encryption support for user data

Firefox: del.icio.us/tag/firefox

What comes next in this series? 13, 33, 53, 61, 37, 28...



Late one night in the summer of 2000, I found myself answering user support emails in response to two new features we had just released, Advanced Search and Preferences (at the time catchily called "Language, Display, and Filtering Options" :)). Busy crafting answers about how to set Safesearch or change the number of results offered by default, I worked my way through the email queue. And then I saw it: The next email had just a number ("37") in the subject - and no message text. What a weird form of spam, I thought. Why would anyone be motivated to just send a number? I searched for the user's email address to see what else had been sent. Interesting. Lots of numbers: 33, 53, and then a clue: "61, getting a bit heavy, aren't we?" Furthermore, the date on each of the messages seemed very familiar. Then I realized that's because the dates were all days that I had launched various changes on the homepage. "Getting a bit heavy?" - that one did correspond to one of the wordiest homepage releases we had ever done. Could the sender be counting words? Sure enough, I looked back, counted the words myself, and he was - a manual, human version of a scale for the Google homepage. He was weighing our homepage and letting us know when it was getting too heavy. One of his earliest mails had a note in the body: "What happened to the days of 13?" - referring to the word count on the initial 1999 homepage.

This mystery and its revelation were really interesting because I thought about the homepage, and how to keep it simple, all the time. Yet I hadn't thought to look at it through this very simple lens: just count the words. The fewer, the better. Ever since that night, this has been our discipline, and everyone who works on the homepage and its design knows the current number: 28. (That's the word count for the basic page if you are signed out, there's no promotional line running beneath the search box, you've set Google as your homepage and thus don't get the "Make Google Your Homepage!" link, and you count "©2008 Google" as two words.)

So, today we're making a homepage change by adding a link to our privacy overview and policies. Google values our users' privacy first and foremost. Trust is the basis of everything we do, so we want you to be familiar and comfortable with the integrity and care we give your personal data. We added this link both to our homepage and to our results page to make it easier for you to find information about our privacy principles. The new "Privacy" link goes to our Privacy Center, which was revamped earlier this year to be more straightforward and approachable, with videos and a non-legalese overview to make sure you understand in basic terms what Google does, does not, will, and won't, do in regard to your personal information.

How does privacy relate to homepage word count? Larry and Sergey told me we could only add this to the homepage if we took a word away - keeping the "weight" of the homepage unchanged at 28. Given that the new Privacy link fit best with legal disclaimers on the page, I looked to the copyright line. There, we dropped the word "Google" (realizing it was implied, obviously) and added the new privacy link alongside it.



We think the easy access to our privacy information without any added homepage heft is a clear win for our users and an enhancement to your experience. You can check out the new Privacy Center here.

Google: Official Google Blog

Using data to fight webspam



This post is the latest in an ongoing series about how we harness the data we collect to improve our products and services for our users. - Ed.

As the head of the webspam team at Google, I'm in charge of making sure your search results are as relevant and informative as possible. Webspam, in case you've never heard of it, is the junk you see in search results when websites successfully cheat their way into higher positions in search results or otherwise violate search engine quality guidelines. If you've never seen webspam, here's a good example of what you might see if you click on a link in the search results that's spam.



You can see how unhelpful such a page would be. This example is filled with almost no original content, irrelevant links, and information that is of little use to a user. We work hard to ensure you rarely see search results like this. Imagine how annoyed you would be if you clicked on a link from a Google search result and ended up on a page like this.

Searchers don't often see blatant, outright spam like this in search results today. But webspam was much more of an issue before Google became popular and before we were able to build effective anti-spam methods. In general, webspam can be a real annoyance, such as when a search on your own name returns links to porn pages as results. But for many searches, where getting relevant information is more critical, spam is a serious problem. For example, a search for prostate cancer that's full of spam instead of relevant links greatly diminishes the value of a search engine as a helpful tool.

Data from search logs is one tool we use to fight webspam and return cleaner and more relevant results. Logs data such as IP address and cookie information make it possible to create and use metrics that measure the different aspects of our search quality (such as index size and coverage, results "freshness," and spam).

Whenever we create a new metric, it's essential to be able to go over our logs data and compute new spam metrics using previous queries or results. We use our search logs to go "back in time" and see how well Google did on queries from months before. When we create a metric that measures a new type of spam more accurately, we not only start tracking our spam success going forward, but we also use logs data to see how we were doing on that type of spam in previous months and years.

The IP and cookie information is important for helping us apply this method only to searches that are from legitimate users as opposed to those that were generated by bots and other false searches. For example, if a bot sends the same queries to Google over and over again, those queries should really be discarded before we measure how much spam our users see. All of this--log data, IP addresses, and cookie information--makes your search results cleaner and more relevant.
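To make the bot-filtering idea concrete, here is a minimal Python sketch. It assumes a simplified, hypothetical log format (dicts with `ip`, `cookie`, `query`, and `saw_spam` keys) and a made-up `max_repeats` threshold; none of this is Google's actual pipeline, just one way a repeated-query filter could sit in front of a spam metric.

```python
from collections import Counter


def spam_rate(log_entries, max_repeats=3):
    """Fraction of logged searches that showed spam, ignoring likely bot traffic.

    Each entry is a dict with hypothetical keys: ip, cookie, query, saw_spam.
    A query repeated more than max_repeats times by the same (ip, cookie)
    pair is treated as automated and dropped before measuring.
    """
    repeats = Counter((e["ip"], e["cookie"], e["query"]) for e in log_entries)
    kept = [
        e for e in log_entries
        if repeats[(e["ip"], e["cookie"], e["query"])] <= max_repeats
    ]
    if not kept:
        return 0.0
    return sum(e["saw_spam"] for e in kept) / len(kept)


# Invented sample: one bot hammering the same query, plus four ordinary users.
log = (
    [{"ip": "10.0.0.9", "cookie": "bot", "query": "cheap pills", "saw_spam": True}] * 10
    + [
        {"ip": "10.0.0.1", "cookie": "u1", "query": "prostate cancer", "saw_spam": False},
        {"ip": "10.0.0.2", "cookie": "u2", "query": "tartu juuksur", "saw_spam": True},
        {"ip": "10.0.0.3", "cookie": "u3", "query": "gm corn", "saw_spam": False},
        {"ip": "10.0.0.4", "cookie": "u4", "query": "weather", "saw_spam": False},
    ]
)
print(spam_rate(log))  # 0.25 once the bot's repeated queries are discarded
```

Because the function only needs stored log entries, the same metric can be rerun over older logs to see how a newly defined kind of spam was handled in earlier months.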

If you think webspam is a solved problem, think again. Last year Google faced a rash of webspam on Chinese domains in our index. Some spammers were purchasing large amounts of cheap .cn domains and stuffing them with misspellings and porn phrases. Savvy users may remember reading a few blogs about it, but most regular users never even noticed. The reason that a typical searcher didn't notice the odd results is that Google identified the .cn spam and responded with a fast-tracked engineering project to counteract that type of spam attack. Without our logs data to help identify the speed and scope of the problem, many more Google users might have been affected by this attack.

In an ideal world, the vast majority of our users wouldn't even need to know that Google has a webspam team. If we do our job well, you may see low-quality results from time to time, but you won't have to face sneaky JavaScript redirects, unwanted porn, gibberish-stuffed pages or other types of webspam. Our logs data helps ensure that Google detects and has a chance to counteract new spam trends before it lowers the quality of your search experience.


Google: Official Google Blog

Changing Internet and Privacy Erosion

God, how I miss Scott McNealy, Sun Microsystems co-founder & former CEO, and his off-the-cuff but often prescient quotes. Back in 1999, much to the chagrin of privacy advocates, he quipped, “You have zero privacy anyway… Get over it.” Recent developments only give credence to his flippant observation from almost a decade ago.

  • There is legislation in front of the Senate that will pretty much require payment systems to report information on nearly every electronic transaction to the federal government. This is going to impact eBay, Amazon and Google, along with credit card companies. The requirement is buried in Senator Christopher Dodd’s 630-page Senate housing legislation.
  • The U.S. Congress has worked out a deal that gives phone companies immunity for participating in warrantless eavesdropping.
  • Sweden recently passed a new law that allows the Swedish government to eavesdrop on Internet traffic as it sees fit.
  • France is banning illegal Internet downloaders, according to Times UK Online, following some other bans.

Technology-News: GigaOm

Google Secure Pro – Userscripts.org

"Forces gMail, gCal, Google Docs, History, Bookmarks and Reader to use (https) secure connection"

Firefox: del.icio.us/tag/firefox

Download of the Day: FireGPG (Firefox)

Windows/Linux (Firefox): The FireGPG Firefox extension tightly integrates GPG encryption into your Gmail account.

Firefox: del.icio.us/tag/firefox

FireGPG - use GPG easily in Firefox !

FireGPG is a Firefox extension under MPL which brings an interface to encrypt, decrypt, sign or verify the signature of text in any web page using GnuPG.

Firefox: del.icio.us/tag/firefox

Does your password pass the test?



This post is the latest in an ongoing series about online safety. - Ed.

One of the things I work on is password security. And because I'm someone who pays close attention to passwords and how people use them, I sometimes hear interesting stories. For example, a couple of my colleagues are so careful about the security of their passwords that they generate a random eight-character string, memorize it, and then use it as their password for two to three months. After that time elapses, they start the process over again and generate a new random password.

Do we all need to be that careful about our passwords? Probably not. But passwords are one of the web's most important security tools. Whether it's for your Google account, your banking center, or your favorite store, choosing a good password and keeping it safe can go a long way toward protecting your information online.

So how do you choose a good password, and then keep it safe? A few of these tips can help:
  • Avoid common elements when choosing your password. Specifically, you should avoid using words or phrases from the dictionary, especially things that are easy to guess, like "password," "let me in," or the name of the site you're logging into. You should also avoid using keyboard patterns, such as "asdf1234" or "aqswdefr," or personal information, such as birthdays, addresses, or phone numbers.
  • Make your password as unique as possible. Once you've settled on a good base for your password, you should go a step further and add in numbers and non-alphanumerical characters, mix in upper-case letters, or use similar-looking substitutions for parts of the password, such as "$" for "s," "1" for "l," and "0" for "o."
  • Create different passwords for different sites. Doing so will help ensure that if one password is compromised, the others will remain secure. You may not be able to have a unique password for every place you visit on the web (for some of us, that would be a lot of passwords to manage), but alternating between a set of different passwords across the web and making sure all accounts that contain highly sensitive information (like email accounts or online banking accounts) have unique passwords is a good place to start.
  • Don't share your passwords with anyone. Not family, not friends, not anyone. This may seem a little strict, but the reality is the more people you share your password with, the greater your chances of having that password compromised will be. Also, if you need to write your passwords down, keep them away from your computer, and never send them in emails. And if you suspect someone might have discovered one of your passwords, change it immediately.
  • Be careful how you share your information online. Some online services -- such as social networking sites and gadgets that scrape information from other products -- may ask you for a password or an API key. If you choose to use these kinds of services, take a few minutes to learn more about what they do to keep your sensitive information secure. And just like sharing passwords with other people, you should be aware that sharing this information increases the chances that it could be compromised.
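A few of the tips above can be turned into a rough self-check. The Python sketch below is only an illustration: the `COMMON` set, the `weaknesses` function, and its messages are hypothetical names, and the checks cover just the guessable-string, site-name, and character-mix advice. `random_password` mirrors the random eight-character habit mentioned earlier, using the standard library's `secrets` module.

```python
import secrets
import string

# Tiny illustrative blocklist; a real check would use a far larger list.
COMMON = {"password", "letmein", "let me in", "asdf1234", "aqswdefr"}


def weaknesses(password, site_name=""):
    """Return a list of problems found in the password (empty if none)."""
    problems = []
    lowered = password.lower()
    if lowered in COMMON:
        problems.append("common password or keyboard pattern")
    if site_name and site_name.lower() in lowered:
        problems.append("contains the site name")
    if not any(c.isdigit() for c in password):
        problems.append("no digits")
    if not any(c.isupper() for c in password):
        problems.append("no upper-case letters")
    if not any(c in string.punctuation for c in password):
        problems.append("no non-alphanumeric characters")
    return problems


def random_password(length=8):
    """Generate a random password from letters, digits, and punctuation."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return "".join(secrets.choice(alphabet) for _ in range(length))


print(weaknesses("mygoogle1A!", site_name="Google"))  # ['contains the site name']
print(random_password())  # different every run
```

Passing these checks does not make a password strong on its own; it only catches the most obvious mistakes from the list.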
Another thing that can help keep your password secure is choosing a good security question and answer on the sites that offer that option. You've probably seen this before: When you're creating an account on many sites, you will be asked to choose a question to verify your identity if you forget your password.

Some sites will let you write in your own question; in these cases, you should make sure the Q&A you create isn't something that's easy to guess or something that your family and friends would know. Other sites will present you with a list of preset questions to choose from, such as "What is your mother's maiden name?" These kinds of questions are less secure, as the answers are easier for other people to guess. In these cases, you should find a way to make your answer unique -- whether it's using the tips above, or by adding in other information -- so that even if someone guesses the answer, they won't know how to enter it properly.

Read more about choosing a good password and security question.

Google: Official Google Blog

Privacy made easier



Because we're strongly committed to protecting your privacy, we want to present our privacy practices in the clearest way possible. Over the past year, we've been experimenting with video to clarify and illustrate the privacy practices set forth in our Google Privacy Policy. We've used videos to communicate with you about things like cookies, IP addresses, and logs. (Check out the Google Privacy Channel on YouTube.) And you've told us that the screen shots, whiteboard drawings, and pointers from the engineers and product managers we've captured on video are helping you better understand the fine points of our Privacy Policy.

With that in mind, today we're announcing a revamp of our Privacy Center. The new Center is a one-stop shop for privacy resources, with various multi-media formats aimed to help you further understand how we store and use data, how to control who you share your data with, and how we protect your privacy. We hope this new Center will help you make more informed privacy choices whenever you use Google products and services.

Google: Official Google Blog

Making search better in Catalonia, Estonia, and everywhere else



We recently began a series of posts on how we harness the power of data. Earlier we told you how data has been critical to the advancement of search; about using data to make our products safe and to prevent fraud; this post is the newest in the series. -Ed.

One of the most important uses of data at Google is building language models. By analyzing how people use language, we build models that enable us to interpret searches better, offer spelling corrections, understand when alternative forms of words are needed, offer language translation, and even suggest when searching in another language is appropriate.

One place we use these models is to find alternatives for words used in searches. For example, for both English and French users, "GM" often means the company "General Motors," but our language model understands that in French searches like seconde GM, it means "Guerre Mondiale" (World War), whereas in STI GM it means "Génie Mécanique" (Mechanical Engineering). Another meaning in English is "genetically modified," which our language model understands in GM corn. We've learned this based on the documents we've seen on the web and by observing that users will use both "genetically modified" and "GM" in the same set of searches.

We use similar techniques in all languages. For example, if a Catalan user searches for resultat elecció barris BCN (searching for the result of a neighborhood election in Barcelona), Google will also find pages that use the words "resultats" or "eleccions" or that talk about "Barcelona" instead of "BCN." And our language models also tell us that the Estonian user looking for Tartu juuksur, a barber in Tartu, might also be interested in a "juuksurisalong," or "barber shop."

In the past, language models were built from dictionaries by hand. But such systems are incomplete and don't reflect how people actually use language. Because our language models are based on users' interactions with Google, they are more precise and comprehensive -- for example, they incorporate names, idioms, colloquial usage, and newly coined words not often found in dictionaries.

When building our models, we use billions of web documents and as much historical search data as we can, in order to have the most comprehensive understanding of language possible. We analyze how our users searched and how they revised their searches. By looking across the aggregated searches of many users, we can infer the relationships of words to each other.

Queries are not made in isolation -- analyzing a single search in the context of the searches before and after it helps us understand a searcher's intent and make inferences. Also, by analyzing how users modify their searches, we've learned related words, variant grammatical forms, spelling corrections, and the concepts behind users' information needs. (We're able to make these connections between searches using cookie IDs -- small pieces of data stored in visitors' browsers that allow us to distinguish different users. To understand how cookies work, watch this video.)
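As a rough illustration of linking searches through cookie IDs, the Python sketch below groups a toy log by cookie and counts how often one query is followed by a different one within a short window. The log format (cookie ID, timestamp in seconds, query), the `reformulation_counts` name, and the five-minute window are assumptions made for the example, not a description of Google's system.

```python
from collections import Counter, defaultdict


def reformulation_counts(log, max_gap_seconds=300):
    """Count (earlier query, later query) pairs issued by the same cookie ID
    within max_gap_seconds of each other."""
    by_cookie = defaultdict(list)
    for cookie_id, timestamp, query in log:
        by_cookie[cookie_id].append((timestamp, query))

    pairs = Counter()
    for events in by_cookie.values():
        events.sort()  # order each user's searches by time
        for (t1, q1), (t2, q2) in zip(events, events[1:]):
            if q1 != q2 and t2 - t1 <= max_gap_seconds:
                pairs[(q1, q2)] += 1
    return pairs


# Invented sample log: (cookie ID, seconds since start, query).
log = [
    ("a", 0, "GM corn"),
    ("a", 40, "genetically modified corn"),
    ("b", 10, "GM corn"),
    ("b", 30, "genetically modified corn"),
    ("c", 0, "GM corn"),
    ("c", 10000, "weather"),
]
pairs = reformulation_counts(log)
print(pairs[("GM corn", "genetically modified corn")])  # 2
```

Pairs that recur across many different cookies are candidates for related terms; a pair seen once, or hours apart, says very little.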

To provide more relevant search results, Google is constantly developing new techniques for language modeling and building better models. One element in building better language models is using more data collected over longer periods of time. In languages with many documents and users, such as English, our language models allow us to improve results deep into the "long tail" of searches, learning about rare usages. However, for languages with fewer users and fewer documents on the web, building language models can be a challenge. For those languages we need to work with longer periods of data to build our models. For example, it takes more than a year of searches in Catalan to provide as much data as a single day of searching in English; for Estonian, more than two and a half years' worth of searching is needed to match a day of English. Having longer periods of data enables us to improve search for these less commonly used languages.

At Google, we want to ensure that we can help users everywhere find the things they're looking for; providing accurate, relevant results for searches in all languages worldwide is core to Google's mission. Building extensive models of historical usage in every language we can, especially when there are few users, is an essential piece of making search work for everyone, everywhere.

Google: Official Google Blog

Using data to help prevent fraud



We recently began a series of posts on how we harness the power of data. Earlier we told you how data has been critical to the advancement of search technology. Then we shared how we use log data to help make Google products safer for users. This post is the newest in the series. -Ed.

Protecting our advertisers against click fraud is a lot like solving a crime: the more clues we have, the better we can determine which clicks to mark as invalid, so advertisers are not charged for them.

As we've mentioned before, our Ad Traffic Quality team built, and is constantly adding to, our three-stage system for detecting invalid clicks. The three stages are: (1) proactive real-time filters, (2) proactive offline analysis, and (3) reactive investigations.

So how do we use logs information for click fraud detection? Our logs are where we get the clues for the detective work. Logs provide us with a repository of data that we use to detect patterns, anomalous behavior, and other signals indicative of click fraud.

Millions of users click on AdWords ads every day. Every single one of those clicks -- and the even more numerous impressions associated with them -- is analyzed by our filters (stage 1), which operate in real-time. This stage certainly utilizes our logs data, but it is stages 2 and 3 which rely even more heavily on deeper analysis of the data in our logs. For example, in stage 2, our team pores over the millions of impressions and clicks -- as well as conversions -- over a longer time period. In combing through all this information, our team is looking for unusual behavior in hundreds of different data points.

IP addresses of computers clicking on ads are very useful data points. A simple use of IP addresses is determining the source location for traffic. That is, for a given publisher or advertiser, where are their clicks coming from? Are they all coming from one country or city? Is that normal for an ad of this type? Although we don't use this information to identify individuals, we look at these in aggregate and study patterns. This information is imperfect, but analyzed in large volume it is very helpful in preventing fraud. For example, examining an IP address usually tells us which ISP that person is using. It is easy for people on most home Internet connections to get a new IP address by simply rebooting their DSL or cable modem. However, that new IP address will still be registered to their ISP, so additional ad clicks from that machine will still have something in common. Seeing an abnormally high number of clicks on a single publisher from the same ISP isn't necessarily proof of fraud, but it does look suspicious and raises a flag for us to investigate. Other information contained in our logs, such as the browser type and operating system of machines associated with ad clicks, is analyzed in similar ways.
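The ISP signal described above can be sketched in a few lines of Python. This is a hedged illustration only: it assumes each click has already been mapped from its IP address to an ISP label, and the `suspicious_publishers` name and the `min_clicks` and `max_share` thresholds are invented for the example. As the post says, a flag here is a reason to investigate, not proof of fraud.

```python
from collections import Counter, defaultdict


def suspicious_publishers(clicks, min_clicks=20, max_share=0.8):
    """Flag publishers whose clicks come overwhelmingly from one ISP.

    clicks is an iterable of (publisher, isp) pairs.
    Returns {publisher: (dominant_isp, share_of_clicks)}.
    """
    per_publisher = defaultdict(Counter)
    for publisher, isp in clicks:
        per_publisher[publisher][isp] += 1

    flagged = {}
    for publisher, isp_counts in per_publisher.items():
        total = sum(isp_counts.values())
        if total < min_clicks:
            continue  # too little data to say anything
        isp, count = isp_counts.most_common(1)[0]
        share = count / total
        if share > max_share:
            flagged[publisher] = (isp, share)
    return flagged


# Invented sample: pub-a is dominated by one ISP, pub-b is evenly spread,
# and pub-c has too few clicks to judge.
clicks = (
    [("pub-a", "isp-1")] * 45
    + [("pub-a", "isp-2")] * 5
    + [("pub-b", f"isp-{i % 5}") for i in range(50)]
    + [("pub-c", "isp-1")] * 3
)
print(suspicious_publishers(clicks))  # {'pub-a': ('isp-1', 0.9)}
```

The same aggregation could be keyed on browser type or operating system, which the post mentions are analyzed in similar ways.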

These data points are just a few examples of hundreds of different factors we take into account in click fraud detection. Without this information, and enough of it to identify fraud attempted over a longer time period, it would be extremely difficult to detect invalid clicks with a high degree of confidence, and proactively create filters that help optimize advertiser ROI. Of course, we don't need this information forever; last year we started anonymizing server logs after 18 months. As always, our goal is to balance the utility of this information (as we try to improve Google’s services for you) with the best privacy practices for our users.

If you want to learn more about how we collect information to better detect click fraud, visit our Ad Traffic Quality Resource Center.

Google: Official Google Blog

Using log data to help keep you safe



We recently began two new series of posts. The first, which explains how we harness data for our users, started with this post. The second, focusing on how we secure information and how users can protect themselves online, began here. This post is the second installment in both series. - Ed.

We sometimes get questions on what Google does with server log data, which registers how users are interacting with our services. We take great care in protecting this data, and while we've talked previously about some of the ways it can be useful, one thing we haven't covered yet is how it can help us make Google products safer for our users.

While the Internet on the whole is a safe place, and most of us will never fall victim to an attack, there are more than a few threats out there, and we do everything we can to help you stay a step ahead of them. Any information we can gather on how attacks are launched and propagated helps us do so.

That's where server log data comes in. We analyze logs for anomalies or other clues that might suggest malware or phishing attacks in our search results, attacks on our products and services, and other threats to our users. And because we have a reasonably significant data sample, with logs stretching back several months, we're able to perform aggregate, long-term analyses that can uncover new security threats, provide greater understanding of how previous threats impacted our users, and help us ensure that our threat detection and prevention measures are properly tuned.

We can't share too much detail (we need to be careful not to provide too many clues on what we look for), but we can use historical examples to give you a better idea of how this kind of data can be useful. One good example is the Santy search worm (PDF), which first appeared in late 2004. Santy used combinations of search terms on Google to identify and then infect vulnerable web servers. Once a web server was infected, it became part of a botnet and started searching Google for more vulnerable servers. Spreading in this way, Santy quickly infected thousands and thousands of web servers across the Internet.

As soon as Google recognized the attack, we began developing a series of tools to automatically generate "regular expressions" that could identify potential Santy queries and then block them from accessing Google.com or flag them for further attention. But because regular expressions like these can sometimes snag legitimate user queries too, we designed the tools so they'd test new expressions in our server log databases first, in order to determine how each one would affect actual user queries. If it turned out that a regular expression affected too many legitimate user queries, the tools would automatically adjust the expression, analyze its performance against the log data again, and then repeat the process as many times as necessary.
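The test-against-the-logs loop can be sketched as follows. This Python example is a simplification under stated assumptions: the candidate patterns and sample queries are invented, "adjusting" an expression is modeled as moving on to the next, narrower candidate, and the `false_positive_rate` and `pick_pattern` names and the 1% threshold are hypothetical rather than the actual tools.

```python
import re


def false_positive_rate(pattern, legitimate_queries):
    """Fraction of known-legitimate logged queries the pattern would catch."""
    if not legitimate_queries:
        return 0.0
    regex = re.compile(pattern, re.IGNORECASE)
    hits = sum(1 for q in legitimate_queries if regex.search(q))
    return hits / len(legitimate_queries)


def pick_pattern(candidates, legitimate_queries, max_rate=0.01):
    """Return the first candidate (ordered broad to narrow) whose
    false-positive rate against the log sample is acceptable, else None."""
    for pattern in candidates:
        if false_positive_rate(pattern, legitimate_queries) <= max_rate:
            return pattern
    return None


# Invented log sample of legitimate user queries.
legitimate = ["weather today"] * 150 + ["viewtopic.php error help"] * 50

# Invented candidates, from broad to narrow.
candidates = [
    r"viewtopic\.php",
    r"allinurl:\s*viewtopic\.php\s+\d+",
]

print(false_positive_rate(candidates[0], legitimate))  # 0.25, too broad
print(pick_pattern(candidates, legitimate) == candidates[1])  # True
```

The useful property is that every candidate is scored against real historical queries before it is allowed to block anything.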

In this instance, having access to a good sample of log data meant we were able to refine one of our automated security processes, and the result was a more effective resolution of the problem. In other instances, the data has proven useful in minimizing certain security threats, or in preventing others completely. In the end, what this means is that whenever you use Google search, or Google Apps, or any of our other services, your interactions with those products help us learn more about security threats that could impact your online experience. And the better the data we have, the more effectively we can protect all our users.

Google: Official Google Blog

How Google keeps your information secure



As many of you know, we spend a lot of time around here thinking about new products to help you run your life more efficiently, whether that’s organizing email in a better way, sharing pictures with friends, or collaborating in real time on documents. What you may not know is that we also spend a lot of time thinking about the security that goes into those products, and more specifically the ways we can protect you and your private information.

While the chances are that you'll never have a security problem, we take security very seriously, and that's why we have some of the best engineers in the world working here to secure information. Much of their work is confidential, but we do want to share some of the ways we're protecting your data. There are a few things you should know about how we handle confidential information:
  • Philosophy: First is our philosophy. At Google, security is a continuous process. We don't just "check" a product for security before we launch it -- we are thinking about security before the product is even created, and we are building it in throughout the product's development. Also critical is our belief in layered protection. It's much like securing your house. You put your most private information in a safe. You secure the safe in your house, which is protected with locks and possibly an alarm system. And then you have the neighborhood watch program or the local police monitoring your neighborhood. It's very similar at Google. Our most sensitive information is difficult to find or access (the safe). Our network and facilities (the house) are protected in both high- and low-tech ways: encryption, alarms, and other technology for our systems, and strong physical security at our facilities. And finally, we've learned that when security is done right, it's done best as a community (the neighborhood); we encourage everyone to help us identify potential problems and solutions. Researchers who work at security and technology companies all over the world are constantly looking for security problems on the Internet, and we work closely with that community to find and fix potential problems.
  • Technology: These layers of protection are built on the best security technology in the world. While we employ products developed by others in the security community, we build a lot of our security technology ourselves. Some of the most innovative components of our security architecture focus on automation and scale. These are important to us because we're handling searches, emails, and other activities for millions of users every day. To keep our security processes a step ahead, we automate the way we test our software for possible security vulnerabilities and the way we monitor for possible security attacks. We're also constantly seeking more ways to use encryption and other technical measures to protect your data, while still maintaining a great user experience.
  • Process: In addition to technology, we have a set of processes that dictate how we secure confidential information at Google and who can access it. We carefully manage access to confidential information of any sort, and very few Googlers have access to what we consider very sensitive data. This is in no small part because there's very little reason for us to provide that access -- most of our processes are automated, and don't require much human intervention. Of course, the limited number of people who are granted access to sensitive data must have special approval. And while we hold ourselves to a very high standard, we also work to ensure that our processes meet (and in many cases exceed) industry standards. These include audits for Sarbanes-Oxley, SAS 70, PCI (payment card industry) compliance, and more. By working with independent auditors, who evaluate compliance with standards that hold hundreds of different companies to very rigorous requirements, we add another layer of checks and balances to our security processes.
  • People: The most important part of our approach to security is our people. Google employs some of the best and brightest security engineers in the world. Many of our engineers came from very high-profile security environments, such as banks, credit card companies, and high-volume retail organizations, and a large number of them hold PhDs and patents in security and software engineering. As you can imagine, our engineers are smart and curious and are on the lookout for security anomalies and best practices in the industry. Our engineers have published hundreds of academic papers on technically detailed topics such as drive-by downloads that install malware (PDF file) or hostile virtualized environments. (You can find some of these papers here.) What's more, we cultivate a collaborative approach to security among all of our engineers, requiring everyone to pass a coding style review (which enables us to control the type of code used here and how it's used in order to prevent software problems) and ensuring that all code at Google is reviewed by multiple engineers so that it meets our software and security standards.
And throughout the company, we use our own products. That means we protect your information with the same security that we use to protect our own company emails and documents. And while we continue to innovate with our products, we'll also continue to innovate in the world of security. For more on our approach to security, visit our Security and Product Safety page.

Google: Official Google Blog

Why data matters



We often use this space to discuss how we treat user data and protect privacy. With the post below, we're beginning an occasional series that discusses how we harness the data we collect to improve our products and services for our users. We think it's appropriate to start with a post describing how data has been critical to the advancement of search technology. - Ed.

Better data makes for better science. The history of information retrieval illustrates this principle well.

Work in this area began in the early days of computing, with simple document retrieval based on matching queries with words and phrases in text files. Driven by the availability of new data sources, algorithms evolved and became more sophisticated. The arrival of the web presented new challenges for search, and now it is common to use information from web links and many other indicators as signals of relevance.

Today's web search algorithms are trained to a large degree by the "wisdom of the crowds" drawn from the logs of billions of previous search queries. This brief overview of the history of search illustrates why using data is integral to making Google web search valuable to our users.

A brief history of search

Nowadays search is a hot topic, especially with the widespread use of the web, but the history of document search dates back to the 1950s. Search engines existed in those ancient times, but their primary use was to search a static collection of documents. In the early 60s, the research community gathered new data by digitizing abstracts of articles, enabling rapid progress in the field in the 60s and 70s. But by the late 80s, progress in this area had slowed down considerably.

In order to stimulate research in information retrieval, the National Institute of Standards and Technology (NIST) launched the Text Retrieval Conference (TREC) in 1992. TREC introduced new data in the form of full-text documents and used human judges to classify whether or not particular documents were relevant to a set of queries. They released a sample of this data to researchers, who used it to train and improve their systems to find the documents relevant to a new set of queries and compare their results to TREC's human judgments and other researchers' algorithms.

The TREC data revitalized research on information retrieval. Having a standard, widely available, and carefully constructed set of data laid the groundwork for further innovation in this field. The yearly TREC conference fostered collaboration, innovation, and a measured dose of competition (and bragging rights) that led to better information retrieval.

New ideas spread rapidly, and the algorithms improved. But with each new improvement, it became harder and harder to improve on last year's techniques, and progress eventually slowed down again.

And then came the web. In its beginning stages, researchers used industry-standard algorithms based on the TREC research to find documents on the web. But the need for better search was apparent -- now not just for researchers, but also for everyday users -- and the web gave us lots of new data in the form of links that offered the possibility of new advances.

There were developments on two fronts. On the commercial side, a few companies started offering web search engines, but no one was quite sure what business models would work.

On the academic side, the National Science Foundation started a "Digital Library Project" which made grants to several universities. Two Stanford grad students in computer science named Larry Page and Sergey Brin worked on this project. Their insight was to recognize that existing search algorithms could be dramatically improved by using the special linking structure of web documents. Thus PageRank was born.

How Google uses data

PageRank offered a significant improvement on existing algorithms by ranking the relevance of a web page not by keywords alone but also by the quality and quantity of the sites that linked to it. If I have six links pointing to me from sites such as the Wall Street Journal, New York Times, and the House of Representatives, that carries more weight than 20 links from my old college buddies who happen to have web pages.
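
To make the idea concrete, here is a minimal sketch of link-based ranking in Python. It is an illustration under stated assumptions, not Google's production algorithm: it uses the textbook power-iteration form of PageRank with a damping factor of 0.85, and the names `pagerank`, `hub_*`, `reader_*`, `buddy_*`, `me`, and `other` are all hypothetical. In the toy graph, `me` gets six links from well-linked hub pages, while `other` gets 20 links from pages that nothing links to.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank (illustrative sketch only).

    links maps each page to the list of pages it links to.
    Returns a dict mapping every page to a score; the scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: six well-linked hubs point to "me",
# twenty unlinked buddy pages point to "other".
hubs = [f"hub_{i}" for i in range(6)]
toy_links = {f"reader_{i}": list(hubs) for i in range(30)}
toy_links.update({hub: ["me"] for hub in hubs})
toy_links.update({f"buddy_{i}": ["other"] for i in range(20)})

scores = pagerank(toy_links)
print(scores["me"] > scores["other"])
```

In this toy graph the six hub links outweigh the 20 buddy links, because each hub passes along score that it first collected from its own readers.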

Larry and Sergey initially tried to license their algorithm to some of the newly formed web search engines, but none were interested. Since they couldn't sell their algorithm, they decided to start a search engine themselves. The rest of the story is well-known.

Over the years, Google has continued to invest in making search better. Our information retrieval experts have added more than 200 additional signals to the algorithms that determine the relevance of websites to a user's query.

So where did those other 200 signals come from? What's the next stage of search, and what do we need to do to find even more relevant information online?

We're constantly experimenting with our algorithm, tuning and tweaking on a weekly basis to come up with more relevant and useful results for our users.

But in order to come up with new ranking techniques and evaluate if users find them useful, we have to store and analyze search logs. (Watch our videos to see exactly what data we store in our logs.) What results do people click on? How does their behavior change when we change aspects of our algorithm? Using data in the logs, we can compare how well we're doing now at finding useful information for you to how we did a year ago. If we don't keep a history, we have no good way to evaluate our progress and make improvements.

To choose a simple example: the Google spell checker is based on our analysis of user searches compiled from our logs -- not a dictionary. Similarly, we've had a lot of success in using query data to improve our information about geographic locations, enabling us to provide better local search.
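
As a rough illustration of how a spelling suggestion could come from logged queries rather than a dictionary, here is a minimal sketch in Python. It is an assumption-laden toy, not a description of the actual Google spell checker: it suggests whichever logged query within one edit of the input was seen most often, and the names `edits1`, `suggest`, and the sample log are all hypothetical.

```python
from collections import Counter


def edits1(word):
    """Return every string one edit (delete, swap, replace, insert) from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def suggest(query, query_counts):
    """Suggest the most frequently logged query within one edit of query.

    query_counts maps each logged query to how often it was seen.
    The input is returned unchanged if nothing similar was logged.
    """
    candidates = {query} | edits1(query)
    seen = [c for c in candidates if c in query_counts]
    if not seen:
        return query
    return max(seen, key=lambda c: query_counts[c])


# Hypothetical query log: the common spelling dominates the rare one.
sample_log = Counter(["britney"] * 3 + ["brittney"] + ["pagerank"] * 2)
print(suggest("brittney", sample_log))
```

The point of the sketch is that the "correct" spelling is simply the one most users typed, so no dictionary is needed, and the suggestions improve as the log grows.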

Storing and analyzing logs of user searches is how Google's algorithm learns to give you more useful results. Just as data availability has driven progress of search in the past, the data in our search logs will certainly be a critical component of future breakthroughs.

Google: Official Google Blog

Google Health, a first look



It's been a busy week for the Google Health team. Last week we announced our partnership and pilot with the Cleveland Clinic. This week, the team has been at the HIMSS (Healthcare Information and Management Systems Society) conference in Orlando, Florida, where Eric Schmidt gave the closing keynote. Eric's keynote marks the first time we've talked publicly about the product we've been designing and building. His talk also offered a deeper view into our overall health strategy. (Watch the video.)

Google Health aims to solve an urgent need that dovetails with our overall mission of organizing patient information and making it accessible and useful. Through our health offering, our users will be empowered to collect, store, and manage their own medical records online.

For the healthcare industry, online personal health records (PHRs) aren't a new idea and, in some cases, online PHRs already exist for patients. Here's what we think sets Google Health apart:
  • Privacy and Security - Due to the sensitive and personal nature of the data that will be stored in Google Health, we need to conduct our health service with the same privacy, security, and integrity users have come to expect in all our services. Google Health will protect the privacy of your health information by giving you complete control over your data. We won't sell or share your data without your explicit permission. Our privacy policy and practices have been developed in thoughtful collaboration with experts from the Google Health Advisory Council.
  • Platform - One of the most exciting and innovative parts of Google Health is our platform strategy. We're assembling a directory of third-party services that interoperate with Google Health. Right now, this means you'll be able to automatically import information such as your doctors' records, your prescription history, and your test results into Google Health in order to easily access and control your data. Later, this platform strategy will mean that you will be able to interact with services and tools easily, and will be able to do things like schedule appointments, refill prescriptions, and start using new wellness tools.
  • Portability - Our Internet presence ultimately means that through Google Health, you will have access to and control over your health data from anywhere. Through the Cleveland Clinic pilot, we have already found great use cases. For example, people who spend 6 months of the year in Ohio and 6 months of the year in Florida or Arizona will now be able to move their health data between their various health providers seamlessly and with total control. Previously, this would have required carrying paper records back and forth. With Google Health, the user can simply import the data from each medical facility and then choose to share it with the other facilities. It's advances in data portability like this that we think can really make a difference in the quality of healthcare. The clearer and more comprehensive the information regarding your health becomes, the better your care will be.
  • User focus - We aren't doctors or healthcare experts, but one thing Google can create is a clean, easy-to-use user experience that makes managing your health information straightforward and easy. We're still iterating and testing our user interface, but here is what the welcome screen looks like:

    Here is a screenshot deeper in the application:
We're proud of the product that we've designed and are continuing to build, but recognize that we are just at the initial stages of our "launch early and iterate" strategy. We look forward to the feedback we will receive from our Cleveland Clinic pilot, from all of you, and from the initial users of our service when we make it publicly available in the coming months.
Update: Added link to video of Eric's talk; refreshed second screenshot.

Google: Official Google Blog

CustomizeGoogle: Improve Your Google Experience -- Firefox Extension

CustomizeGoogle is a Firefox extension that enhances Google search results by adding extra information (like links to Yahoo, Ask.com, MSN etc) and removing unwanted information (like ads and spam).

Firefox: del.icio.us/tag/firefox
