Google Web Services

Tuesday, September 30, 2008

Your Call to Action

If you've read the entire chapter, you know what a Web service is, how the Google Web Services fits within the general definition of a Web service, and what you can use the Google Web Services to do. You can use this knowledge to create opportunities to exercise Google as a search engine for all kinds of tasks. Data mining is increasing in importance as companies strive to gain more from the resources of the Internet—this technique accesses the needed information and discards unneeded information. At this point, you also have a machine that's setup to create a Google Web Services application of some sort and you have the Google Web Services Kit installed.

The next step of the process is to evaluate where you're going based on the content of this chapter. You need to consider what you want to do with the information Google provides, how you plan to present it, your own capabilities, and the capabilities of the person using your application. This may sound like a lot of work, but it's important to create a firm foundation for your application. When you take these preliminary steps, you begin thinking about problems and solutions to those problems.

Chapter 2 builds on the knowledge you gained in this chapter. The emphasis of Chapter 2 is on data mining—the process of using specialized search techniques to whittle search results down to just the links you need. In many cases, you can also access these search techniques using Google search page, but the emphasis of data mining is automation. Only by using the Google Web Services can you automate the search process and then display the results in the form you want, rather than rely on Google's formatting methodology.

Defining Static and Dynamic Data

Web applications can include the concept of static and dynamic data. Dynamic data is the best type to use for Web services because it reflects changes in the Google database. An application gains important benefits by using dynamic data. For example, you won't try to access an old Web site that Google used to list because the dynamic nature of your application automatically removes the link from the list of results.

Unfortunately, dynamic data can also cause problems. For one thing, you need a connection to the Internet to work with dynamic data. When you use a desktop machine, maintaining a connection usually isn't a problem. However, many third party developers are working on applications where a connection might not be available, such as a research list application for a PDA. You download the information from Google Web Services and then use it to create a report while on the road—the connection doesn't exist while you're on the road so the data is no longer dynamic.

Note Google does support the concept of cached data that is stored from the original Web site, but even the cached data ages at some point and becomes unavailable. Cached data does have an important use. For example, you can use cached data to obtain copies of old articles that a Web site no longer carries.

Using the term dynamic to refer to application data is also somewhat of a misnomer. Nothing is truly a dynamic data application. The moment the response to your query leaves the Google server, it begins to age. The data doesn't change once it leaves the Google server, so in reality it isn't truly dynamic. The only way you can achieve a dynamic presentation of sorts is to make multiple queries. You must define how often is often enough for your needs. Google doesn't provide any guidance in this case because search result viability varies by person, focus, and need.

These facts lead into the discussion of static data. Truly static data never changes at all. Most Web sites still rely on static data presentation because the information they display doesn't change often enough to warrant a dynamic presentation. When you make a single query to Google Web Services, the response you receive is static data. It's a snapshot of that particular part of the database at a specific time. The data won't change unless you make another query.

Understanding the static and dynamic nature of data is important when you design an application that relies on Google Web Services. Errors creep into the presentation you create as the data from Google Web Services ages on your system. Part of the design process for your application is to determine how much error you can accept.

Making Sensible Queries

Google Web Services can help you perform a number of tasks. The problem is that each request and response consumes resources. To get the most from this Web service, you need to optimize the requests and responses so that the value of the information you receive exceeds the cost of transmitting and manipulating the data.

Creating a request and then handling the response has several costs associated with it. Some of the costs are real world in that you must provide the infrastructure required to perform the task. Inefficient queries could mean adding additional bandwidth capacity or providing additional servers (if you make enough queries). Some costs are employee related—inefficient queries mean more waiting time as the computer crunches the data. Finally, inefficient queries can incur intangible costs. For example, people can become frustrated with poor query results, which affects their performance. Some of these costs are impossible to measure accurately, but they're real.

I often rely on the online search engine to help tune queries. Using the Advanced Search (http://www.google.com/advanced_search/) page shown in Figure 1.1 can help you define and tune searches to obtain maximum data with minimal resource use. For example, computer technology is quickly outdated, so I normally provide a date range as part of my search. Using the online search to customize the date range for specific keywords can greatly enhance performance.

The Advanced Search page can help you tune keyword order—making it possible to reduce the number of keyword permutations you use on an expanded search (see the "Conducting an Expansion Search" section of the chapter for details). It also helps you decide on how to use permanent keywords. For example, a site that sells a specific product might include the product name as a permanent search term—one that is always included even if the user doesn't specify it.

One of the features the kit provides is a more complete list of special phrases and characters you can use for a search. Although the Advanced Search page will help you ferret out many of these search features, you won't find them all. For example, the Advanced Search page includes a blank for a search phrase, which is different from a keyword in that the search phrase must appear as specified on the target page (keywords can appear in any order). Fortunately, all of the special keywords and characters mentioned in the kit also work on the Advanced Search page so you can try them out. (Chapter 2 discusses search techniques in detail.) Depending on which special features you use for a search, Google Web Services output might not provide the information you imagined. Consequently, it pays to try these special features out to see what effect they have on your search results.

Tip You might wonder why I'm suggesting such heavy use of the Advanced Search page. Google allows you to make 1,000 requests per day using Google Web Services. The Advanced Search page doesn't have such a limit—making it easier to keep testing search techniques until you find the technique you want to use in your code. At that point, you can start making requests from Google Web Services to test your code. Don't waste calls on search techniques.

Even when you create a perfect search and properly filter the results using code, the information you receive from Google Web Services might not fulfill every need. At some point, you need to perform some level of human filtering. Users will need to state a preference or define how well a particular search result works. Only by tuning the filter can you hope to obtain specific results from Google Web Services. Tuning makes it possible to reduce search times from hours to minutes.

Limitations of Google Web Services Output

Many developers are used to working with a variety of data types when creating applications. Data types help define the kind of data you're using. For example, if a data element is a number, you might use an integer (a number without a decimal) or real number (one that has a decimal and equates to a Single or Double for Visual Basic developers). A Web service has no concept of data type when it comes to the data itself. Every data transfer is text. The XML used to transfer the data does include type information, but of the sort that's normally associated with database fields, which means you have to know the field names to make an interpretation. For example, you might receive data in a message like the one shown here.

12k

... some text highlight) more text ...

True

http://www.mwt.net/~jmueller

<b>DataCon Services</b>

You don't have to understand the XML portion of this message segment, but look at the data. Google Web Services sends all data as characters (as do all other Web services) and defines the data using tags (the words between the angle brackets) and attributes (extra information within the tag). For example, the line that contains http://www.mwt.net/~jmueller includes the tag that tells you that this value is http://www.mwt.net/~jmueller and that the tag type is an xsi:type=“xsd:string”. The tag tells you what kind of information this is. By knowing the Google database layout, you also know the data type and other information about the entry. However, the information you receive from Google is still plain text. You can see other examples of XML responses in the \GoogleAPI\soap-samples of the kit. Simply open them using Internet Explorer or another browser that supports XML

Your browser is actually very handy for viewing XML data, even if it might not make sense right now. The "Viewing XML Data in Your Browser" section of Chapter 3 discusses in detail how you can use your browser. For right now, all you need to know is that you can look at the various kinds of XML responses by opening the files in your browser.

Figure 1.5 points to another potential problem with Web service output. All of the tags and other information supplied in a request and response consume space. The file is larger than a text file with the same data because of all the tag information required. In addition, it's far more efficient to store many data types in their native format, rather than use characters. Consequently, Web service data suffers from bloat. The data uses more bandwidth than a binary message and consequently, you could experience performance problems. Because of this issue, you need to create efficient queries for your application that maximize data throughput despite the limitations of the XML format. The "Making Sensible Queries" section of the chapter discusses this issue in detail.

The results you obtain from Google are largely a matter of the input you provide in the form of a request. The "Conducting an Expansion Search" section of the chapter points out a serious flaw in making any assumptions about the return you receive from Google. The query can become quite complex because even the order of the words makes a difference in the results you receive. Google must make this assumption because most people enter the words in the order they think about them, which is usually most important to least important. Consequently, if you always assume that your first query returns all possible results, you'll find Google Web Services disappointing.

The ranking of results you receive from Google Web Services is also unlikely to be the same as the ranking you need. Google sells keywords to make some sites turn up higher in the result list. In addition, Google often bases the site ranking on criteria that won't match your own, such as the number of times that a keyword appears. The bottom line is that the output you receive from Google is "raw" output—information that you haven't filtered or organized in any way. One of the reasons to use Google Web Services is to enable you to perform tasks such as site ranking so the results appear in the order that's best for your organization.

Knowing What to Expect as Output

For many developers, the idea of a Web service is easy to grasp—knowing what to expect from it is hard. The output begins with a certain amount of raw data that you'll receive from the Web service. However, the raw data doesn't really define the Web service output completely. You also need to consider quantifiable components such as the input to the Web service and that data manipulation you'll perform. In addition, there are variant elements to the output, such as the timeliness of the data. Finally, you need to consider the intangible elements. The output has some value to you, but someone else will view the output in another way. Concepts such as relevancy are difficult to quantify or even define.

Google Web Services is no different from any other Web service when it comes to output. You'll provide input, receive raw data, manipulate that data in some way, and view the output—the result of everything you have done with the Web service. The following sections discuss various elements of Web service output as they relate to Google Web Services. These sections provide an overview—the book continues to explore the subject in other chapters. However, this is the starting point—the point at which you start to consider what to expect as output from your efforts.

Emulating the Real World

Developers often live in a laboratory. In the laboratory, everyone has the proper equipment, fast machines, and an even faster connection. The user never disconnects unexpectedly and always knows how to get the most out of their computer. The problem with the lab is that it doesn't model the real world. In the real world, users get bored, try odd key combinations just to see what they do, don't understand their computer very well, but do know how to complain about the smallest application problems. If you want to avoid problems with the application you develop, you need to create a development environment that models the real world.

It's also easy to get lost in the development environment setup. Make sure you understand the person who uses your application. For example, it's quite possible that only desktop users will have any interest in your site on desktop machine maintenance, but you need to determine that fact in some way (online surveys work well). You also don't want to spend a lot of time testing the application to meet the needs of users who have no use for your product. Again, surveys and newsgroup polls are helpful in determining the real world environment that you must emulate with your system.

Using Multiple Test Devices

If your application will appear on the Internet, you need to test using multiple devices. It's no longer safe to assume that only desktop users will have an interest in your application. You might attract Personal Digital Assistant (PDA) and cellular telephone users as well. This is especially true of a Web application that helps users find a particular kind of information quickly. People often rely on these applications when time is tight and they don't have time to look for a product themselves.

Note Not every developer is concerned about writing applications for every platform—sometimes it's a matter of time; other times it's a matter of skill or perceived need. When an application you write falls into this category, you can still provide a modicum of support for wireless users by directing them to Google Wireless Services at http://www.google.com/options/wireless.html./

It would be nice if everyone could afford to test every application on every device, but that's not realistic for the developer. Sometimes you need to use an emulator to perform the testing because you don't have the real device handy. Fortunately, you can find a vast array of useful emulators on the Internet—everything from the Pocket PC to cellular telephones of all types. Emulators have limitations, but they do make good test devices in many cases. We'll discuss the advantages and concerns of using emulators in the "Working with Emulators" section of Chapter 9.

Sometimes it also helps to have multiple desktop machine setups. For example, you might need to consider how a Web page looks and acts in Netscape versus Internet Explorer. (Theoretically, you can run both browsers from the same machine, but doing so causes interference problems that some developers find distasteful.) Differences in how the browsers react to specific Web page designs could cause problems in your application. In some cases, you'll need multiple machines to perform this kind of testing. For example, you might need to consider how the application looks on a Macintosh versus a PC if your application has broad enough appeal. Obviously, you can still write Google Web Services applications if you don't have a multiple machine setup, but having more than one machine does make development tasks a lot easier and less error prone.