Computing Power on Tap

Imagine that every time you plugged in a toaster, you had to decide which power station should supply the electricity. Worse still, you could select only from those power stations that were built by the company that made the toaster. If the power station chosen happened to be running at full capacity, no toast. Replace the toaster with a personal computer and electrical power with processing power, and you have a measure of the frustration facing those who dream of distributing large computing problems to dozens, hundreds or even millions of computers via the Internet.

A growing band of computer engineers and scientists want to take the toaster analogy to its logical conclusion with a proposal they call the Grid. Although much of it is still theoretical, the Grid is, in effect, a set of software tools which, when combined with clever hardware, would let users tap processing power off the Internet as easily as electrical power can be drawn from the electricity grid. Many scientific problems that require truly massive amounts of computation — designing drugs from their protein blueprints, forecasting local weather patterns months ahead, simulating the airflow around an aircraft — could benefit hugely from the Grid. And as the Grid bandwagon gathers speed, the commercial pay-off could be handsome.

The processor in your PC sits idle most of the time, waiting for you to send an e-mail or launch a spreadsheet or word-processing program. So why not put it to good, and perhaps even profitable, use by allowing your peers to tap into its unused power? In many ways, peer-to-peer (P2P) computing, as the pooling of computer resources has been dubbed, represents an embryonic stage of the Grid concept, and gives a good idea of what a fully fledged Grid might be capable of.

Anything@home

The peer-to-peer trend took off with the search for little green men. SETI@home is a screen-saver program that painstakingly sifts through signals recorded by the giant Arecibo radio telescope in Puerto Rico, in a search for extraterrestrial intelligence (hence the acronym SETI). So far, ET has not called. But that has not stopped 3m people from downloading the screen-saver. The program periodically prompts its host to retrieve a new chunk of data from the Internet, and sends the latest processed results back to SETI’s organisers. The initiative has already clocked up the equivalent of more than 600,000 years of PC processing time.

The best thing about SETI@home is that it has inspired others. Folding@home is a protein-folding simulation for sifting potential drugs from the wealth of data revealed by the recent decoding of the human genome. Xpulsar@home sifts astronomical data for pulsars. Evolutionary@home is tackling problems in population dynamics. While the co-operation is commendable, this approach to distributed computing is not without problems. First, a lot of work goes into making sure that each “@home” program can run on different types of computers and operating systems. Second, the researchers involved have to rely on the donation of PC time by thousands of individuals, which requires a huge public-relations effort. Third, the system has to deal with huge differences in the rate at which chunks of data are processed — not to mention the many chunks which, for one reason or another, never return.
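The third problem is commonly tackled by issuing a chunk again if no result comes back within a deadline, and keeping the first answer that arrives. The Python sketch below illustrates that idea under stated assumptions: the `WorkUnit` and `Dispatcher` names, the timeout rule and the simulated volunteers are all hypothetical. This is not the code that SETI@home or any other project actually runs.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkUnit:
    unit_id: int
    data: list
    result: Optional[float] = None
    issued_at: Optional[int] = None  # time step at which the unit was last handed out


class Dispatcher:
    """Hypothetical server that hands out chunks of data and reissues lost ones."""

    def __init__(self, units, timeout):
        self.units = {u.unit_id: u for u in units}
        self.timeout = timeout

    def next_unit(self, now):
        # Hand out the first unit that has no result yet and is either
        # unissued or overdue (its previous volunteer may have vanished).
        for u in self.units.values():
            overdue = u.issued_at is not None and now - u.issued_at > self.timeout
            if u.result is None and (u.issued_at is None or overdue):
                u.issued_at = now
                return u
        return None  # everything outstanding is still within its deadline

    def report(self, unit_id, result):
        unit = self.units[unit_id]
        if unit.result is None:  # first answer wins; late duplicates are ignored
            unit.result = result

    def done(self):
        return all(u.result is not None for u in self.units.values())


def simulate(num_units=20, drop_rate=0.3, timeout=5, seed=1):
    """Simulate volunteers who silently drop a fraction of the chunks they receive."""
    rng = random.Random(seed)
    units = [WorkUnit(i, [rng.random() for _ in range(100)]) for i in range(num_units)]
    server = Dispatcher(units, timeout)
    now = 0
    while not server.done():
        unit = server.next_unit(now)
        if unit is not None and rng.random() >= drop_rate:
            # The volunteer "processes" the chunk (here just a sum) and reports back.
            server.report(unit.unit_id, sum(unit.data))
        now += 1
    return server, now


server, steps = simulate()
print(f"all {len(server.units)} chunks processed after {steps} time steps")
```

Reissuing work wastes some donated cycles on duplicates. In exchange, the server never has to know why a chunk went missing, which suits a network of machines it does not control.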

One solution is to pay somebody to solve these problems. In 1997 a company called Entropia was set up in San Diego, California, to broker the processing power of idle computers. Within two years, the company had 30,000 volunteer computers and a total processing speed of one teraflop (a trillion floating-point operations per second), comparable with many a supercomputer. Entropia is using this power for a range of applications, such as its “FightAidsAtHome” project, which evaluates research that could lead to new AIDS drugs. Parabon, another processing broker, has a “Compute-Against-Cancer” project, which analyses patients’ responses to chemotherapy. For processing-power brokers, aiming at the medical business is a wise move: bioinformatics tools and services are expected to grow into a multibillion-dollar market over the next five years.

But overnight fortunes are not going to be made from broking processing power. Ian Foster, a Grid expert at Argonne National Laboratory, near Chicago, cautions that “the business models that underlie this new industry are still being debugged.” United Devices, a company in Austin, Texas, which was started by David Anderson (who is the brains behind SETI@home), plans to offer incentives such as frequent-flier miles and sweepstakes to encourage membership of its commercial peer-to-peer projects. One broking service, Popular Power, has already disappeared and others are going the same way.

One challenge facing commercial P2P brokers, which will haunt commercial applications of the Grid as well, is philanthropy. In April, Intel launched a philanthropic peer-to-peer program that lets PC owners give their excess processor time to a good cause. The first software package that can be downloaded under this program, developed by United Devices in collaboration with a consortium of cancer research centres, is aimed at optimising the molecular structure of possible drugs for fighting leukaemia. While this may be good promotion for United Devices, it remains to be seen how many of these brokerage firms can turn a profit.

Safety in Numbers

If the P2P data-processing paradigm is the equivalent of assembling a ragtag army of poorly trained conscripts, the other end of the computing spectrum is a commando unit made up of a tightly knit set of PCs dedicated to solving a single problem. One of the pioneering efforts in this field was called Beowulf, after the hero of the legendary poem who slays two man-eating monsters. The original Beowulf, built in 1994 by Thomas Sterling and Don Becker at the Universities Space Research Association in Maryland, was a cluster of 16 off-the-shelf processors linked together by Ethernet cables. Its success inspired many others to build quick-and-dirty number crunchers out of cheap components, thus slaying the twin monsters of supercomputing: time and money.

Beowulf’s success depended not so much on the architecture of the computer (many commercial supercomputers are based on large arrays of fairly standard processors) as on the price/performance ratio of the network technology. Supercomputers require proprietary “glue” to link their processors together, and this is both expensive and time-consuming to develop. Beowulf-like systems tend to use fast but affordable Ethernet technology. Some of Beowulf’s offspring, referred to as “commodity clusters”, now rank among the top 50 of the world’s fastest supercomputers, offering speeds of up to 30 teraflops. They can often cost less than a hundredth of the price of an equivalent supercomputer. Cluster construction, which started as a nerdy hobby, is now a mature industry, with turnkey solutions offered by manufacturers such as Compaq and Dell.

The intellectual link between Beowulf and the Grid is that, as transmission speeds on the Internet increase, clusters no longer need to be in the same room, the same building, or even the same country. In some sense, this is old news. A software system called Condor, devised by Miron Livny and colleagues at the University of Wisconsin in Madison during the 1980s, combined the computing power of workstations in university departments. With a Condor system, researchers can access the equivalent of a cluster of several hundred computers.

In a similar way, a number of European supercomputer facilities were linked together in the late 1990s as part of a project called Unicore that was run by a German research consortium. Using Unicore software, users can submit huge number-crunching problems without having to worry about what operating systems, storage systems or administrative policies will be used on the machines that do the work.

Between them, SETI@home, Beowulf, Condor and Unicore all contain elements of what the Grid’s visionaries are after: massive processing resources linked together by clever software, so that, from a user’s perspective, the whole network melds into one giant computer. To emphasise this, the latest extension of the Unicore project has been dubbed E-Grid. To purists, however, this is only the beginning. They believe that Grid technology should blur the distinction between armies of individual P2P computers, dedicated commodity clusters, and loose supercomputer networks. Ultimately, a PC user linked to the Grid should not need to know anything about how or where the data are being processed — just as a person using a toaster does not need to know whether the electricity is coming from a wind farm, a hydroelectric dam or a nuclear power plant.

The Missing Link

Piecemeal solutions to building such a Grid are already at hand. The term “middleware” describes the layer of software tools needed to extract processing power from different computers on the Internet without any fuss. The most popular middleware so far is the Globus toolkit, developed by Mr Foster’s group at Argonne in collaboration with Carl Kesselman’s team at the University of Southern California in Los Angeles.

The toolkit contains programs such as GRAM (Globus Resource Allocation Manager), which figures out how to convert a request for resources into commands that local computers can understand. Another tool, GSI (Grid Security Infrastructure), authenticates the user and works out that person’s access rights. One of the attractions of the Globus toolkit is that such tools can be introduced one at a time, and often painlessly, into existing software programs to make them increasingly “Grid-enabled”. Also, like the software behind the World Wide Web and the Linux operating system, the Globus toolkit is being made available by its creators under an “open-source” licensing agreement. This allows others to use the software freely, and to add any improvements they make to it. One example is Condor-G, an improved version of the Condor program that deals with the security and resource-management problems that occur when Condor is extended across institutional boundaries.
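In software-engineering terms, what GRAM does resembles an adapter: one site-independent request is translated into whatever each local scheduler expects. The sketch below illustrates only that pattern. The request fields, the site names, the adapter functions and the command text they produce are all hypothetical. They are not the Globus interface, nor the syntax of any real batch system.

```python
from dataclasses import dataclass


@dataclass
class ResourceRequest:
    """A site-independent description of a job (fields are illustrative)."""
    executable: str
    processors: int
    max_minutes: int


def to_site_a(req):
    # Hypothetical local scheduler that takes command-line options.
    return f"submit-a --cpus {req.processors} --walltime {req.max_minutes}m {req.executable}"


def to_site_b(req):
    # Hypothetical local scheduler that reads a "key = value" job description.
    return "\n".join([
        f"program = {req.executable}",
        f"cpus = {req.processors}",
        f"time_limit_seconds = {req.max_minutes * 60}",
    ])


# Registry mapping each participating site to its translator.
ADAPTERS = {"site-a": to_site_a, "site-b": to_site_b}


def translate(req, site):
    """Convert one generic request into the local job description for `site`."""
    adapter = ADAPTERS.get(site)
    if adapter is None:
        raise ValueError(f"no adapter registered for {site!r}")
    return adapter(req)


job = ResourceRequest(executable="/home/alice/simulate", processors=4, max_minutes=30)
for site in ADAPTERS:
    print(f"--- {site} ---")
    print(translate(job, site))
```

With this design the user writes the request once. Adding a new site means registering one more translator rather than changing the user's program.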

Daily use of the Globus toolkit has proved it to be a robust standard. But it is not the only one. Another so-called “world virtual computer” project, which aims to deliver high-performance parallel processing, has been under development at the University of Virginia since 1993 and already has many of the Globus features built into it.

Yet another is the Milan (Metacomputing in Large Asynchronous Networks) project. The goal of this joint effort between researchers at New York University and Arizona State University is to create virtual machines out of a non-dedicated, unpredictable network of standard computers. The latest stage of this project, called Computing Communities, aims to make the underlying middleware adjust automatically to the device that is being used to gain access to the Grid, be it a desktop computer or a mobile phone.

As well as competition from other academic projects, Globus faces the prospect of being overtaken by commercial solutions. For instance, the programming language Java, which allows a software developer to write a program (more or less) once and then run it on Windows, Linux, Macintosh or any flavour of the Unix operating system, already does many of the things that the Grid hopes to accomplish. Java has yet to be made to run on different types of supercomputer, and there are various security and local-policy issues that the language is not equipped to handle. But this could change.

Another example is Microsoft’s DCOM software, which offers many Grid-like features, although there is talk of abandoning it. However, given enough support, one or other of these options could be transformed into a de facto standard. Already, Microsoft is integrating some of the Globus technology into the next generation of its Windows operating system.

Still, Globus and its various alternatives face big hurdles on the way to becoming a true Grid. To avoid computing bottlenecks, developers will have to figure out ways to compensate for any failure that occurs on the Grid during a calculation — be it a transmission error or a PC crash.

Yet another headache is latency, the delays that build up as data are transmitted over the Internet. The speed of light sets a limit to how fast electronic (or, indeed, optical) signals can travel. It takes about a tenth of a second for light to travel halfway around the earth in an optical fibre, and about two-tenths for a request to get there and a reply to come back: an aeon for an impatient processor. Smart software is needed to ensure just-in-time data delivery. Otherwise, the range of problems that the Grid will be able to deal with will be confined to the so-called “embarrassingly parallel”.
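The arithmetic behind the delay is easy to check. The sketch below rests on three assumptions: the fibre has a refractive index of about 1.47, so signals travel at roughly two-thirds of the speed of light in a vacuum; the path follows a great circle, although real cable routes are longer; and, to express the wait in processor terms, the machine has a hypothetical 1GHz chip.

```python
C_VACUUM_KM_PER_S = 299_792.458   # speed of light in a vacuum
FIBRE_INDEX = 1.47                # assumed refractive index of the fibre
EARTH_CIRCUMFERENCE_KM = 40_075   # at the equator
CLOCK_HZ = 1e9                    # assumed 1GHz processor


def fibre_delay_s(distance_km, index=FIBRE_INDEX):
    """One-way propagation delay over a fibre of the given length."""
    return distance_km / (C_VACUUM_KM_PER_S / index)


one_way = fibre_delay_s(EARTH_CIRCUMFERENCE_KM / 2)  # halfway round the earth
round_trip = 2 * one_way                             # request out, reply back
idle_cycles = round_trip * CLOCK_HZ                  # cycles spent waiting for the reply

print(f"one way:    {one_way:.3f} s")
print(f"round trip: {round_trip:.3f} s")
print(f"idle clock cycles per round trip: {idle_cycles:,.0f}")
```

On these assumptions the one-way trip takes roughly a tenth of a second and the round trip roughly two-tenths. In that time the processor could have run through some 200m clock cycles. This is why software that waits for each answer before asking the next question fares so badly over long distances.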

Such computations are carried out on different machines that do not need to wait for results from one another to proceed. This is much simpler to organise than the typical types of parallel processing run on commodity clusters, where the calculations have to move in lockstep, sharing information at regular intervals. It is even more primitive when compared with such advanced supercomputers as IBM’s Blue Gene, in which constant communication between processors is the core concept.
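The difference between the two styles of computation lies in how the work is structured. In the sketch below, `score_candidate` is a stand-in for one independent task (a hypothetical, meaningless calculation), so the tasks can be farmed out in any order. `lockstep_diffusion` is a toy heat-diffusion model in which every cell needs its neighbours' values from the previous step. It runs on one machine here. On a cluster, each processor would have to finish a step and swap boundary values with its neighbours before any of them could start the next step.

```python
from concurrent.futures import ThreadPoolExecutor


def score_candidate(seed):
    # Stand-in for one independent job (say, scoring one drug candidate).
    # It needs nothing from any other job.
    x = seed
    for _ in range(1000):
        x = (x * 1103515245 + 12345) % 2**31
    return x % 100


def embarrassingly_parallel(seeds, workers=4):
    # Jobs can run in any order on any machine; results are merely collected.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_candidate, seeds))


def lockstep_diffusion(temps, steps):
    # Each new value depends on the neighbours' values from the previous step,
    # so no cell can move to step t+1 until every cell has finished step t.
    # The two end cells are held fixed, like heat sources at the boundaries.
    for _ in range(steps):
        temps = [temps[0]] + [
            (temps[i - 1] + temps[i] + temps[i + 1]) / 3
            for i in range(1, len(temps) - 1)
        ] + [temps[-1]]
    return temps


print(embarrassingly_parallel(range(8)))
print([round(t, 1) for t in lockstep_diffusion([100.0, 0.0, 0.0, 0.0, 0.0], steps=200)])
```

The first function would run just as well on thousands of scattered PCs. The second settles towards a straight-line temperature profile only if every step is completed everywhere before the next begins. Spread across the Internet, each of those steps would wait on network latency.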

Challenging as this technical issue may be, more mundane problems could be a greater nuisance. Much to the chagrin of Grid purists, the system will probably have to include means for conducting virtual brokerage of computer power. This is going to be needed for accounting purposes, especially when commercial applications are involved. At the first Global Grid Forum in Amsterdam last March, Bob Aiken, a manager at Cisco Systems in San Jose, California, warned that the biggest challenges to the successful deployment of the Grid will be social and political rather than technical. Several academics have already tried to devise solutions to this problem — by incorporating some of the business tricks adopted by the peer-to-peer companies. But until there are large applications running on the Grid, such proposals remain literally academic.

Debugged by Science

As with the Internet, scientific computing will be the first to benefit from the Grid — and the first to have to deal with the Grid’s teething problems. For instance, GriPhyN is a Grid being developed by a consortium of American laboratories for physics projects. One such study aims to analyse the enormous amounts of data logged during digital surveys of the whole sky using large telescopes. The Earth System Grid is part of another American academic initiative. In this case, the object is to make huge climate simulations spanning hundreds of years, and then analyse the massive banks of data that result. Other initiatives include an Earthquake Engineering Simulation Grid, a Particle Physics Data Grid, and an Information Power Grid Project supported by NASA for massive engineering calculations.

Perhaps the most urgent need for a Grid solution is at CERN, the European high-energy physics laboratory near Geneva. It is here, beneath the green fields straddling the French border, that the next-generation Large Hadron Collider (LHC) will produce data at unheard-of rates when it starts running in 2005. The particle collisions in the LHC’s underground ring will spew out petabytes (billions of megabytes) of data per second, enough to fill all the hard drives in the world within days.

Careful filtering will slow this prodigious rate down to a few petabytes a year. Even so, this is still too much for CERN to deal with on site. The data will therefore be disseminated through a four-tiered system of computer centres, allowing them to be spread over thousands of individual workstations in dozens of universities around the world.

Here the Grid will be in its element. Repeatedly searching and retrieving large data sets from thousands of computers would be extremely inefficient. So researchers at CERN are looking at ways of processing the data where they are, allowing a physicist sitting at any participating workstation to talk to the rest of the complex web of computers and data storage systems as if they were in his own laboratory.

This raises a radical new concept of “virtual data”. The idea is that new data, derived from the analysis of raw data, should not be discarded after use but saved for retrieval, in the same way that raw data are stored. In a later calculation, the Grid’s software could then automatically detect whether a computation had to be started from scratch, or could take a short cut using a previous result stored somewhere in the system.
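In programming terms, virtual data amount to memoisation on a grand scale. Each derived data set is filed under the recipe that produced it, so a later request for the same recipe can be answered from storage. The sketch below is a single-machine illustration. The `VirtualDataCatalog` class, its methods and the toy `select_events` filter are hypothetical, and a real Grid would have to search catalogues spread across many sites.

```python
import hashlib
import json


class VirtualDataCatalog:
    """Hypothetical catalogue that stores derived data under the recipe that made it."""

    def __init__(self):
        self._store = {}
        self.computations_run = 0

    @staticmethod
    def _key(transform_name, params, input_ids):
        # A derived data set is identified by its recipe: the transformation,
        # its parameters and the names of the (unchanging) raw inputs.
        # Input order is assumed not to matter, so the names are sorted.
        recipe = json.dumps(
            {"transform": transform_name, "params": params, "inputs": sorted(input_ids)},
            sort_keys=True,
        )
        return hashlib.sha256(recipe.encode()).hexdigest()

    def derive(self, transform_name, func, params, inputs):
        """Return the stored result if this recipe has run before; otherwise compute it."""
        key = self._key(transform_name, params, inputs.keys())
        if key not in self._store:
            self.computations_run += 1
            self._store[key] = func(inputs, **params)
        return self._store[key]


def select_events(inputs, min_energy):
    # Toy analysis step: keep every recorded energy above a threshold.
    return [e for dataset in inputs.values() for e in dataset if e >= min_energy]


catalog = VirtualDataCatalog()
raw = {"run-1": [5.0, 42.0, 17.0], "run-2": [99.0, 3.0]}
first = catalog.derive("select_events", select_events, {"min_energy": 10.0}, raw)
again = catalog.derive("select_events", select_events, {"min_energy": 10.0}, raw)
print(first, again, catalog.computations_run)
```

The second request is answered without recomputation because its recipe matches the first. Changing the threshold produces a new recipe and hence a fresh calculation.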

The European response to all the Grid activity in America is the DataGrid project co-ordinated by CERN. Here, the objective is to develop middleware for research projects in the biological sciences, earth observation and high-energy physics. As befits a project overseen by bureaucrats in Brussels, the search for an overarching Grid standard for many different science projects has become a leitmotif. Fabrizio Gagliardi, the project manager for DataGrid, despairs at the many Grid initiatives already underway. “If we each develop similar solutions,” he says, “it simply won’t work.”

A plethora of Grid standards is a real possibility. After all, even electricity has no worldwide standard of voltage or frequency. Much of the talk at the first Global Grid Forum was about giving Grid development a sense of common cause. But once the commercial potential of the Grid begins to dawn, standard-setting skirmishes will break out between companies and even countries.

While scientists gear up to use the Grid, the question remains whether — beyond the search for such exotica as Higgs bosons, running vast protein-folding calculations or simulating the weather — there is any real need for such massive computing. Cynics reckon that the Grid is merely an excuse by computer scientists to milk the political system for more research grants so they can write yet more lines of useless code. There is some truth in this. Many of the Grid projects running today resemble solutions in search of problems. And given the number of independent Grid initiatives, much of the work under way is going to be redundant in any case. But there is always the hope that the competition will breed a better class of Grid in the end.

A more serious criticism is that today’s Grid projects focus too much on storing and analysing large data sets, and making the sort of “embarrassingly parallel” calculations that only scientists need. These are obvious applications of the Grid. The real challenge is to figure out how to achieve the more ambitious goals that Grid enthusiasts have set themselves. Some of these are outlined by Mr Foster and colleagues in a paper that will appear shortly in the International Journal of Supercomputer Applications. The authors argue that the Grid will really come into its own only when people learn to build “virtual organisations”. Such a virtual organisation could be a crisis-management team dealing with an earthquake or chemical spill. In such circumstances, the Grid would be ideal for analysing local weather, soil models, water supplies and local demographics. It could even help with communications, allowing field workers to discuss problems with office staff by means of video conferencing.

In another guise, the virtual organisation could be an industrial consortium developing, say, a passenger jet. The Grid would allow the consortium to run simulations of various combinations of components from different manufacturers, while keeping the proprietary know-how associated with each component concealed from other consortium members. In both cases, several unrelated types of calculation, using independent data sets, and running on a range of computers in different organisations that may not fully trust one another, have to be threaded together to achieve a coherent answer.

It is precisely because such undertakings would be a nightmare to negotiate over the Internet that the Grid has come into being. Unlike neatly defined science problems, which amaze more by their petabyte scale than their inherent complexity, virtual organisations are noteworthy because of their constantly shifting landscape of data and computer resources, and the authentication on which they rely. This is where the Grid would have its greatest value. It is such problems that could lead to new business models — just as the Internet created the conditions for e-commerce. Unfortunately, this is where the Grid’s developers have barely scratched the surface.

Timing Is All

The idea of distributed computing is as old as electronic computing. When devised more than 35 years ago, Multics — a multi-tasking operating system for mainframes that was finally retired last year, and is a distant ancestor of today’s Linux — had many of the Grid’s goals in its original mission statement (operation analogous to power services, system configurations that were changeable without having to reorganise the software, and so on). Other Grid precursors abound. The point is that the enthusiasm for the Grid today is not the result of some fresh and sudden insight into computer architecture. It is simply that the hardware has now improved enough to make an old idea work on a global scale.

Two hardware developments are bringing the Grid within reach. One is the rapid increase in network speeds. Today, a normal modem connection in the home has the same data-carrying capacity as the backbone of the Internet had in 1986. The latest version of the Internet protocol (IPv6) used for sending data from one computer to another will make it easier to standardise video conferencing and remote operation on other computers — core components of a Grid service.

Hardware providers are already anticipating the needs of the Grid in their planning. One example is Géant, a European network with transmission speeds measured in gigabits per second that is being touted by a company called Dante in Cambridge, England. Such data networks will be the equivalent of the high-voltage power cables that criss-cross countries.

The second factor that is helping to make the Grid possible is the growth in power of individual microprocessors. This continues to follow Moore’s law, with processors doubling in power every 18 months. Today, even the humblest PC has enough spare processing power and storage capacity to handle the extra software baggage needed to run Grid applications locally. This is crucial, because no matter how clever the Grid may be, large amounts of computer code will still have to be transported to and from individual processors so that they have the tools to deal with any unpredictable task.
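What an 18-month doubling period implies over longer spans follows from simple compounding. The helper below evaluates 2^(months/18). The doubling period is the one quoted above, and the function name is only illustrative.

```python
def moore_growth(months, doubling_months=18):
    """Growth factor in processing power if it doubles every `doubling_months` months."""
    return 2 ** (months / doubling_months)


for years in (1.5, 5, 10):
    print(f"{years:>4} years: about {moore_growth(years * 12):.0f} times")
```

Sustained for a decade, the trend multiplies processing power roughly a hundredfold. That is why yesterday's supercomputer-class spare capacity now sits under ordinary desks.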

When the Grid really takes off, it will render obsolete much of the computing world as it is today. Supercomputers will be the first to feel the pressure — much as networked PCs consigned mainframe computers to the basement. Ignore for a moment the pronouncements by industry leaders and government steering committees. The best thing about the Grid is that it is unstoppable: it is just too good an idea to remain dormant now that most of the enabling technology is in place.

Like the Internet, the Linux operating system and countless other open-source endeavours before it, the key breakthrough that will make the Grid an everyday tool will doubtless come not from some committee in Brussels, Geneva or Washington, but from some renegade programmer in Helsinki or Honolulu. When? It could be any time during the next decade — perhaps even next week.