Data Center Software: Progress Without Profits

“When my hair gets long, I kind of look like him.” Matei Zaharia jokingly evades the question about what he thinks of being compared to Bill Gates. But the 30-year-old Romanian-Canadian computer scientist is indeed reminiscent of Microsoft’s former boss in his early days: he is considered one of the most brilliant geeks of his generation; he has developed an exciting new technology, called Spark, to crunch data; and he is one of the founders of a promising startup, Databricks.

Yet in an important way the two men are different: Zaharia has no interest in making billions. After spending two years helping get Databricks off the ground, he has recently reduced his involvement with the firm and become a professor at the Massachusetts Institute of Technology. “I like to work on long-term and risky projects — something you can do at university, but not at startups,” he explains.

This combination of progress without profits makes Zaharia a poster child for a branch of the IT industry that is crucial to many websites and apps but gets much less attention than the latest smartphone or the next social-media sensation. Databricks and a number of other startups provide software that makes data centers run more efficiently and lets them handle vast amounts of data.

The operators of the biggest data centers, such as Amazon and Facebook, already have this sort of software installed. But the next stage, which the startups are concentrating on, is to make it sufficiently user-friendly for non-tech businesses. The sector is improving rapidly, but may never make anyone filthy rich, even those who are keener on money than Zaharia.

The model for this sort of software was virtualization, the idea of splitting a computer into several “virtual machines,” each with its own operating system and programs. Originally developed for mainframe computers, virtualization became popular in the late 2000s as a way of making corporate data centers more efficient by spreading work around servers that were being under-used. The company that pioneered this, VMware, has grown rapidly.

A startup called Docker is now seeking the same sort of success with “containerization.” It slices big and complex online applications into more manageable parts, which can be handled separately, meaning that small teams of programmers can focus on improving the code in one container. Upgrades can be installed at any time without the need to wait until a new version of the entire application is ready.

Since containers make developers much more productive, Docker’s software has proved hugely popular. The firm claims there have been more than 800 million downloads since the program was first released in March 2013. Some of Docker’s customers have already made it an important part of their software supply chain. Gilt, an e-commerce site, used containers to cut seven big applications into 400 smaller ones.

“Orchestration” is perhaps the most important addition to this class of software. Whereas virtualization carves up one computer into many, orchestration makes a bunch of machines (a “cluster”) look like one big computer by moving containers around between them. In July Google made public its version of the technology, called Kubernetes, so others can use it. CoreOS, another startup, has added a version to its software package. Mesosphere, which makes an operating system for data centers, has integrated Kubernetes into it.
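The idea of treating a cluster as one big computer can be illustrated with a toy scheduler. This is a minimal sketch in plain Python, not the way Kubernetes or any real orchestrator works: the `Machine` class, the `place` and `drain` functions, and the abstract "capacity units" are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """One server in the cluster (hypothetical model, for illustration only)."""
    name: str
    capacity: int  # abstract resource units, an assumption of this sketch
    containers: dict = field(default_factory=dict)  # container name -> units used

    def free(self) -> int:
        return self.capacity - sum(self.containers.values())


def place(cluster, container, units):
    """Put a container on the machine that currently has the most free capacity."""
    target = max(cluster, key=lambda m: m.free())
    if target.free() < units:
        raise RuntimeError(f"no machine can fit {container}")
    target.containers[container] = units
    return target.name


def drain(cluster, machine_name):
    """Move every container off one machine, as if it were being taken offline."""
    leaving = next(m for m in cluster if m.name == machine_name)
    rest = [m for m in cluster if m.name != machine_name]
    moved = {}
    for container, units in list(leaving.containers.items()):
        moved[container] = place(rest, container, units)
        del leaving.containers[container]
    return moved


cluster = [Machine("a", 4), Machine("b", 4), Machine("c", 4)]
placements = {name: place(cluster, name, 2) for name in ["web", "cart", "search"]}
moved = drain(cluster, "a")
print(placements)
print(moved)
```

The caller asks only for a container to be run; which machine it lands on, and where it goes when a machine drops out, is decided by the scheduler. Real orchestrators weigh many more factors than free capacity.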

Firms need help not only with running and updating their applications but also with managing their growing piles of data. This is the remit of Hadoop, a data-storage program with an accompanying set of number-crunching tools. The package also has its roots at Google, whose published research inspired it, but is now marketed by firms such as Cloudera and Hortonworks. The software lets companies create and analyze “data lakes,” vast repositories for all kinds of information.

Sifting through these digital waters is often slow, which is why Databricks’ Zaharia developed Spark, a sort of spreadsheet for big piles of data. It allows these to be handled in real time as the information comes in, for instance, from websites and sensors. Although it is only a few years old, Spark has already attracted a following of hundreds of developers and users. In June IBM announced that it would put its weight behind the software.
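What it means to handle data as it comes in, rather than sifting through a stored pile afterwards, can be shown with a small example. The sketch below is plain Python and does not use Spark or its API; the `RunningAverage` class and the sensor names and readings are invented for illustration.

```python
from collections import defaultdict


class RunningAverage:
    """Keeps a per-source average that is updated as each event arrives."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def update(self, source, value):
        # Only two numbers per source are kept, so no event needs to be
        # stored or re-read to produce an up-to-date answer.
        self.count[source] += 1
        self.total[source] += value
        return self.total[source] / self.count[source]


# Hypothetical readings arriving one at a time from two sensors.
events = [("sensor-1", 20.0), ("sensor-2", 30.0), ("sensor-1", 22.0)]

stats = RunningAverage()
latest = {}
for source, value in events:
    latest[source] = stats.update(source, value)

print(latest)
```

Each answer is available the moment an event arrives, whereas a batch approach would wait until all the data had been collected and then scan it from the start.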

Their popularity notwithstanding, it is not clear whether these startups will ever become good businesses. In contrast to software firms in the past, they will not make money by selling copies of their programs. In most cases the programs are “open source”: their creators publish the source code, so anybody can work with it. They do so out of a mixture of altruism and a belief that the product, and thus the market for it, will develop more quickly if people are free to collaborate on it.

Instead, the startups are looking to make a living by charging for add-ons of various sorts. Hortonworks offers subscriptions for things such as troubleshooting and updates. Docker and CoreOS are planning to charge for management tools. Databricks has turned Spark into a set of web-based services which, for instance, allow subscribers to visualize data.

Even so, it will be tough to survive in competition with giants like Amazon and Microsoft, which are offering comprehensive cloud-computing services. What is more, many potential customers may prefer to get all the pieces from one firm rather than stitching together software from startups, says Simon Crosby, a virtualization veteran who works for Bromium, a provider of online-security software.

The current plethora of data-center-software startups is likely to shrink as they run out of venture capital. Some will be gobbled up by established software firms such as VMware and Red Hat. But the startups may be remembered fondly by data-center managers for having made computing cheaper, faster, and more flexible.