Data warehousing, a cluster of technologies that can deliver dazzling insights to decision makers and chronic headaches to IT staffers, has found a fertile new field of application: the World Wide Web. A small but growing number of companies are using “webhousing” to analyze the enormous volumes of traffic streaming daily through their Web sites.
Some of the most familiar names in data warehousing and analytic software — Hyperion, IBM, Oracle, Sagent, and SAS Institute, to name five — are applying their expertise to this burgeoning market. They’ve been joined by fledgling software providers that offer Web server log analysis, such as WebTrends and NetGenesis; and by CRM (customer relationship management) and E-commerce software vendors, such as BroadVision. As for the customers, dot-coms and media companies, such as CDNow, AutoTrader.com, and The New York Times Co., are leading the way.
What are these companies looking for? In the brick-and-mortar world, data warehousing analyzes trends in sales data and develops customer profiles. Transactions and profiles are important in cyberspace, too, but users of webhousing (sometimes referred to as clickstream data warehousing) are especially interested in patterns of online behavior.
They start with path analysis — how surfers navigate sites. Which pages do visitors choose, and which do they linger on longest? Web designers can use this information to redesign sites for maximum utility. Many companies are now advertising online; how can they measure the effectiveness of that investment? Through path and click-through analysis, E-marketers can determine which banner ads deliver the most bang for the buck. They can identify the best sites, and the best pages on those sites, for placing those ads.
Webhousing can calculate the value of affiliations — which portals and search engines deliver the best customers. And it can measure newfangled Internet metrics such as page views, click-throughs, and “stickiness” (the length of time surfers stay on a site). The new breed of webhousing tools, in short, makes online ventures “feel less like a shot in the dark,” says Rick Ratliff, director of the new media division at Detroit Newspapers.
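Once visits have been reconstructed, metrics like these fall out of simple aggregation. A minimal sketch in Python (the session records and field names here are hypothetical, not any vendor's actual schema):

```python
from datetime import datetime

# Hypothetical sessionized records: pages viewed, banner-ad clicks,
# and start/end timestamps for each visit.
sessions = [
    {"pages": 6, "ad_clicks": 1,
     "start": datetime(1999, 11, 1, 9, 0), "end": datetime(1999, 11, 1, 9, 12)},
    {"pages": 3, "ad_clicks": 0,
     "start": datetime(1999, 11, 1, 9, 5), "end": datetime(1999, 11, 1, 9, 8)},
]

page_views = sum(s["pages"] for s in sessions)          # total page views
click_throughs = sum(s["ad_clicks"] for s in sessions)  # banner-ad clicks
# "Stickiness": average time a visitor stays on the site, in minutes.
stickiness = sum(
    (s["end"] - s["start"]).total_seconds() / 60 for s in sessions
) / len(sessions)

print(page_views, click_throughs, stickiness)  # 9 1 7.5
```

The hard part, as the vendors point out, is not the arithmetic but producing clean session records in the first place.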
Up Close and Personal
Companies like Detroit Newspapers focus on the online behavior of groups, not individuals, to determine content affinities. But when a Web visitor registers or buys something, a profile can be developed, and personalization becomes possible. “You can redecorate your Web site according to customer preferences,” says Michael Howard, vice president of the data warehouse program office at Oracle Corp., in Redwood Shores, California. Knowing the customer will make cross-selling and upselling feasible, and enable an E-business to tailor content and prices to its most valuable customers.
Online shoppers abandon their carts more than 50 percent of the time, according to Daniel Druker, general manager of Hyperion Solutions Corp.’s new E-Business Division. Why not send an E-mail to the would-be shopper and offer him a discount? Webhousing makes this doable. Admittedly, says Druker, some shoppers may be less thrilled by the savings than chilled by the surveillance. But “on the flip side, people are willing to have their shopping experience improved,” he notes. Druker believes that concerns over privacy will abate as people become more used to the Internet “as a pervasive medium.”
The benefits of personalization can extend to the business-to-business sphere: which products does a corporate customer typically buy via the Web? Meanwhile, the same tools that optimize external Web sites can be applied to internal sites, points out Howard; intranets and portals can be made more user-friendly for knowledge workers. “Your whole company becomes that much more intelligent,” he says.
That includes competitive intelligence. Analyzing Web-site traffic, companies can spot visitors from competing companies and identify the pages they view. According to an industry source, one technology company with a sizable Internet presence can reasonably guess when a rival is preparing to launch or upgrade a particular product. How? By observing the uptick in traffic from the rival’s domain to the Web pages for the company’s own, corresponding product.
The essential trick of webhousing is to retrace the paths taken by individual Web visitors — to “sessionize” the raw data downloaded from Web servers. Like walking on one’s hands across a football field, this is a straightforward task in theory, but difficult in practice.
Here’s why. A Web server records hits, or requests for data, in a log. Clicking on a home page may result in five hits — four images plus HTML text — and five rows of log data. Clicking on another page on the site might result in another 10 hits. According to IBM, the average number of hits per page view is five, and the average number of page views per visit is also five. That’s bad enough.
It gets much, much worse. If, say, nine other surfers are using the Web site at the same time, the record of those original five hits is interspersed with data for the other visitors; one might have to comb through 200 or 300 pages of server log data to reconstruct the single page view of one visitor. Multiply these kinds of numbers by the amount of traffic a busy Web site gets, and the result is “a combinatorial explosion,” says Lou Agosta, senior industry analyst at Giga Information Group and author of The Essential Guide to Data Warehousing (Prentice Hall).
“If you were to capture every page a visitor clicks on, you’d get potentially billions of hits a day,” says Agosta. “Certainly millions, if you’re a moderately busy site.”
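The sessionizing step Agosta describes can be sketched in a few lines: group interleaved log rows by visitor, then split each visitor's hits wherever a long gap suggests a new visit. A simplified illustration, assuming visitors can be keyed by IP address and that 30 minutes of inactivity ends a session (the log format and timeout are assumptions for the example):

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that ends a session

def sessionize(log_rows):
    """Group raw hit rows (ip, timestamp, url) into per-visitor sessions.

    Rows from different visitors arrive interleaved in the server log;
    we sort each visitor's hits by time and split on long gaps.
    """
    by_ip = defaultdict(list)
    for ip, ts, url in log_rows:
        by_ip[ip].append((ts, url))

    sessions = []
    for ip, hits in by_ip.items():
        hits.sort()
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > SESSION_TIMEOUT:
                sessions.append((ip, current))
                current = []
            current.append(hit)
        sessions.append((ip, current))
    return sessions

# Interleaved log: two visitors, one of whom returns over an hour later.
log = [
    ("1.1.1.1", 0, "/"), ("2.2.2.2", 5, "/"), ("1.1.1.1", 60, "/sports"),
    ("2.2.2.2", 90, "/autos"), ("1.1.1.1", 4000, "/"),  # gap > 30 min: new session
]
print(len(sessionize(log)))  # 3 sessions
```

Real clickstream tools must also cope with proxies that share one IP address, cached pages that never hit the server, and log volumes far beyond what fits in memory, which is where the heavy sorting hardware comes in.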
Webhousing technology filters these enormous server logs into manageable — and meaningful — information. “You can’t restitch that data; you have to reconstruct the visit,” explains John Payne, solutions executive for IBM Corp.’s SurfAid Analytics, a Dallas-based outsourcing service. SurfAid reconstructs Web-site traffic with the help of proprietary software and IBM RS/6000 computers, powerful multiprocessor machines that perform massive sorting routines on log data.
Some companies may decide to focus only on customers who can be identified, thus reducing the size of the filtering task. And they may simplify matters further by making a “judicious selection” of pages visited, says Agosta, such as product pages, customer registration pages, and FAQ pages. (Webhousing data can also come from cookies and registration forms.)
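Agosta's "judicious selection" amounts to a filter pass before the data reaches the warehouse. A hedged sketch (the URL prefixes are hypothetical):

```python
# Hypothetical URL prefixes for the pages worth warehousing.
KEY_PREFIXES = ("/products/", "/register", "/faq")

def worth_keeping(url):
    """Keep only product, registration, and FAQ page views."""
    return url.startswith(KEY_PREFIXES)

hits = ["/products/widget", "/images/logo.gif", "/faq/shipping", "/about"]
print([u for u in hits if worth_keeping(u)])  # ['/products/widget', '/faq/shipping']
```

Dropping image requests and low-value pages up front can shrink the sorting problem by an order of magnitude before any analysis begins.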
SurfAid sorts Web-site visitors by Internet protocol (IP) address, but it also offers technology to identify visitors if so desired. For set-up costs and a base monthly fee of $750, companies can transmit their daily Web server logs to SurfAid, which stores the reconstructed data. The service gives customers passwords to gain access to their data via SurfAid’s extranet, and provides online data-mining tools for analysis.
Power of the Press
One SurfAid customer is the new media division of Detroit Newspapers, a joint operating agency for the Detroit Free Press and the Detroit News. Each newspaper Web site generates between 7 million and 9 million page views per month. “What do we want to find out? What paths people are following through the sites,” says Detroit Newspapers’ Ratliff. “From that, it’s often possible to draw some conclusions about what their interests are.”
For instance, a large group of readers may go only to the Food section, bypassing the front page. “This suggests that this is a target audience that a Procter & Gamble or Kraft Foods might want to know about,” says Ratliff.
Detroit Newspapers began working with SurfAid this past May. Previously, it had used an off-the-shelf product designed to analyze Web server logs, but Ratliff says the software “took a long time to crunch data, then only showed it to you in a single format. You had to recrunch the data to get other information. We were left in the dark.” By contrast, SurfAid created layers of information, allowing users to move easily between different views and identify traffic patterns, including the 10 most common ones. “It was like a veil being lifted,” says Ratliff.
During one week in October, visitors to the Detroit News Web site viewed 2.2 million pages, including 210,000 in the Features section and 743,000 in Sports; by contrast, the home page was viewed 264,000 times. Only 6 percent of all visitors looked at the Auto pages, but a subset of those visitors can be identified as car buffs, based on the relative amount of time they spent on those pages. “We can try to improve their [ad] click-through rate,” says Ratliff. Their site paths can be isolated, and auto ads can be placed accordingly.
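Segmenting "car buffs" by relative dwell time can be sketched like so (the per-visitor records and the 50-percent threshold are illustrative assumptions, not the paper's actual rule):

```python
# Hypothetical per-visitor totals: seconds spent on Auto pages vs. the whole site.
visitors = {
    "v1": {"auto_secs": 600, "site_secs": 900},   # two-thirds of visit on Auto
    "v2": {"auto_secs": 30,  "site_secs": 1200},  # barely glanced at Auto
    "v3": {"auto_secs": 300, "site_secs": 500},
}

# Flag visitors who spend a disproportionate share of their time on Auto pages.
CAR_BUFF_SHARE = 0.5

car_buffs = [
    v for v, t in visitors.items()
    if t["auto_secs"] / t["site_secs"] >= CAR_BUFF_SHARE
]
print(car_buffs)  # ['v1', 'v3']
```

The point is that a small section by raw traffic can still contain a well-defined, ad-worthy audience once time spent is factored in.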
Examining Web traffic patterns by day, week, month, or overall, the two Detroit newspapers now have a tool for targeting ads, and for persuading advertisers to advertise. Their editors can get direct feedback on readers’ preferences.
And then there’s the kind of discovery that only second sight or business intelligence software makes possible. According to Ratliff, the parent company of one of the newspapers had assumed that a section of the Web site would be visited primarily by children, and therefore was a waste of time; few advertisers would want to spend much money on that demographic group. However, says Ratliff, “we discovered this section was getting more impressions than the Business section” — and most of the visitors were coming from corporate domains. The section received a last-minute reprieve, and the paper started to target ads for it.
One Billion Impressions
Detroit Newspapers uses Web-site analysis to help it sell online ad space. Autoweb.com, a leading consumer automotive Internet service based in Santa Clara, California, uses webhousing to help it do a better job of buying online advertising, as well as attracting ads and corporate sponsorships.
Autoweb.com makes money, if not yet profits, by “monetizing the visitor,” says CFO Thomas Stone. (“We could be making money if we needed to,” comments Stone. “But do you become profitable and lose out to your competition?”) This means, principally, forwarding visitors’ purchase requests to Autoweb.com’s network of more than 5,000 dealers, for a dealer-paid referral fee. The dealers have signed agreements to treat these requests “in a certain way,” says Stone. That means, basically, offering a fair price without hassling customers.
But not everyone comes to the site to buy a car. Many are seeking information on models and prices, which Autoweb.com provides. The company monetizes those visitors by offering their eyeballs to advertisers. Autoweb.com also teams with third parties to offer financing, insurance, and warranties. It even hosts auctions. “We have a very robust revenue model,” says Stone.
Still, Autoweb.com’s ambitions will be for naught if it can’t make its existence known to the world. To lure customers, it buys ads on Internet portals, such as Yahoo Inc. and America Online Inc. Autoweb.com has a “tremendous investment in online advertising,” says Stone. “The number of impressions we have on Yahoo and AOL is mind-boggling” — more than a billion to date.
To understand that investment, Autoweb.com relies on three webhousing technologies. AOL and Yahoo use software from NetGenesis to collect and analyze Web log data relevant to Autoweb.com ads; Stone can access that data via the portals’ extranets. He can also use NetGravity software, which the portals make available on their extranets for administering advertising campaigns. The company downloads portal data into its Hyperion Essbase data mart, along with data collected locally, for analysis.
With this setup, Stone can monitor the company’s ad campaigns running throughout the portal properties — calculating the click-through rate for ads in various venues, and shifting ad placements appropriately. The company also needs to be able to quickly recalculate the cost of those shifts, since placements vary in price according to the portal venue. (Click-throughs for car ads are assumed to be more valuable on a car-buying site, since the visitor is presumably more serious about the product.)
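The calculation Stone describes reduces to a few lines: click-through rate per venue, weighed against what each placement costs. A sketch with made-up figures (all numbers hypothetical):

```python
# Hypothetical placements: impressions, clicks, and monthly cost per venue.
placements = {
    "yahoo_autos":  {"impressions": 2_000_000, "clicks": 10_000, "cost": 40_000},
    "aol_homepage": {"impressions": 5_000_000, "clicks": 12_500, "cost": 75_000},
}

for venue, p in placements.items():
    ctr = p["clicks"] / p["impressions"]      # click-through rate
    cost_per_click = p["cost"] / p["clicks"]  # dollars per click-through
    print(f"{venue}: CTR {ctr:.2%}, ${cost_per_click:.2f} per click-through")

# yahoo_autos: CTR 0.50%, $4.00 per click-through
# aol_homepage: CTR 0.25%, $6.00 per click-through
```

With cost per click-through in hand, shifting spend between venues becomes a comparison rather than a guess, which is the point of feeding portal data into the data mart.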
In the end, says Stone, webhousing not only improves Autoweb.com’s return on investment from advertising, it helps the company better manage its relationships with Yahoo and AOL. “We’re very much partnerships,” he points out. “We need to understand our online ad placements at a detailed, powerful level.”
That kind of understanding doesn’t come cheap. Consultant Agosta says a company can easily sink half a million dollars into a clickstream data warehouse, once the costs of software, hardware, consultants, and integration work are added up. Companies should insist on some sort of proof of concept from vendors, such as a prototype warehouse with sample log-data analysis, before committing themselves.
After all, conventional data warehousing has been plagued over the years by project failures, and there’s no reason to think that webhousing will be immune. Rumor has it that at least one E-business has pulled the plug on a disappointing webhouse. Apparently, even dot-coms know when it’s time to cut their losses.
The prevalence of Web traffic analysis has alarmed privacy-rights advocates. Last month, several groups asked the Federal Trade Commission to investigate online profiling, as conducted by Internet ad companies such as DoubleClick, Engage, and 24/7 Media. These firms use IP addresses and cookies to identify surfers and track their movements from site to site. Many, if not most, surfers are unaware they are being followed. “There’s nothing wrong with the goal of trying to figure out what a market is,” says Andrew Shen, policy analyst at the Electronic Privacy Information Center (EPIC), in Washington, D.C. “What’s wrong is the means.”
Acknowledging the outcry, 10 leading online ad companies, including the three named above, announced they would form a self-regulatory group and work to give surfers the ability to opt out of profiling activity. But that probably won’t be enough to allay the concerns of groups like EPIC, which calls for legislation to establish fair information practices in cyberspace, according to Shen.
Autoweb.com CFO Thomas Stone sees a potentially vast benefit from online profiling, but doesn’t think the practice should override consumer choice. “In every interaction [on the Web], consumers must be given the choice of whether and how to participate,” says Stone. “At the end of the day, our business model — and all successful business-to-consumer commerce on the Internet, in fact — is about leveraging an empowered consumer.”