Data Operations

The Key to Scalable Insights in Digital Publishing

Digital Media has undergone a fundamental shift where data now drives our understanding of inventory, demand, and profitability. This has put a premium on new tools and techniques to measure performance. Yet data complexity, volume, and engineering challenges continue to block even the most intrepid business teams. The emerging science of Data Operations (DataOps) has been evolving to meet this challenge, and publishers now stand at the crossroads of data and efficiency. Either you master your data so your business can evolve based on quantified insight, or your data will keep your teams in an ever-lengthening cycle of manual reporting.

This paper was written by the founding leaders of Google BigQuery (the cloud-based enterprise data warehouse used by many of the Fortune 500), who are now co-founders of Switchboard Software. With our extended team at Switchboard, we bring a wealth of experience building massively scalable and fault-tolerant data platforms. We’ve also worked with the largest publishers and agencies to implement real-time decision systems using cost-effective big data platforms, such as BigQuery and Amazon RedShift. This paper was created to help digital media professionals learn:

  • How Data Operations can unify and automate the measurement of monetization, inventory, audience reach and content performance in real time
  • The most important data sources and formats publishers must tie together to create timely and accurate business KPIs
  • The best technologies to help you consolidate disparate data in a scalable manner, and the strengths and weaknesses of these approaches

Access to growing amounts of data should be an asset, not a liability. Publishers have an unprecedented opportunity to grow their business by taking a scalable, automated approach to managing their data. We hope you find this paper helpful and we welcome your thoughts and feedback at sales@switchboard-software.com.

New Data Operations Challenges in Digital Media

The data landscape for publishers is becoming increasingly complex. Revenue is no longer driven by a single DFP instance managed by a lean AdOps team. Programmatic sales, mobile, and partner networks provide new revenue channels, but they fragment your view of inventory, revenue and yield. Advertisers may be willing to pay a premium to reach specific audiences, but only if they can be assured that those demographic groups are viewing their campaigns.

Revenue and AdOps teams are often burdened with requests to pull raw data manually and stitch that data painstakingly together into one-off reports. The intense pressure caused by the need for operational visibility creates a dangerous stumbling block. While at Google, we worked with some of the world’s largest publishers and ad agencies to enable real-time insights from a rich mix of media data. We saw firsthand how difficult these data challenges can be.

Forward-thinking publishers know they need timely data insights to effectively grow viewer loyalty, satisfy advertisers and optimize yield. Publishers that do not use data effectively today risk being outflanked by more nimble competitors tomorrow, or drowning in a sea of disparate data—unless they have the right strategy and the right tools.

A scalable data operations solution delivers continuous visibility into critical KPIs like Sell-Through Rates, Campaign Delivery, and User Growth. These insights help you become aware of emerging trends, rapidly understand their causes, and use those insights to create new opportunities for growth and profitability. So, how can you use data operations to empower your publishing business?

The Path to Real-time Insights

Data Operations is an approach that combines processes for collaborative data management, tools for automation and monitoring, and a scalable architecture to ensure that data growth is an asset, not a liability. Take a quick look at the sidebar “What is Data Operations” to understand what we mean by this term.

Data Operations alleviates the manual efforts that sap the productivity of your data analysts. While you may have in-house talent to build your own data automation, you risk being burdened with the overhead of monitoring and maintaining proprietary infrastructure, and keeping up with changes to third party APIs and data formats. Wouldn’t it be better instead to outsource this highly-specialized task and let key employees focus on value-added projects that individualize your business?

There are four fundamental steps to realizing the benefits of Data Operations:

  1. Identify: KPIs that will help you measure and improve performance
  2. Normalize: Extract and refine Raw data sources into trusted “Foundational Data”
  3. Transform: Join and enrich Foundational Data to create distinct KPIs
  4. Automate: Choose the right tools to automate data operations and real-time reporting

Step 1: Identify Your KPIs

A Data Operations approach aims to provide value by enabling organizational data collaboration. Unfortunately, in their rush to set up metrics, organizations will often base their choice of measurements on whatever data is available or is easiest to present. This inevitably results in arduous, manual work when it comes time to provide higher-level insights. Developing a Data Operations strategy starts not with technology, but with organizational communication. Your initial goal should be to identify, at a high level, which metrics you want to measure. At the outset, it’s essential that you target questions that will yield the insights you need to drive key decisions. We’ve noted below some questions and KPIs common to many publishers.

COMMON BUSINESS QUESTIONS

Core business questions, such as:

  • Where are we likely to have sell-out or unfilled inventory over the next 30 days?
  • How are monetization rates changing across our verticals over time?
  • Which programmatic deals bring us the most revenue and highest eCPM?

Next-level questions, such as:

  • How can we ensure the most valuable inventory with highest viewability is sold first?
  • What are our most valuable audiences in terms of viewability and sell-through?
  • What is the best way to balance our sales between programmatic and direct?

CORE KPIS TO ANSWER THOSE QUESTIONS

Here are common examples of core KPIs that can be enabled by Data Operations:

Common KPIs and DataOps

While it’s certainly possible to generate these metrics with manual techniques, you risk burning valuable talent on repetitive and error-prone tasks: logging in to numerous tools and user interfaces, downloading, scrubbing and wrestling unstructured CSV files into rudimentary spreadsheets. Data Operations automates these tedious processes, so questions that used to take weeks to answer can now be handled in as little as a few minutes.

Step 2: Refine Raw data into trusted “Foundational Data”

Once your data collaboration goals and key metrics are established, the technical work of producing value from your data begins. Various teams in an organization require different combinations of metrics, and raw data accessed directly from APIs or log files rarely provides a format useful for collaborative analysis. An emerging pattern to address this issue is to continuously and automatically transform raw data streams into a format useful to derive KPIs.

FOUNDATIONAL DATA: THE BASIS FOR TRUE KPIS

Just as a jet aircraft requires highly-refined aviation fuel to achieve its full potential, similarly the KPIs that drive your business decisions require data that is high quality, well-understood and highly reliable. Unfortunately, raw data that comes from vendors and third parties can be anything but. No two APIs are alike. Data must be cleaned, typed and sometimes enriched with match tables. Data formats change. Connectivity or vendor hiccups present the ever-present risk of data loss or data corruption.

Building meaningful KPIs directly from such data is impossible without taking on an enormous amount of complexity. Systems built upon such data start out brittle and eventually collapse under their own weight. With such a large gap between raw data and KPIs, experienced teams invest in an intermediate concept that is critical for success: foundational data. The key idea is to take each data source and normalize its data into standardized, canonical versions, ideally ones that can be easily combined with other, similarly refined data sources.
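To make the idea of a canonical version concrete, here is a minimal Python sketch of the renaming step. The source names (`ad_server`, `exchange`) and every column name in it are hypothetical, chosen only for illustration; a real mapping would come from each vendor's actual report layout.

```python
from typing import Dict, Iterable, List

# Hypothetical mappings from each vendor's column names to one canonical schema.
# These names are illustrative assumptions, not real API fields.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "ad_server": {"Date": "date", "Ad unit": "ad_unit", "Total impressions": "impressions"},
    "exchange": {"day": "date", "placement": "ad_unit", "imps": "impressions"},
}


def to_canonical(source: str, rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Rename each row's columns to the canonical names and tag its origin."""
    mapping = FIELD_MAPS[source]
    canonical_rows = []
    for row in rows:
        # Fail loudly if a vendor drops or renames a column we depend on.
        missing = [col for col in mapping if col not in row]
        if missing:
            raise ValueError(f"{source}: missing columns {missing}")
        out = {canonical: row[raw] for raw, canonical in mapping.items()}
        out["source"] = source
        canonical_rows.append(out)
    return canonical_rows
```

Once every source is expressed in the same column names, rows from different vendors can be stored and queried side by side. Values are still raw strings at this point; type-checking is a separate step.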

COMMON DATA SOURCES PUBLISHERS NEED TO NORMALIZE

First, start with the data sources. Within most organizations, the list of data sources you need to master is clear. However, each incremental data source adds new complexities. By understanding the distinctive properties and challenges presented by each data source, you’ll be able to make a more informed tool-selection decision based on the unique profile of your business.

Data and Integration Challenges

Disparate data sources, formats, and integration challenges will sap any budget, so what are publishers supposed to do? Rather than attempt to on-board every single data source for its own sake, try to understand the data characteristics of your business today, and where it will be tomorrow. This will help you understand how your raw data can evolve to become the foundation for the real metrics you need to succeed.

CREATING FOUNDATIONAL DATA—AN EXAMPLE USING DFP

Let’s say you’re most interested in creating foundational DFP data—because as your primary ad server, DFP’s data can provide a rich view of how display and certain video inventory is being delivered. Start with the DFP API. For brevity, we’ll assume that you’re already familiar with its quirks and limitations.

The first step is to determine the appropriate queries and granularity of data required. An important consideration is identifying the dimensions you really need as there are quota limits. Next is to use a script or a tool to invoke the API, extract and store the query result. It’s important to do this with 100% consistency so that query results maintain the same schema. Each row in the query result needs to be type-checked, i.e., numeric values must be cast into integers or floats if they are to have any value for calculations. Dimensions must be normalized in order to avoid textual inconsistencies, the result of occasional human input error, that can also throw off calculations. Finally, the query result needs to be written either to a file or, preferably, to a data warehouse, so that it can be consolidated for query-ability.

Additional considerations include how to extract key-values, so that the business attributes captured in custom dimensions are available for analysis, and how to handle historical backfill. Lastly, consider whether you need Data Transfer, the event-level server logs that provide the finest possible granularity and level of insight.

The steps above are the abbreviated set of tasks involved in creating foundational data for a selected data source. The processes involve numerous data cleansing tasks (file encoding, to name just one example) that go beyond the scope of this paper.
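The type-checking and dimension-normalization steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a DFP client: it assumes the query result has already been saved as CSV text, and the column names (`ad_unit`, `advertiser`, `impressions`, `clicks`, `revenue`) are hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple

# Assumed column names for an already-extracted report (hypothetical).
DIMENSIONS = ("ad_unit", "advertiser")
INT_COLUMNS = ("impressions", "clicks")
FLOAT_COLUMNS = ("revenue",)


def clean_row(row: Dict[str, str]) -> Dict[str, object]:
    """Normalize dimension text and cast numeric columns to real numbers."""
    cleaned: Dict[str, object] = {}
    for col in DIMENSIONS:
        # Collapse stray whitespace and case differences from manual entry.
        cleaned[col] = " ".join(row[col].split()).lower()
    for col in INT_COLUMNS:
        cleaned[col] = int(row[col].replace(",", ""))
    for col in FLOAT_COLUMNS:
        cleaned[col] = float(row[col].replace(",", ""))
    return cleaned


def clean_report(csv_text: str) -> Tuple[List[Dict[str, object]], List[Tuple[int, str]]]:
    """Return (good_rows, rejected), where rejected holds (line number, reason)."""
    good, rejected = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        try:
            good.append(clean_row(row))
        except (KeyError, ValueError, AttributeError) as exc:
            # Keep bad rows visible for review instead of dropping them silently.
            rejected.append((line_no, str(exc)))
    return good, rejected
```

Returning the rejected rows alongside the good ones is a deliberate choice: a malformed row that disappears quietly is harder to diagnose later than one that is reported at load time.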

Step 3: Transform Foundational Data into KPIs

GOING FROM RAW DATA TO KPIS REQUIRES A FEW STEPS

Once you’ve established a clear list of KPIs to measure performance, and marshaled the raw data from various systems and vendors, how do you connect these two concepts? In many cases, crucial KPIs must be derived from different data sources with unique schemas and formats.

The combined value of your publishing data is greater than the sum of its individual channels, but realizing that value requires preparing each data source according to its unique properties. DataOps automates many styles of data preparation into a single process that transforms raw ingredients from any data source to create “Foundational Data”. This powerful type of data can then be shaped into KPIs that power better business decision-making. The diagram below illustrates the gap foundational data fills to help you move from data to intelligence.

The gap between raw data and KPIs

EVEN A SINGLE KPI CAN REQUIRE BLENDING MULTIPLE SOURCES

To understand this better, let’s look at a KPI from Step 1 as an example: Sell-Through Rates. Let’s say Revenue Operations needs to pinpoint where the highest and lowest Sell-Through Rates occur across multiple properties. These metrics could be derived by analyzing delivery logs from the ad server. But what if the client demands only impressions that were viewable? Deriving the answer requires combining delivery data with impression data from the viewability measurement system.
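As a rough illustration of that blending step, the Python sketch below joins delivery rows to viewability rows on a shared key. It rests on several assumptions: both sources have already been normalized to the same `date` and `ad_unit` values, the field names are hypothetical, and viewable impressions are estimated by applying the vendor's measured viewable rate to delivered impressions, which is only one of several possible approaches.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (date, ad_unit): join key assumed to exist in both sources


def blend_viewability(
    delivery: List[Dict], viewability: List[Dict]
) -> Tuple[List[Dict], List[Key]]:
    """Join delivery rows to viewability rows and estimate viewable impressions."""
    # Viewable rate per key, computed from the measurement vendor's own counts.
    rates: Dict[Key, float] = {}
    for row in viewability:
        if row["measured"] > 0:
            rates[(row["date"], row["ad_unit"])] = row["viewable"] / row["measured"]

    blended: List[Dict] = []
    unmatched: List[Key] = []
    for row in delivery:
        key = (row["date"], row["ad_unit"])
        if key not in rates:
            # No usable viewability measurement: report it rather than guess.
            unmatched.append(key)
            continue
        blended.append({
            "date": row["date"],
            "ad_unit": row["ad_unit"],
            "impressions": row["impressions"],
            "viewable_impressions": row["impressions"] * rates[key],
        })
    return blended, unmatched
```

The `unmatched` list matters as much as the blended rows. Keys that appear in one system but not the other are usually the first sign of a naming inconsistency or a late-arriving feed.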

FOUNDATIONAL DATA: A SCALABLE PUBLISHING DATA HIERARCHY

The kind of data blending described above quickly gets unwieldy without a structured approach. Foundational Data is a layer of data derived from raw data. The focus is on modeling canonical “base metrics,” such as Impression delivery, Viewability and Revenue. These base metrics can then be used to derive more complex KPIs, for example, Viewable Sell-Through or eCPM.
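Once base metrics exist, the derived KPIs are often simple arithmetic. The Python sketch below uses commonly cited definitions (eCPM as revenue per thousand impressions, sell-through as sold impressions divided by available impressions); your organization may define these differently, and the function names are our own.

```python
from typing import Optional


def ecpm(revenue: float, impressions: int) -> Optional[float]:
    """Effective revenue per thousand impressions; None when there is no delivery."""
    if impressions <= 0:
        return None
    return revenue * 1000.0 / impressions


def sell_through_rate(sold: float, available: float) -> Optional[float]:
    """Share of available inventory that was sold; None when nothing was available."""
    if available <= 0:
        return None
    return sold / available


# Viewable sell-through reuses the same ratio on viewable counts only
# (assuming both counts come from the same foundational data).
def viewable_sell_through(sold_viewable: float, available_viewable: float) -> Optional[float]:
    return sell_through_rate(sold_viewable, available_viewable)
```

Returning `None` instead of zero for empty inventory keeps "no data" distinct from "no sales" in downstream reports.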

There are huge benefits to the foundational data approach. It enables you to establish common data standards between multiple teams across the organization. It also encourages efficient reuse of data for different analyses and reports. When different analyses or reporting starts to be based upon the same set of foundational data, your entire organization benefits through more timely and accurate communication, reduced errors, and ultimately, better decision-making.

Foundational Data provides the basis for a scalable data hierarchy

You may have already created a data science team to establish foundational data. While data scientists can help to combine raw data and analyze the KPIs, even the most skilled data science teams lose valuable time deriving a KPI from scratch. When publishing sales and executive leadership come to depend on this data, they will inevitably demand fast and reliable reporting of these metrics. To do that, you’ll need to make sure you pick the right tools.

Step 4: Choose the Right Tools to Automate Data Operations and Reporting

The value of the metrics from the last section of this paper should be unassailable to digital publishers, which raises the question: “Why aren’t we doing this now?” The most common reason is operational capability. KPIs and foundational data can only be useful if the integrity and dependability of the underlying data are unquestionable. Such an endeavor is a complex technology problem, requiring real-time integration of heterogeneous data streams on top of a rock-solid and highly scalable operations platform. To address these issues, a Data Operations approach takes advantage of scalable technology, including Cloud Data Warehouses, as well as software that monitors performance and validity at every step of the data pipeline.

HARNESSING RAW DATA WITH EFFICIENCY AND SCALE

With a prioritized list of data sources and an understanding of how each data source and KPI will be handled, next you will want to weigh carefully several important considerations to ensure data sources are managed in a cost-efficient and scalable manner. For example:

  • Monitoring: How will third party API uptime be monitored to ensure reliable delivery?
  • Problem triage: Once we’re aware of a problem, how will we pinpoint if it’s coming from the APIs, or the data warehouse, or somewhere in between?
  • Data quality: If a segment of data fails to load, or some portion of the data was malformed, how will we know? And how will we recover?
  • Data synchronization: How will we ensure that KPIs that depend on multiple sources have up-to-date components?
  • Change management: What happens when an API or data format changes?
  • Data scale: How will we scale up our processing capacity to handle event-level data that grows to billions of rows per month?
  • Operational re-use: Can the capabilities developed for one set of data sources be applied to all of my APIs and file-based sources, so that teams can collaborate using a single approach?
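Several of the considerations above (data quality, change management, and data synchronization) reduce to checks that can run automatically after every load. The Python sketch below shows one possible shape for such checks; the function names, thresholds, and source names are assumptions made for illustration, not a description of any particular product.

```python
from datetime import date, timedelta
from typing import Dict, List, Set


def check_load(rows: List[Dict], expected_columns: Set[str], min_rows: int) -> List[str]:
    """Return a list of human-readable problems found in one loaded batch."""
    problems = []
    # Data quality: a suspiciously small batch often means a partial load.
    if len(rows) < min_rows:
        problems.append(f"row count {len(rows)} below expected minimum {min_rows}")
    # Change management: report the first row whose columns differ from the schema.
    for index, row in enumerate(rows):
        columns = set(row)
        if columns != expected_columns:
            problems.append(
                f"row {index}: schema drift, missing={sorted(expected_columns - columns)}, "
                f"extra={sorted(columns - expected_columns)}"
            )
            break
    return problems


def stale_sources(latest_loaded: Dict[str, date], as_of: date, max_lag_days: int = 1) -> List[str]:
    """Data synchronization: name the sources whose newest data is too old."""
    cutoff = as_of - timedelta(days=max_lag_days)
    return sorted(name for name, loaded in latest_loaded.items() if loaded < cutoff)
```

A KPI that blends several sources would typically be published only when `check_load` returns no problems for each batch and `stale_sources` returns an empty list for the sources it depends on.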

This is not a publishing problem; it is a data problem that requires automation. So, as a publisher, you can try to build an expensive and highly specialized team to write and maintain custom infrastructure, or hire high-priced consultants to do the same thing. Neither approach will deliver a long-term solution that takes advantage of DataOps best practices in a cost-effective way.

CLOUD DATA WAREHOUSES MAKE CONSOLIDATING DATA SIGNIFICANTLY EASIER

Consolidating data streams into one place requires tools that are always available, and scale affordably to handle growing amounts of data. Within the past few years, two commercial cloud-hosted solutions, Google’s BigQuery and Amazon RedShift, have proven themselves best-in-class for this task. However, some publishers still invest in a “Do-It-Yourself” approach, using traditional IT tools, developing custom software, and staffing ops engineers to maintain on-premises systems.

COMPARING THREE CLOUD APPROACHES: BIGQUERY, REDSHIFT, DIY

Let’s dig into the details of the three most common solutions publishing teams are using to drive business performance:

  1. BigQuery – Launched in 2011, Google’s BigQuery product is an analytical data warehouse. BigQuery can hold an unlimited amount of data, completely hosted in the Google cloud, so there is no on-premises IT to manage. One of BigQuery’s core strengths is lightning-fast speed. Terabyte-scale datasets can be queried in seconds using SQL, a query language well understood by many data analysts. Pricing is based on how much data you query, so you only pay for resources you use. Furthermore, you can use permissions based on Google accounts to define access for your users, which streamlines use and security.
    An important thing to note about BigQuery is that while it’s great for consolidating and querying data, it’s not designed to handle data that must change over time. Once the data is in BigQuery, it requires some developer expertise to manage. Structuring your data intelligently on BigQuery is important to get right because that structure will impact your long-term Total Cost of Ownership. Because pricing is predicated on how much data you query each time you query it, the importance of understanding the nuances of data structure can’t be overstated. For example, using daily tables reduces the total rows scanned per query, which makes reporting more cost-effective.
  2. RedShift – Launched in 2013, Amazon’s RedShift product is completely hosted on AWS. As an extension of the AWS ecosystem, it offers easy integration with other AWS products and will require some IT/admin expertise to manage. RedShift looks and acts more like a traditional database than BigQuery. However, with RedShift, you pay by instance, not by the query. This means you’ll most likely need to have an administrator to manage your instances, and the administration of a lot of instances can be complicated and time-consuming.
    One of RedShift’s benefits is that it behaves like a standard PostgreSQL database and supports common drivers. However, RedShift can struggle when populated with terabytes of data. As a result, one common evolution path we’ve seen is publishers starting with RedShift, then migrating to BigQuery once they reach a tipping point of cost and complexity. Note that migration between big data solutions can be hard and tedious if you do it without sophisticated developer expertise, so there are benefits to working with partners who have proven expertise to simplify the process.
  3. Do-It-Yourself (DIY) – One of the great things about working with enterprise technology is the can-do-it ethos that permeates our industry. Explaining the DIY approach in full is outside the scope of this paper, but it’s important to address because a number of companies do elect to go in this direction. The benefit of selecting a DIY solution is that you can tailor your solution to the unique demands of your business, to your existing IT infrastructure, and based upon in-house developer expertise.
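To illustrate the daily-table point made in the BigQuery item above, the Python sketch below generates one table name per day and compares how many rows a query would touch for a short date range versus the full history. The `impressions_` prefix and the row counts are hypothetical, and rows are used only as a rough proxy for the amount of data scanned.

```python
from datetime import date, timedelta
from typing import Dict, List


def daily_table_names(prefix: str, start: date, end: date) -> List[str]:
    """One table name per day, suffixed YYYYMMDD, for an inclusive date range."""
    if end < start:
        raise ValueError("end date must not be before start date")
    day_count = (end - start).days + 1
    return [f"{prefix}{(start + timedelta(days=n)):%Y%m%d}" for n in range(day_count)]


def rows_scanned(rows_per_table: Dict[str, int], tables: List[str]) -> int:
    """Total rows a query would touch if it reads only the named tables."""
    return sum(rows_per_table.get(name, 0) for name in tables)


# Hypothetical example: a year of history, one million rows per day.
history = daily_table_names("impressions_", date(2016, 1, 1), date(2016, 12, 31))
row_counts = {name: 1_000_000 for name in history}

last_week = daily_table_names("impressions_", date(2016, 12, 25), date(2016, 12, 31))
weekly_rows = rows_scanned(row_counts, last_week)
full_rows = rows_scanned(row_counts, history)
```

In this made-up scenario, a report that needs only the last week reads seven tables instead of the whole year, which is the effect the daily-table structure is meant to achieve.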

FACTORS TO CONSIDER IN BUILD VS. BUY

Regardless of your approach, there are a variety of concerns of which to be aware. First, you’ll need to understand the data requirements of your business and the talent you have supporting it. Like most technical challenges that require significant engineering expertise, publishers face a fundamental “build vs. buy” decision based on the following factors:

  • Data Use – The volume of data and the frequency of reporting/queries you intend to run are important factors that will help you pick the right solution
  • Budget – In addition to in-house staff, it’s important to keep ongoing maintenance costs in mind from both a personnel and licensing perspective
  • Talent – Some publishers are lucky to have sophisticated data analysts on staff, while others may not have a deep technical team on which to rely. These highly specialized professionals can be expensive to hire, and challenging to retain
  • Urgency – Because of your current technical infrastructure and the capabilities of your competitors, you may need to move more quickly

That said, the build vs. buy debate comes into sharp focus with DIY, especially re-sharding large datasets, because it requires such specialized expertise to create and manage. If you’re contemplating the DIY approach, we encourage you to consider if big data is a core competency you need to build within your organization, or if data analysis based on the use of such data is what you really want. If it’s the latter, then you’ll probably want to buy vs. build to keep your technical experts focused on analysis of data and creating solutions on top of that data that will add differentiating value to your business.

The table below summarizes key considerations for the three approaches discussed.

Key DataOps Considerations for Publishers

ROI in Digital Media

Leveraging previously hard-to-use data to increase the performance of publisher operations is an exciting new stage in the evolution of digital media. By combining data expertise with the wealth of fine-grained data available from so many data sources, it’s an exciting time to be in Sales, Revenue Operations, AdOps or Data Science in this industry.

As digital media continues on this trajectory, compelling ROI data from early Switchboard customers is proving the value of DataOps in the industry. Here are summary ROI stats from two companies with whom we’ve been working:

MIGHTYHIVE

  • Normalized data from 60 billion transactions per month for real-time analysis of programmatic advertising campaigns
  • Switchboard saved MightyHive an estimated $400K for a turnkey solution vs. building in-house
  • Increased reporting capabilities for customers by 225x vs. previous manual reporting process
  • Slashed fraudulent activity by 85% for Switchboard-prepared data

VICE MEDIA

  • VICE automatically prepares 50TB of annual ad and bid logs into Google’s Cloud Platform with zero additional IT overhead
  • Switchboard accelerated VICE’s most complex big-data initiative by 6-9 months vs. internal estimates for a DIY solution
  • VICE saved an estimated 9,000 developer hours by using Switchboard’s automation and data preparation platform
  • VICE empowered business leaders with real-time decision making based on international cross-platform reports

Conclusion

As you know, bringing the diversity and volume of data that exists within a publishing organization into a single revenue-driving software system isn’t easy, but the rewards are worth the effort. Once the operational foundation is in place to turn raw data into foundational data and KPIs, you can move rapidly towards advanced analytics such as working with event-level data. At that point, insights such as user behavior cohorts, programmatic bid-ask spreads, and fine-grained revenue forecasting will be within your grasp.

It’s important to make an honest assessment of your company’s skills and existing infrastructure to decide which approach will be right for you. When the time comes, use that same critical eye to gauge what kind of data consumption you expect in order to pick the right solution architecture for your business.

At Switchboard, we understand that the most difficult challenges when working with media data involve the engineering effort necessary to maintain flexible, scalable and operationally robust data flows. Using a tool like Switchboard will help keep valuable technical staff focused on helping make decisions instead of learning a complex new engineering competency.

The results we’ve seen from using Data Operations techniques in publishing have been compelling, but every publisher faces unique challenges when adapting this technology to their business. If you’d like to learn how we can help you, contact us at sales@switchboard-software.com.
