When we want to understand campaign effectiveness and the impact of changing individual variables in our marketing campaigns, we often use an A/B test.
If we want to understand the impact of creative changes, for example, we’d devise an A/B test in which half of the audience sees the original creatives, and the other half sees the new creatives. We then run the ads to each audience segment over a period of time, look at the difference in performance, and assess whether one set of creative performed better than the other.
A/B tests are very straightforward to set up and run. But not everything can be measured via an A/B test, and with user-level data becoming harder and harder to access, it’s good to have some alternatives. Today we’ll discuss the limitations of A/B tests, and how difference-in-differences (DID) testing can help.
Promotions

To A/B test the effect of a promotion fairly, you have to ensure that just half of your audience can see the promotion. This is far easier said than done.
It’s simple to ensure that only half of a specific channel sees your promotion — emailing only half of your audience about it, for example. But there’s no reliable way to isolate the same half of your audience across every channel. Without that isolation you lose your control group, which is problematic: you can no longer tell which users were exposed to the promotion and which weren’t.
Above the line (ATL) campaigns
ATL refers to mass-media campaigns targeted at a broad audience. It typically involves channels like television, radio, or print, but can also include digital channels like YouTube and digital audio when they’re targeted broadly.
The ideal way to measure the effectiveness of these channels is through a specific type of A/B test called a lift test. A lift test works by running campaigns to only half of your target demographic, and measuring the difference in conversions between the two groups after a period of time — that is, measuring the “lift” the ad was responsible for.
This methodology doesn’t work in the context of ATL campaigns, though. With channels like television, there’s no unbiased way to split your target audience into groups of people who do and don’t see your ads — a necessary step in running a fair lift test. As a result, you can’t use a standard A/B test setup to measure the effectiveness of most ATL campaigns.
In both of our examples above, we’re trying to measure the effect of something. And while A/B testing is an inadequate solution, we have some other options.
Difference-in-differences (DID) testing is a method that measures the effect of a change by looking at relevant metrics before and after the change, and then comparing how they moved against a reliable benchmark. The benchmark is critical to DID testing; it’s what allows us to isolate the impact of the intervention from any changes that would’ve happened without the intervention.
Let’s say we want to measure the effectiveness of a particular ATL channel on our brand’s conversions.
Now, you might wonder, why can’t we just run the ATL channel for a specific time period and simply look at the difference in conversions before and after? It’s a reasonable idea in theory, but if conversions are affected by seasonality, or any other time-dependent factor, then we won’t get reliable results.
Instead, we can run the ATL channel in a few select states (which we’ll call our “ATL states”), and use the remaining states as potential benchmarks (our “benchmark states”). We run the ATL campaign in our ATL states, and then we compare our conversion numbers before and after the campaign to an appropriately chosen benchmark.
At a basic level, the impact of our ATL channel is the difference between:

- the conversions we actually observed in our ATL states during the campaign, and
- the conversions our benchmark states predict we would have seen without it.
Since our benchmark states help predict what would have happened in our ATL states under ordinary circumstances, the difference between the prediction and actual results is attributable to the ATL campaigns.
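To make the arithmetic concrete, here’s a minimal sketch of the DID calculation in Python. All of the conversion numbers are invented purely for illustration:

```python
# Weekly conversions before and after the campaign (invented figures).
atl_pre, atl_post = 1000, 1400      # states where the ATL campaign ran
bench_pre, bench_post = 800, 880    # benchmark states (no campaign)

# Change in each group over the same period.
atl_change = atl_post - atl_pre        # 400
bench_change = bench_post - bench_pre  # 80

# The benchmark change estimates what would have happened anyway
# (seasonality, market trends), so subtracting it out leaves the
# lift attributable to the campaign.
did_estimate = atl_change - bench_change
print(did_estimate)  # 320
```

In practice you’d usually work with rates or percentage changes rather than raw counts, since the ATL and benchmark states won’t be the same size — but the before/after, treatment-minus-benchmark logic is the same.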
When choosing a benchmark for DID testing, it’s important that the behavior of benchmark states correlates highly with the states where we’re making a change (that is, running ATL). This is critical for ensuring accurate predictions.
There are a few options for choosing a benchmark. You could manually work out correlation coefficients between your ATL states and all your potential benchmarks, and then choose the states where your brand’s conversions are most highly correlated. Alternatively, you could use a package like Google’s CausalImpact to help automate the process.
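The manual approach can be sketched in a few lines of NumPy. The state names and weekly conversion figures below are placeholders, not real data:

```python
import numpy as np

# Hypothetical weekly pre-campaign conversions in our ATL states.
atl = np.array([120, 130, 125, 140, 150, 145], dtype=float)

# Candidate benchmark states (invented series).
candidates = {
    "state_A": np.array([60, 65, 62.5, 70, 75, 72.5]),    # tracks ATL closely
    "state_B": np.array([90, 80, 100, 70, 110, 60]),      # noisy, uncorrelated
    "state_C": np.array([200, 215, 205, 230, 245, 238]),  # similar, but noisier
}

# Pearson correlation of each candidate with the ATL series.
correlations = {
    state: float(np.corrcoef(atl, series)[0, 1])
    for state, series in candidates.items()
}

# Pick the most highly correlated state as the benchmark.
best = max(correlations, key=correlations.get)
print(best)  # state_A
```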
Packages like CausalImpact have the additional advantage of grouping different benchmark geographies together, which produces more reliable benchmarks than you’d get from a single geography.
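CausalImpact itself fits a Bayesian structural time-series model, but the pooling idea can be sketched with a simple least-squares fit: learn weights over several benchmark states in the pre-period, then use the weighted combination to predict the post-period counterfactual. All figures below are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)
weeks_pre, weeks_post = 12, 6

# Simulated weekly conversions for three benchmark states.
bench_pre = rng.normal(100, 5, size=(weeks_pre, 3))
bench_post = rng.normal(100, 5, size=(weeks_post, 3))

# Simulated ATL-state conversions: a mix of the benchmarks, plus a
# lift of 30 conversions per week once the campaign starts.
true_weights = np.array([0.5, 0.3, 0.2])
atl_pre = bench_pre @ true_weights
atl_post = bench_post @ true_weights + 30

# Fit benchmark weights on the pre-period only.
weights, *_ = np.linalg.lstsq(bench_pre, atl_pre, rcond=None)

# Predict the post-period counterfactual and estimate the lift.
counterfactual = bench_post @ weights
lift = float(np.mean(atl_post - counterfactual))
print(round(lift, 1))  # 30.0
```

Pooling helps because the idiosyncratic noise in any single geography partly cancels out in the weighted combination, giving a steadier prediction of the counterfactual.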
The benefits of DID testing go beyond just measuring ATL channels. DID can help you measure the impact of any change when you don’t have access to user-level data, or when you can’t neatly split users into the control and experiment groups needed to carry out an A/B test. With advertisers’ access to user-level data slowly diminishing, the value of DID testing will only grow.
While calculating results from a DID test can be a little technical, simply knowing these kinds of tests are an alternative option to A/B testing can be extremely valuable for modern advertisers.
If you want to learn more about how to run effective marketing campaigns, let’s talk.