The mobile app market is highly competitive these days. This is certainly good news for consumers: strong competition in every niche and category means more quality apps reaching the app stores.
But it also means that it is becoming more and more expensive for developers to create and promote a product. The cost of acquiring users is going up, and it's getting harder and harder to scale app revenue. At the same time, the subscription-based monetization model remains the foundation for generating impressive profits.
A/B testing, or experimentation, which many people are familiar with, can help you keep growing subscription revenue under these conditions.
How do you test different paywalls, prices, and subscriptions in practice?
The implementation can vary: you can build an in-house system, use Firebase, or adopt a dedicated paywall testing tool.
Let's take a quick look at the pros and cons of these solutions.
An in-house solution may be appropriate for teams with deep in-house expertise in building revenue analytics and mathematical split-testing algorithms. The advantage, in this case, is the flexibility of such a system. The disadvantage is the complexity of implementation and testing.
Firebase is a popular solution from Google that is well-suited for testing product hypotheses and running simple UI tests. The disadvantages: it can take a long time to receive the assigned variation (critical for tests shown on the first app launch), and there are limits on the number of event parameters. Firebase is also not designed for adjusting tests on the fly.
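To make the first-launch problem concrete, here is a minimal Swift sketch of a paywall test driven by Remote Config. The parameter name, the default value, and the assumption that Firebase is already configured at app startup are all illustrative; the point is that until the first fetch completes, the user can only see the in-app default.

```swift
import FirebaseRemoteConfig

// Minimal sketch of a client-side paywall test driven by Firebase Remote Config.
// The parameter name "paywall_variant", its "control" default, and the assumption
// that FirebaseApp.configure() has already been called are illustrative only.
final class PaywallVariantProvider {
    private let remoteConfig = RemoteConfig.remoteConfig()

    init() {
        // Without an in-app default, a user who opens the paywall before the
        // first fetch completes would see an undefined variant.
        remoteConfig.setDefaults(["paywall_variant": "control" as NSObject])
    }

    func loadVariant(completion: @escaping (String) -> Void) {
        // On a cold start this network round trip is the bottleneck mentioned
        // above: if it is slow, the user gets the default paywall no matter
        // which variation they were assigned to.
        remoteConfig.fetchAndActivate { [weak self] _, _ in
            let value = self?.remoteConfig.configValue(forKey: "paywall_variant").stringValue
            completion(value ?? "control")
        }
    }
}
```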
A dedicated paywall testing solution is best suited for this particular task. We at Apphud have created a tool called Experiments that lets you run an unlimited number of tests without uploading new builds to the store and test up to 5 variations simultaneously.
Well, we've chosen a solution for the test, and we've figured out what we want to test. Now, let's go over some general things to keep in mind while running the test.
For example, you see a low conversion rate on the first purchase and want to increase it by emphasizing product benefits and the buy button.
But it could be that people aren't buying because of an unfortunate bug or omission. In this case, a split test will do little to help you. It's worth taking a closer look at the product analytics and bug logs to formulate a more meaningful hypothesis.
To be sure you understand exactly which change affected the outcome of the experiment, you cannot mix several different hypotheses within one test.
On average, it takes 2-3 weeks to test one hypothesis in one experiment. A sufficient amount of traffic is also required to obtain a statistically significant result.
This is the most important rule for successful hypothesis testing and getting a true result. After launch, you must not ship new releases or change application features via Remote Config if these changes would affect the user experience in the running experiment.
Although you cannot make changes to a running test, it is possible to adjust the distribution of traffic between variations. For example, if you are worried that a new variation will have a strongly negative effect, you can initially allocate 10-20% of the traffic to it and increase the share to 50% (in a test with 2 variations) once the test shows no dramatic drop in the metrics.
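Mechanically, such a ramp can be pictured with the simplified Swift sketch below. All names and weights are invented, and in practice your A/B testing tool handles assignment for you; the key idea is that a user is assigned a variant once and keeps it, so changing the weights later only affects users who enter the experiment afterwards.

```swift
import Foundation

// Simplified illustration of a ramped traffic split: a user is assigned once,
// the assignment is persisted, and changed weights only affect users who enter
// the experiment later. All names and weights here are made up.
struct ExperimentAssigner {
    let experimentKey = "paywall_test_v1"

    func variant(weights: [String: Int]) -> String {
        let storageKey = "assignment_\(experimentKey)"
        // Users already in the test keep their original variant.
        if let existing = UserDefaults.standard.string(forKey: storageKey) {
            return existing
        }
        // Weighted random pick for a new user, e.g. ["control": 80, "new_paywall": 20].
        let total = max(weights.values.reduce(0, +), 1)
        var roll = Int.random(in: 0..<total)
        var chosen = weights.keys.sorted().first ?? "control"
        for (name, weight) in weights.sorted(by: { $0.key < $1.key }) {
            if roll < weight { chosen = name; break }
            roll -= weight
        }
        UserDefaults.standard.set(chosen, forKey: storageKey)
        return chosen
    }
}

// Start cautiously with 20% on the new paywall, then widen the split to 50/50
// for later users once the metrics look safe.
let variant = ExperimentAssigner().variant(weights: ["control": 80, "new_paywall": 20])
```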
Unfortunately, this is the average statistic on the market: the vast majority of experiments fail, either by not showing a statistically significant improvement or by showing a statistically significant decline. Quality analytics and market analysis can help increase the share of successful experiments, and with regular, iterative testing the quality of your hypotheses grows over time.
In an ideal world, we would want to test every possible change to a product. But as we know, testing every hypothesis takes a lot of time and traffic. So, prioritize and test only those changes that can have a significant impact.
Don't stop testing after one or more experiments fail. Remember that testing is a continuous, iterative process. Accumulate knowledge and draw conclusions from testing, which will help you formulate better hypotheses and conduct successful experiments in the future.
It is considered bad practice to monitor interim results during testing, as it can distort your perception of the experiment's final outcome. However, if you don't interfere in the course of the experiment and don't rush to declare one variant the winner, it can be useful to watch the analytics of a running A/B test. For example, you can stop a clearly unsuccessful experiment without wasting additional time, or redistribute the traffic between the tested variants.
To illustrate how a poorly designed experiment can affect the outcome, let's break down an example.
Suppose we have a subscription app that attracts traffic through Apple Search Ads. However, the app has a simple paywall whose design hasn't been tested in a long time.
Recognizing the opportunity to increase conversion rates, the app owners drew 9 different variations of the app paywall to test and compare with the current one.
At first glance, it seems like a great experiment - we are testing a lot of visual solutions at once, and we should definitely find the best one with the highest conversion rate.
But before we run such a test, let's take a look at how many users we need to successfully complete it.
Let's say we want to run the test in a country where users have roughly the same purchasing power, for example, the USA. We know that in this country the average conversion rate of the current paywall is 2%, and we hope to increase it by 25% (a relative value, i.e., to 2.5%).
Now let's predict how many users would be needed for such a test and how long it would take.
So what are we seeing? Even if we get 1,000 new users per day in the US from Apple Search Ads (which seems very optimistic), it would take us a full 127 days to see a statistically significant change!
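For reference, here is a rough version of that arithmetic in Swift, assuming a standard two-proportion test with a 95% confidence level and 80% power. The exact day count depends on the calculator and its settings, but it lands in the same range as the figure above.

```swift
import Foundation

// Back-of-the-envelope sample size for a two-proportion test, assuming a 95%
// confidence level (z ≈ 1.96) and 80% power (z ≈ 0.84). Other calculators and
// corrections for many simultaneous variations give somewhat different numbers,
// but the order of magnitude is the same.
func requiredSamplePerVariation(baseline p1: Double, target p2: Double,
                                zAlpha: Double = 1.96, zBeta: Double = 0.84) -> Int {
    let pBar = (p1 + p2) / 2
    let a = zAlpha * (2 * pBar * (1 - pBar)).squareRoot()
    let b = zBeta * (p1 * (1 - p1) + p2 * (1 - p2)).squareRoot()
    return Int(((a + b) * (a + b) / ((p2 - p1) * (p2 - p1))).rounded(.up))
}

let perVariation = requiredSamplePerVariation(baseline: 0.02, target: 0.025)
let variations = 10          // the current paywall plus 9 new designs
let usersPerDay = 1_000      // optimistic Apple Search Ads volume
let totalUsers = perVariation * variations
let days = (Double(totalUsers) / Double(usersPerDay)).rounded(.up)
print("≈\(perVariation) users per variation, ≈\(totalUsers) in total, ≈\(Int(days)) days")
// Roughly 14,000 users per variation - well over four months of traffic.
```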
If we hadn't modeled the experiment and done the calculations above, we could have run the test and waited forever. In this case, we need to select a smaller number of the most likely hypotheses, run a quick test, and then iteratively run other experiments.
Thus, the design of the experiment itself can significantly affect its outcome and, in some cases, ruin the experiment even before it is actually run.
Apphud supports the ability to test up to 5 independent variations simultaneously. This is sufficient for any reasonable experiment.
Another situation: we regularly run split tests, and the results seem suspicious. Sometimes completely strange hypotheses that are hard to explain logically turn out to work, while more reasonable assumptions do not.
Of course, the results of experiments don't always match our expectations, but if a series of experiments doesn't paint a coherent picture of the project's development and the application's revenue doesn't behave as planned, it makes sense to dig deeper.
This is where A/A testing comes in. The point is that even if we have formulated the hypothesis and prepared the experiment correctly, heterogeneous traffic can produce differences caused by the traffic itself rather than by changes in the product.
What is the purpose of these experiments?
We test 2 identical variants (usually default ones) against each other on a user volume equal to the expected sample size for A/B testing. As a result, we should see no statistically significant difference in the metric being tracked. This means that our target audience is behaving approximately the same in the 2 groups and there are no factors that can distort the results of the experiment.
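In practice, the check an A/A test relies on can be as simple as a two-proportion z-test on the two identical groups. Here is a small Swift sketch with invented numbers:

```swift
import Foundation

// The check an A/A test boils down to: a two-proportion z-test on two identical
// groups. The conversion numbers below are invented for the example. If |z|
// regularly exceeds ~1.96 (p < 0.05) on identical variants, the traffic split
// itself is likely skewed.
func zScore(conversionsA: Int, usersA: Int, conversionsB: Int, usersB: Int) -> Double {
    let pA = Double(conversionsA) / Double(usersA)
    let pB = Double(conversionsB) / Double(usersB)
    let pooled = Double(conversionsA + conversionsB) / Double(usersA + usersB)
    let se = (pooled * (1 - pooled) * (1.0 / Double(usersA) + 1.0 / Double(usersB))).squareRoot()
    return (pA - pB) / se
}

// 2.1% vs 1.9% conversion on 14,000 users per group.
let z = zScore(conversionsA: 294, usersA: 14_000, conversionsB: 266, usersB: 14_000)
print(String(format: "z = %.2f", z)) // ≈ 1.2, i.e. no statistically significant difference
```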
When we run experiments, we want to be sure that we have interpreted the results correctly and that we know all the key metrics of the test. Analytics can help us do this.
So, how do you look at these metrics and draw the right conclusions? Conversion metrics work well right after you run an A/B test because you can get quick results. ARPU/ARPPU/ARPAS, on the other hand, are cohort metrics that show subscription revenue growth over time.
For example, if we're testing button color or text, we're primarily looking at trial/subscription conversion rates, but if we're testing different products and their prices, we're interested in long-term ARPU/ARPPU/ARPAS.
Another example: if you increase the price of a subscription, you will almost certainly hurt the conversion rate, which could be read as a failure. But we experiment with paywalls, among other things, to increase LTV and revenue on a 1-2 year horizon, so if a more expensive product pays off over that horizon, the experiment is still a win.
A great case study on product testing comes from AEZAKMI Group. Here we see a simple example of an A/B test using Apphud Experiments that compares 2 different products: a subscription with a trial and one without.
According to the results, even such a seemingly simple change resulted in an excellent increase in key metrics!
It's easy to see that the conversion rate for a trial start is almost twice that of a monthly subscription.
It's also important that trials let you track ARPU, a metric that captures both conversions and rebills. After all, one subscription may convert very well but be cancelled immediately, while another may convert 2 times worse yet renew 4 times better, which makes it more profitable in the long run.
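To make that argument concrete, here is a toy Swift calculation with invented prices, conversion rates, and renewal rates:

```swift
import Foundation

// Toy comparison behind the argument above. Prices, conversion and renewal rates
// are invented; expected payments per subscriber are approximated with a
// geometric series over a 12-month horizon.
func arpu(conversion: Double, price: Double, renewalRate: Double, periods: Int) -> Double {
    var expectedPayments = 0.0
    for period in 0..<periods {
        expectedPayments += pow(renewalRate, Double(period)) // share still subscribed
    }
    return conversion * price * expectedPayments
}

// Variant A converts well but almost everyone cancels after the first month.
// Variant B converts 2x worse but renews 4x better.
let arpuA = arpu(conversion: 0.04, price: 9.99, renewalRate: 0.20, periods: 12)
let arpuB = arpu(conversion: 0.02, price: 9.99, renewalRate: 0.80, periods: 12)
print(String(format: "A: $%.2f, B: $%.2f per install", arpuA, arpuB))
// B earns roughly twice as much per install despite converting half as well.
```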
Returning to the case study, we see an ARPU increase of almost 2x, which is a great result!
To keep an application competitive, whatever its niche, you need to constantly test hypotheses and run experiments. This is the only way to achieve an excellent ROI and scale the product (and therefore profit).
Apphud's A/B testing tool is ideal for the task of testing different products as well as paywall design or onboarding screen design. Sign up and try now!
Have fun experimenting and growing your apps!