A/B testing in Google Play: Step-by-step guide to Store listing experiments
A/B testing for mobile apps is one of the most powerful methods for pushing your app’s performance and visibility. The logic and goal behind this concept are simple - by testing different app elements, you should be able to find the best-performing metadata and creative assets.
Google Play offers an A/B testing feature called Store listing experiments. The feature is available to all app and game publishers inside Google Play Console. Although other paid platforms allow A/B testing of multiple elements, Google Play offers Store listing experiments for free.
Knowing which app elements are essential for users in the app stores is one of the most critical aspects of effective app store optimization (ASO) for Google Play. As app marketers, we should perform regular A/B testing to determine which app elements have the most significant impact on conversion rates and, consequently, on higher visibility, more store listing visitors, and app installs.
This article will give you a quick overview of Store listing experiments in Google Play. We will also explain how to start testing your store listings, including best practices and the pros and cons of A/B testing.
What are Store listing experiments in Google Play?
Store listing experiments are a native A/B testing tool for Android apps. App publishers and ASO experts can use this tool to find the best-performing metadata and visual assets that impact the app conversion rates.
Most app publishers will have different messages and images for different localizations in Google Play. Store listing experiments are a great way to test your hypotheses and check how your assets perform compared to each other and your expectations.
Why should you do mobile A/B testing in the first place?
A/B testing for mobile apps allows you to try out different ideas and explore opportunities that can impact your app conversion rate. Being able to rank in Google Play or App Store is not enough - you need to hold high keyword rankings and simultaneously appeal to the users that land on your store listing and convert them into installers and app users.
Once users come to your store listing, you must convince them to install an app or a game. Store listing creatives are great for that and significantly impact the conversion rate.
So how can A/B testing help you increase those conversion rates?
Here is what you can do with a proper A/B testing strategy in place:
- Find metadata elements (name, short and long description) that resonate the best with your target audience
- Locate graphics and creative assets that people like
- Get more app installs
- Boost the retention of your users
- Tap into the granular aspects of how users behave
- Get insights on the elements that are valuable to local audiences
- Test big and small changes and seasonality effects
- Improve the general knowledge about the efficiency of app elements
What can you A/B test in Google Play?
There are six app elements that you can A/B test in Google Play:
- App icon
- Feature graphic
- Screenshots
- Promo video
- Short description (only available for localized experiments)
- Full/long description (only available for localized experiments)
Check our Google Play academy to better understand each element and why it is essential for Google Play ASO. And if you want to learn how to do A/B testing with iOS apps, read our guide to Product Page Optimization in Apple's App Store.
Unfortunately, you cannot test app names with Google Play’s Store listing experiments or with Apple’s Product Page Optimization. Nevertheless, Store listing experiments allow you to test all other vital elements, which makes them very convenient for Android publishers.
To test app titles, you will need to consider paid tools like Splitmetrics or Storemaven. While these tools can help you with this, you should be aware that they use different approaches for A/B testing. But if you want to test every aspect of your store listing, check out those tools.
Understanding the terminology
Before diving into the specifics of Store listing experiments, you should ensure you understand the most important terminology. It will help you with interpreting test results and allow you to make smarter decisions.
- Target metric is essential for determining the experiment result. You can choose between retained first-time installers and first-time installers (which doesn't consider any retention metric). Both metrics refer to users who installed the app for the first time. Still, the retained option looks at users that kept an app installed for at least one day, which is a more appropriate target metric because those people are generally the ones we are interested in.
- Testing variants. For each test you run, you can choose one or more experimental variants to test against the current store listing. With a single variant, your whole experiment audience sees that one variant. You can test up to three variants at once, which saves you from running separate tests one after another, but it also reduces the audience that each variant receives.
- Experiment audience. This is the percentage of store listing visitors who will see a test variant instead of the current listing. If you have more than one variant, the experiment audience is split equally between them. For example, if you want 50% of your audience to see the experiment and you have two testing variants, 50% of your visitors will see the current store listing, 25% will see the B testing variant, and the other 25% will see the C testing variant.
- Minimum detectable effect (MDE). This is the smallest difference between a test variant and the current store listing that you want the test to detect. For example, if you have a conversion rate of 10% and you set the MDE to 20%, the test is sized to detect a change of 2 percentage points in either direction, that is, a conversion rate of 8% or lower, or 12% or higher (because 2% is 20% of your 10% conversion rate). Note that a smaller MDE requires a larger sample size to reach significance, and vice versa. And if you already have a high conversion rate, you don’t need as large a sample, whereas the smaller the conversion rate, the bigger the sample size you will need.
- Attributes. This refers to the element you want to test (icon, description, video, etc.). We suggest focusing on one attribute at a time to get clearer results.
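To make the audience split and MDE examples above concrete, here is a minimal Python sketch. The function names `audience_split` and `mde_band` are our own illustration and not part of Google Play Console. The sketch assumes, as in the example above, that the MDE is expressed relative to the current conversion rate.

```python
from __future__ import annotations


def audience_split(experiment_share: float, n_variants: int) -> dict[str, float]:
    """Return the share of store listing visitors that sees each listing version."""
    if not 0 < experiment_share <= 1:
        raise ValueError("experiment_share must be in the range (0, 1]")
    if n_variants < 1:
        raise ValueError("at least one testing variant is required")
    # The experiment audience is divided equally between the testing variants.
    per_variant = experiment_share / n_variants
    split = {"current": 1 - experiment_share}
    for i in range(n_variants):
        # Variants are labeled B, C, D, ... as in the example above.
        split[f"variant_{chr(ord('B') + i)}"] = per_variant
    return split


def mde_band(baseline_cr: float, relative_mde: float) -> tuple[float, float]:
    """Return the (lower, upper) conversion rates at the edge of the detectable effect."""
    delta = baseline_cr * relative_mde
    return baseline_cr - delta, baseline_cr + delta


print(audience_split(0.5, 2))
low, high = mde_band(0.10, 0.20)
print(f"Detectable at or below {low:.0%}, or at or above {high:.0%}")
```

With a 50% experiment audience and two variants, the split is 50% / 25% / 25%, and a 20% MDE on a 10% conversion rate gives the 8% and 12% boundaries from the example.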
Google Play allows you to edit the estimates to understand how long your experiments will last.
- Daily visits from new users - the estimated number of new users who visit your store listing each day; the fewer daily visits you get, the longer you will have to run the test.
- Conversion rate - your expectation about how many store listing visitors will be converted to first-time installers.
- Retained first-time installers - the estimations about users who install your app for the first time and keep it installed for at least one day
- Ordinary first-time installers - estimated users that install your app for the first time without considering the retention period.
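As a rough illustration of how these estimates relate to test duration, the sketch below divides the number of required first-time installers by the expected installers per day. This is a back-of-the-envelope assumption on our side, not the formula Google Play Console uses, and the function name `estimate_test_days` and the input numbers are hypothetical.

```python
import math


def estimate_test_days(
    required_installers: int,
    daily_visits: float,
    conversion_rate: float,
    min_days: int = 7,
) -> int:
    """Roughly estimate how many days a test needs to collect enough installers."""
    if daily_visits <= 0:
        raise ValueError("daily_visits must be positive")
    if not 0 < conversion_rate <= 1:
        raise ValueError("conversion_rate must be in the range (0, 1]")
    # Expected first-time installers per day across the whole listing.
    installers_per_day = daily_visits * conversion_rate
    days = math.ceil(required_installers / installers_per_day)
    # Never go below a full week, so that weekends are always covered.
    return max(min_days, days)


# Hypothetical inputs: 7,000 installers needed, 2,000 daily visits, 25% conversion rate.
print(estimate_test_days(7000, 2000, 0.25))
```

The `min_days` floor reflects the recommendation in this guide to run every test for at least seven days, even when the required installs would be reached sooner.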
Google Play updated the Store listing experiments in 2022 and brought a couple of new elements to have better testing results (which Apple already implemented with their Product Page Optimization feature):
- experiment parameter configuration
- sample size calculator and test duration
- confidence intervals that allow for continual monitoring
Now that you understand the main concepts, let’s move on to the preparation for your test.
Organizing before the test
We have already mentioned that A/B testing is essential to your app conversion rate optimization. As such, you need to approach it carefully - without a proper setup, you won’t get reliable results, the confidence levels might be too low, you might get false results, and as a result, you might choose to implement wrong decisions.
To avoid these outcomes, we recommend looking at each of the following aspects during the preparation.
Create an A/B testing plan
Thinking in advance about what, why and how you are going to test should always be the first step. Examine your current data and things that you want to improve and put everything down before starting a real test.
Always try to narrow down the test context as much as possible. That way, you will ensure that different results come from test variations rather than differences between the users. For example, don’t test too many changes (screenshots and short descriptions) simultaneously, and don’t run multiple tests for the same localization.
Number of testing attributes
Testing too many things at the same time can create confusion and the absence of a clear picture. It is hard to say which element contributed the most to improved performance. In short, don’t mix video, image and description changes.
Data quality and quantity
Test results can change and even reverse during the test period. What looks like a clear winner at first may become the worst test variant once the test has run for some time. If your testing variant receives a lot of traffic, you will reach a reliable confidence level sooner, but if it struggles to get enough traffic, leave the test running longer before applying the results.
Before starting an A/B test in Google Play (or any other platform), pay attention to existing paid campaigns. Keep your paid campaigns on the same level and similar budget; otherwise, you won’t know if your A/B test was successful.
Try to keep your tests from being interrupted or distorted by seasonal effects. If you run tests during a holiday season, you might see unusual uplifts in results that cannot be attributed to your experiment. Run each test for at least seven days to cover weekends and smooth out traffic anomalies.
Testing big vs. small changes
A well-known piece of advice for A/B testing is to test significant changes with each variation. In general, bigger changes produce larger differences between the current listing and the test variant, which makes the effect easier to detect. On the other hand, significant changes might be problematic for other channels. For example, you might see that an entirely new app icon gets more installs, but if you want to keep it, you need to align it with your brand standards, which may be harder to implement.
In short, significant changes should be tested, but make sure they make sense for your app.
How to create and run an A/B test step-by-step
Now is the time to create and run your A/B test using Google Play Console.
You first need to log in to your Google Play Console account, choose your app, and navigate to the “Store listing experiments” tab under the “Store presence” section.
You will come to the setup screen, where you can create an experiment or A/B test.
Let’s go through each step from start to finish.
Step 1 - Preparation and creation of the experiment
The first thing you need to do is to name your experiment. We suggest using a descriptive name that also lets you distinguish between the different experiments you will run. The test name is visible only to you and not to Play Store visitors, so you should be able to tell what the test was about just by looking at its name.
For instance, if you want to test an app icon for your German localization in Germany, you can use something like App icon_DE-de. The first part will tell you what you are testing, and the last will refer to the country and language used in your test.
The second thing is to choose the store listing type you want to test. If you don’t run Custom store listing pages, then your only option will be the Main store listing.
Quick reminder: Custom store listings are used to create a store listing for specific users in the countries you select, or to send users to a unique store listing URL. They are useful, for instance, if you run paid campaigns or want to target a specific language in a country with multiple official languages (like Switzerland, Canada, Israel, etc.).
The third step is to choose an experiment type. Here you also have two options - you can target your default language or select a localized experiment (you can have up to five localized experiments at the same time). Also, localized experiments allow you to test short and long descriptions, while default experiments don’t have this option. We highly recommend running localized experiments.
Once you are done with this, click next and proceed to the next step.
Step 2 - Set up the experiment goals
Now comes the part where you can fine-tune your experiment settings, something we already discussed in the previous part of this guide. You want to get this right because the setting you choose will influence the accuracy of your test and how many app installs you will need to reach your desired result.
Here is the exact list of things you need to know.
Target metric is used to determine the experiment result. You can choose between Retained first-time installers and First-time installers. Going with the first option is recommended because you generally want to target users that keep your app or game installed for at least one day.
Next, you choose the number of variants to test against the current store listing. Generally, testing a single variant will require less time to finish the test. Next to each option, Google Play Console shows how many installs you need.
It is up to you to choose how many variants you want to test, but we recommend starting with one until you get more comfortable with the tool.
The experiment audience setting is where you choose the percentage of store listing visitors that will see an experimental variant vs. your current listing. If you have more variants you test (e.g., A/B/C test), the testing audience will be split equally across all experimental variants. Each testing variant will get the same amount of traffic for your experiments.
Minimum detectable effect (MDE)
As mentioned before, you can choose the detectable value that Google Play will consider to evaluate whether the test was a success. You can select preset percentages from the drop-down menu and see the estimations from Google Play, that is, how many installs you will need to reach a certain MDE.
The confidence level is a newer option that Google Play introduced to Store listing experiments. You can choose between four confidence levels, which wasn’t possible before. The higher the confidence level, the more reliable your Store listing experiment results will be.
Also, higher confidence levels will decrease the probability of a false positive, but you will need more installs to reach those higher levels.
As a general rule of thumb, we suggest choosing a 95% confidence level, as this is an industry standard for testing in general.
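The relationship between MDE, confidence level, and required sample size can be illustrated with a standard two-proportion sample size approximation. Please treat this as a sketch under stated assumptions: Google Play does not publish the formula behind its own calculator, so its numbers will likely differ. The function name `visitors_needed_per_variant`, the 80% statistical power, and the example values are all our own assumptions, and the result is counted in store listing visitors rather than in first-time installers.

```python
import math
from statistics import NormalDist


def visitors_needed_per_variant(
    baseline_cr: float,
    relative_mde: float,
    confidence: float = 0.95,
    power: float = 0.80,
) -> int:
    """Approximate visitors needed per listing version (two-sided z-test)."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)
    if not 0 < p1 < 1 or not 0 < p2 < 1:
        raise ValueError("conversion rates must stay strictly between 0 and 1")
    # Critical values of the standard normal distribution.
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical example: 10% baseline conversion rate, different settings.
for mde in (0.10, 0.20):
    for confidence in (0.90, 0.95, 0.99):
        n = visitors_needed_per_variant(0.10, mde, confidence)
        print(f"MDE {mde:.0%}, confidence {confidence:.0%}: about {n} visitors per variant")
```

Running the loop shows the pattern described in this guide: a smaller MDE or a higher confidence level increases the required sample, while a higher baseline conversion rate reduces it.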
The end part of this step summarizes when your experiment is likely to be done in days and how many first-time installers you will need to complete the experiment.
You can edit the estimates by clicking on the “Edit estimates” button and if you are happy with it, proceed to the next step.
Step 3 - Variant configuration
Now you come to the part where you can choose which attribute you will test and what your test variant will look like.
As mentioned before, you can choose from six different elements, and app descriptions are available only if you have chosen to run a localized experiment.
The recommendation is to test one attribute at a time and to run only one attribute test for that specific localization.
Depending on the number of variants you chose to test in the previous step, you will have one or more testing variants that you can customize. Each test variant needs to have its name and the text or image you want to test against the current store listing.
For instance, if you want to test a short description, your test might look like this:
- Current store listing short description: “Share images and videos instantly with your friends.”
- Name of the testing variant: “Test A_short description.”
- Testing store listing short description: “Image sharing and easy video editing features in one place.”
Once you set up your variants and are happy with your current setting, click on “Start experiment,” and Google Play will soon make your experiments live.
A/B testing could also help with indexing new keywords. For instance, short and long descriptions influence keyword indexation. So just by testing new description variants with keywords that you don’t use with current store listings, you might be able to get indexed in a new set of keywords. Although this shouldn’t be a long-term tactic, you could get more visibility by doing A/B tests with app descriptions.
Measuring and analyzing your test results
Every test you create will be listed under the “Store listing experiments” tab. The first thing you need to do before running any analysis is to let Google Play run the data, usually for at least seven days, to avoid any weekend effects and to have enough data.
For each test you run, Google Play will show one of the following statuses or recommendations:
- “More data needed”
- A recommendation to apply a variant if it performed well
- A recommendation to leave the experiment running to collect more data
- A draw, in which case it is up to you to decide whether to apply the testing variant
- A recommendation to “Keep the current listing” if your current store listing performed better than the test variant
You will also get a list of metrics that you can follow during the experiment:
- Number of first-time installers
- Number of retained first-time installers
- Test performance that lies in a percentage range
- Current installs
- Scaled installs
Scaled installs are the number of installs during the experiment divided by the audience share. For example, if you have 1000 installs and a 50% audience split, scaled installs would be 1000/0.5 = 2000 installs.
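The scaled installs calculation above is simple enough to reproduce yourself, for example when you compare variants with different audience shares in a spreadsheet or script. The helper name `scaled_installs` below is our own and is not a Google Play API.

```python
def scaled_installs(installs: int, audience_share: float) -> float:
    """Scale a variant's installs up to what 100% of the audience would produce."""
    if not 0 < audience_share <= 1:
        raise ValueError("audience_share must be in the range (0, 1]")
    return installs / audience_share


# The example from the text: 1000 installs on a 50% audience share.
print(scaled_installs(1000, 0.5))  # 2000.0
```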
Analyzing the results with more insights
Google Play will show you the best-performing test versions, but there are some additional things that you should pay attention to.
Here are the five things that you need to consider when analyzing the results:
- For a start, you always have to think about seasonality. Google Play has intelligent algorithms, but you are the only one who can understand why a specific variant performs much better or worse than the current store listing.
- If you use more testing variants, they will receive traffic from different sources and keywords. If your keyword rankings change during the test, those changes might affect some variants more than others, which means that test results will be influenced by external factors that Google Play doesn’t show.
- Google Play testing can result in false positives. To check if this is the case, you can run a B/A test after to check if your B variant will perform the same against the A variant. But an even better way would be to run an A/B/B test. In that case, if both B variants perform the same, you can rely on the results. Still, if there is a large discrepancy between both B variants, the test probably has sampling issues, and you shouldn’t implement the recommendations.
- Always analyze the results carefully. Even if you don’t implement Google Play recommendations, you won’t lose much of your invested time. But if you implement a test result that didn’t have enough data or used poor data quality, you might harm your conversion rates.
- If you apply the testing results on your live store listing page, monitor your conversion rates and compare them with the performance before the implementation. Just because the testing variant performed better during the test period doesn’t mean that your KPIs will also improve. Annotate your test in your KPI report and watch how they perform.
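One way to judge whether the two B variants in an A/B/B test "perform the same" is a two-proportion z-test on their conversion rates. The sketch below is a generic statistical check, not something Google Play Console provides. It assumes you have visitor and install counts per variant, which the Console may not show directly, and the function name and example counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(
    installs_1: int, visitors_1: int, installs_2: int, visitors_2: int
) -> tuple:
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p1 = installs_1 / visitors_1
    p2 = installs_2 / visitors_2
    # Pooled conversion rate under the assumption that both variants are identical.
    pooled = (installs_1 + installs_2) / (visitors_1 + visitors_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_1 + 1 / visitors_2))
    if se == 0:
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical A/B/B check: both B variants show the same listing.
z, p = two_proportion_z_test(installs_1=100, visitors_1=1000, installs_2=150, visitors_2=1000)
if p < 0.05:
    print(f"B variants differ (z={z:.2f}, p={p:.4f}): possible sampling issue")
else:
    print(f"B variants are consistent (z={z:.2f}, p={p:.4f})")
```

A small p-value between two identical variants suggests that the difference comes from sampling or traffic effects rather than from the listing itself, so the test result should not be implemented as-is.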
Getting a negative test result isn’t necessarily a bad sign. If you notice that some elements perform poorly, you can eliminate them, and similar directions, from your store listing. A negative result also points you to other things you should test and encourages you to try different approaches that aim for a positive impact.
Pros and cons of Store listing experiments and mobile app A/B testing
Based on our experience, A/B testing in Google Play has pros and cons. Here is the list of good and not-so-good things about Store listing experiments.
Store listing experiment pros
Using Store listing experiments helps discover significant changes by testing new ideas and approaches that are different from your current app marketing process. The tool is free with a native setup, a powerful function that external A/B testing tools can’t offer.
External A/B testing tools are a great way to test more granular things that Store listing experiments can’t cover. However, they use a “sandbox environment” to attract the audience to a testing variant. You need to run paid campaigns and send clicks to dummy store listing pages to do that. Once the users come to those dummy pages, the A/B testing tools measure how users interact with them.
Furthermore, you can experiment with new trends and try out new features that can bring more life to your usual and perhaps boring store listing.
Since Store listing experiments are easy to set up and run, you can test your brainstorming and research ideas to find something new that benefits your store listing and that you can share with other departments you work with. E.g., If you test and realize that an entirely different screenshot design produces much better app installs, your colleagues in the design department can use this to improve their work and output.
Without the A/B testing tool, you wouldn’t dare to go for significant changes. You can test bold and small changes with Store listing experiments and get reliable results.
Store listing experiment cons
Some of the positive elements can also come with risks at the same time.
If you test big changes on a large portion of your audience, you could negatively impact your regular performance if the test variant performs much worse than the current store listing. That is why it makes sense to test significant changes with a smaller percentage of traffic first and then scale it up gradually to a bigger audience size.
Another drawback is that big and bold tests require preparation. If you want to test a completely new app icon, video, or app screenshots, you will have to dedicate some resources, even though the outcome is hard to predict.
If you test significant changes that are very different from the elements in your current store listing, you might have trouble understanding which part of the test variant had the most significant impact on your test performance.
Furthermore, regularly testing significant changes can take a lot of work. Not only will you need a lot of ideas, but it might be counterproductive to test completely different app variations one after another and with little time difference.
Finally, small incremental changes allow more straightforward results interpretation and scaling options (e.g., you test something in one localization and then repeat the same for other localizations). However, they might provide only minor improvements that fall within the test’s margin of error.
Store listing experiments limitations
Store listing experiments do come with some limitations. While we think that they are the best way to perform a test of a live store listing and that you should use them consistently, you need to be aware of their limitations:
- You can’t choose the traffic sources for your test - Google Play will use all traffic sources (search, browse, and referral) for testing.
- No additional metrics would show the monetization value of the users that were a part of your tests, such as revenue.
- If you plan to run multiple tests and test variants with different attributes, you won’t be able to tell the effect of each attribute.
- Finally, we would like to see how much people are engaged with your app after installing it, but it isn’t possible.
Best practices and things to remember
The general testing recommendation is to test one thing at a time. Still, if you test multiple changes together, you might get a more statistically significant outcome and a bigger performance improvement than if you had tested each element separately.
Generally speaking, we advise our clients to think about the following aspects when doing A/B tests:
Have an A/B testing plan
Think about the testing ideas in advance. Know that you can test different image headlines, splash screenshots, screenshot order, screenshot approach (e.g., emotional vs. fact-oriented), messages, etc.
Set up basic testing rules
If you are starting with A/B testing in Google Play, try to test one element and one hypothesis at a time. Also, run each test for at least one week before drawing conclusions.
Know why you want to track something
Track properly what you change and have a reason why a particular change should improve app performance.
Strong hypothesis before anything else
Have a strong hypothesis - this part matters the most. For instance, you may be using the same screenshot types for all localizations and want to adapt them to the local audience. So, in this case, a good hypothesis would be that localizing screenshots and messages will have at least a 5% increase in the conversion rate from store listing visitors to app installs.
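A hypothesis like this is easiest to evaluate when the threshold is written down before the test starts. The sketch below assumes that "at least a 5% increase" means a relative uplift in conversion rate. The function names and example rates are hypothetical and only illustrate the check.

```python
def relative_uplift(baseline_cr: float, variant_cr: float) -> float:
    """Return the relative change of the variant's conversion rate vs. the baseline."""
    if baseline_cr <= 0:
        raise ValueError("baseline_cr must be positive")
    return (variant_cr - baseline_cr) / baseline_cr


def supports_hypothesis(baseline_cr: float, variant_cr: float, min_uplift: float = 0.05) -> bool:
    """Check whether the observed uplift reaches the threshold set in the hypothesis."""
    return relative_uplift(baseline_cr, variant_cr) >= min_uplift


# Hypothetical result: the localized screenshots convert at 22% instead of 20%.
print(f"Uplift: {relative_uplift(0.20, 0.22):.1%}")
print("Hypothesis supported:", supports_hypothesis(0.20, 0.22))
```

Keep in mind that an observed uplift only supports the hypothesis if the test also reached the confidence level you chose in the experiment settings.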
Multiple variants testing options
If you test multiple elements - continue performing tests even after your original tests are done. You can do that with B/A tests or, as previously mentioned, A/B/B tests. This will help you assess the overall confidence that you got the correct results and help you with future tests.
Learn from bad performance tests
Negative tests should not be seen as a failure - take those as a learning opportunity to understand what your potential users don’t like.
Know the test parameters
When performing test analysis, always consider how many users were a part of the test. Check if the test duration was appropriate according to that number.
Big and small tests are fine
Have a good understanding of when you want to test big changes (e.g., with graphics) vs. small changes (e.g., messages).
More data equals more relevancy
Localizations with higher conversion rates will take less time to complete the test. In general, the larger the sample size and testing volume, the better.
Adapt the test duration
Run the tests long enough, but if you notice that test variants are performing much worse, abort the test so you don’t hurt your general conversion rate. This is especially important if your testing variant is shown to a large sample.
Different tests for apps in different stages
If your app is still in an early stage of its lifecycle, test different concepts by running A/B/C/D tests to find the winning combinations.
Be patient for the results
Finally, give your test enough time. Use the scaled installs metric if the install pattern remains stable.
We hope that you now understand how Store listing experiments work. A/B testing should be one of the ASO tactics you use most often.
For a start, Store listing experiments use actual Google Play store listing traffic, are free to use, and come with basic retention metrics, such as retained installers after one day. Because you can set confidence levels, detectable effects, and split test variants, and can easily apply winning combinations, Store listing experiments are pretty powerful and easy to use.
Although they come with some limitations (absence of engagement metrics, random sampling of traffic sources, and potential false positives), you should embrace this tool and use it as much as possible with your daily Google Play optimizations.
If you want to scale and get the most from your app A/B testing, get in touch with App Radar's agency and services team. We regularly conduct mobile A/B testing for the biggest brands and apps, and we can help you with pushing app installs and conversion rates in all app stores.