January 30, 2019

Offline Synthetic Testing: A Quick and Safe Method for Improving Search Results

When tuning website search, many people either fix individual queries or run A/B tests.
However, if you want strong evidence that a change improves your search before it goes live, consider offline synthetic testing. It lets you iterate on search algorithms rapidly and removes the risk of losing conversions, because you test changes safely offline.

Every eCommerce business wants to have a great search function, but the road to it is tricky. It requires great observational skills, good knowledge of search analytics and a systematic approach. Most people, however, just follow their gut feeling, which is not the ideal way to go.

The Power of Offline Synthetic Testing

One of the best methods we’ve found for improving search results is called offline synthetic testing. It is offline because you don’t need live users for testing, and synthetic because you use only previously measured data as your baseline; live users may behave a bit differently. The advantages of this method are that you risk no negative impact on your conversion rate, you get results quickly, and you only need search logs (or more precisely: queries, results and user interactions with those results).

The method works like this: you take users’ past searches and run them again with a new ranking algorithm. Because you know which results your search returned in the past and which of those results were clicked on or converted, you can compare them with the results from the new ranking algorithm.
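The replay step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Luigi’s Box implementation: the LoggedSearch record, the replay function and the stand-in new_search ranking are hypothetical names, and the log entry is made-up data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class LoggedSearch:
    """One historical search taken from the search logs (hypothetical schema)."""
    query: str
    results: List[str]  # product IDs in the order the old search returned them
    clicked: str        # product ID the user clicked on or converted on


def position_of(item: str, ranking: List[str]) -> Optional[int]:
    """Return the 1-based position of item in ranking, or None if it is missing."""
    return ranking.index(item) + 1 if item in ranking else None


def replay(logs: List[LoggedSearch],
           new_search: Callable[[str], List[str]]) -> List[Dict[str, object]]:
    """Re-run every logged query with the new ranking and record position changes."""
    rows = []
    for entry in logs:
        new_results = new_search(entry.query)
        rows.append({
            "query": entry.query,
            "clicked": entry.clicked,
            "position_from": position_of(entry.clicked, entry.results),
            "position_to": position_of(entry.clicked, new_results),
        })
    return rows


# Stand-in for the new ranking algorithm (assumption: it returns product IDs).
def new_search(query: str) -> List[str]:
    return ["C", "B", "A"]


# Made-up log entry: the user clicked product C, which the old search ranked third.
logs = [LoggedSearch(query="red shoes", results=["A", "B", "C"], clicked="C")]
report = replay(logs, new_search)
print(report)
```

Here the clicked product moves from position 3 to position 1, which is the kind of per-query change the rest of the analysis builds on.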

Analyzing Results from Offline Synthetic Testing

Imagine that for query x, the old search returned product X at position 1, and that product X was the highest-converting item for this query. With the new algorithm, the search for query x returns product X at position 10. Can you guess which search was better? The right answer is the original search.

That was easy! Now let’s take a harder example: for query x, the old search returns product Y as the top result. Some people click on it, so it seems like a relevant match, but the real best result, product X, is not among the results at all. Your new search fixes this by ranking product X first and product Y second. Product Y dropping one position may look like a regression but, in reality, it is not, because the results for query x are better overall.

As you can see, in many cases deciding which ranking is better isn’t that simple, and you need more sophisticated models to get reliable and actionable results. There are various quantitative measures of ranking quality, such as Discounted Cumulative Gain (DCG), Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank, as well as measures of ranking correctness (Precision, Mean Average Precision, etc.).

These metrics are often computed only over the top-n results, because those are the results users are most likely to see. There are even more robust and complex models based on implicit feedback that are better suited to modeling search quality. Each of these metrics has different characteristics and models a different aspect of search quality, but in the end, whichever you choose, you will have a quantitative measure of your search that you can use to compare different ranking algorithms.
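To make the metrics concrete, here is a minimal sketch of NDCG@k and reciprocal rank applied to the two examples above. The relevance grades (3 for product X, 1 for product Y) and the filler product IDs are assumptions made for illustration, and the gain is used linearly; some NDCG variants use 2^relevance - 1 instead.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted Cumulative Gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering.

    The ideal ordering is built from every judged product, so a ranking that
    omits a relevant product scores below 1.0.
    """
    gains = [relevance.get(item, 0.0) for item in ranking]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranking: List[str], target: str) -> float:
    """1 / position of the target product, or 0.0 if it is not returned."""
    return 1.0 / (ranking.index(target) + 1) if target in ranking else 0.0


fillers = [f"P{i}" for i in range(1, 10)]  # nine made-up irrelevant products

# Easy example: product X moves from position 1 to position 10.
easy_old = ndcg_at_k(["X"] + fillers, {"X": 1.0})
easy_new = ndcg_at_k(fillers + ["X"], {"X": 1.0})
print(round(easy_old, 3), round(easy_new, 3))  # 1.0 0.289

# Harder example: the old search misses X; the new one ranks X first, Y second.
grades = {"X": 3.0, "Y": 1.0}  # assumed relevance grades
hard_old = ndcg_at_k(["Y"] + fillers, grades)
hard_new = ndcg_at_k(["X", "Y"] + fillers[:8], grades)
print(hard_old < hard_new)  # True

print(reciprocal_rank(fillers + ["X"], "X"))  # 0.1
```

With these assumed grades, the metric agrees with the intuition in both cases: the first change is a regression, while the second is an improvement even though product Y drops one position.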

Beyond Measuring Search Quality Change

Offline synthetic testing lets you do more than just measure the change in search quality: it helps you understand why performance changed. If done right, you can get rich reports and aggregations showing how individual historical queries perform under the new ranking. You can then start asking questions and see the exact queries where the new ranking helped, or where it made the results worse.

In our experience, going through a few of the queries where search performance was most severely impacted can help you identify patterns where the new ranking is underperforming. You can then correct your assumptions, update the ranking algorithm and re-run the offline test.
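A report like this can be approximated by comparing a per-query metric (for example NDCG@10) between the two runs. The sketch below is an assumption about how such an aggregation might look; the compare_runs function and the scores are invented for illustration.

```python
from typing import Dict, List, Tuple


def compare_runs(old_scores: Dict[str, float],
                 new_scores: Dict[str, float],
                 worst_n: int = 3) -> Dict[str, object]:
    """Summarise per-query metric changes between the old and the new ranking."""
    deltas: List[Tuple[str, float]] = [
        (query, new_scores[query] - old_scores[query])
        for query in old_scores
        if query in new_scores
    ]
    improved = sum(1 for _, delta in deltas if delta > 0)
    deteriorated = sum(1 for _, delta in deltas if delta < 0)
    # Most severely hurt queries first, so they can be inspected by hand.
    worst_first = sorted(deltas, key=lambda pair: pair[1])
    return {
        "improved": improved,
        "deteriorated": deteriorated,
        "unchanged": len(deltas) - improved - deteriorated,
        "worst": worst_first[:worst_n],
    }


# Made-up per-query scores for the old and the new ranking.
old_scores = {"red shoes": 0.4, "iphone case": 0.9, "usb cable": 0.5, "desk lamp": 0.7}
new_scores = {"red shoes": 1.0, "iphone case": 0.3, "usb cable": 0.5, "desk lamp": 0.8}

summary = compare_runs(old_scores, new_scores)
print(summary["improved"], summary["deteriorated"], summary["unchanged"])
print(summary["worst"][0][0])  # the query that lost the most
```

Sorting by the size of the drop puts the queries worth reading first, which is usually enough to spot a shared pattern before the next iteration.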

Summary output of a single test run. In this case the ranking has improved by 43%. Below the summary, you can see a performance breakdown (in buckets of 25% granularity) and the number of queries that improved or deteriorated. Notice that even though this change resulted in an overall improvement, there are queries where the ranking is now worse.
Example output from an offline synthetic test. You can see the user’s query, the result that was clicked in that particular search and how its position changed in the new ranking (position_from – position_to). You can also see the computed metric – NDCG@10 (Normalized Discounted Cumulative Gain computed over the top 10 results).

The Best Way Is to Combine Methods

The good thing about offline synthetic testing is that it reliably indicates the direction of a change. If an offline synthetic test shows that your new search is much worse than your current search, then it will generally be worse for live users too. If it shows that your new search is much better than your current search, then it will generally be better. But by how much?

That’s something this method cannot tell you, since it doesn’t capture real-world user behavior. To find out, you need to run a live A/B test. So, next time you want to fix something:

  1. Find a query that needs improvement.
  2. Update the ranking algorithm.
  3. Run offline synthetic tests until you are sure that there’s an improvement.
  4. Run a live A/B test to confirm.
  5. Rinse and repeat.

This is the way we do it at Luigi’s Box and it’s given us the know-how you’ve grown to trust. The good news is we’ve integrated offline synthetic testing into Luigi’s Box, so if you want to try it out, you don’t have to develop it yourself.

To find out more about this feature, feel free to contact our sales representatives.