
[Guest post] How to scale LiveOps development without losing control

By Aliaksei Yahorau, Software Engineer at Meta.


The early stages of development in LiveOps are organized simply: one team works on the project, comes up with new features, makes changes, releases a new version, and analyzes the result. When problems appear, they are easy to find, and changes in metrics are easy to connect with specific updates. The development cycle is short and there are few changes, so everything can be managed without complex infrastructure.

When the project grows and several teams start working on the live product at the same time, coordination becomes difficult. One team makes changes to the core of the game, another works on the meta-system, a third on the economy, a fourth on the interface. All these updates can affect the same user path and the same product indicators.

At this point it becomes important to develop new features quickly while keeping control over the whole process. When several teams release changes at the same time, problems appear:

  • It is difficult to determine which feature affected the metrics. If several changes land in the product at once, it is hard to say precisely what caused the shift in indicators such as retention, monetization, or session length.
  • The risk of breakage increases. The more teams work with the same parts of the system, the higher the probability of conflicts, unexpected side effects, and errors in production (the live version of the application).
  • Fixes become slower. In mobile development even a small issue may require building a new version of the application, additional testing, and publishing through the stores. While all this is happening, users keep facing the problem.

At moments like this, many teams realize that dedicated infrastructure is required for scaling. A project may grow in the number of people, features, and directions, but the way of working should not break: ideally, adding teams should increase development speed. In practice the opposite happens: the product becomes riskier and harder to manage.

Why scale makes LiveOps harder

As LiveOps development scales, difficulties appear because the number of teams, and of changes affecting the same user path and product indicators, keeps growing. The core problem lies not only in the increase of code volume but in the growing complexity of controlling processes and observing changes in the product.

When the team grows quickly and many parallel, metric-affecting changes appear, teams typically respond by adding extra quality assurance (QA) checks, lengthening approval cycles, and making releases more cautious. This does reduce the risk of serious errors, but it slows down the overall development process. As a result, the team works more slowly despite the increase in headcount.

In addition, fixing errors in production requires a quick reaction because of the high cost of problems in a live product. Malfunctions can lead to significant losses for the business or a worsening of user experience. If there is no possibility to quickly disable a problematic feature, a full update has to be released. In LiveOps, time is a critically important factor.

To solve these problems, a specific piece of infrastructure becomes important: feature flags. They separate code delivery from feature exposure to users, which makes it possible to react more flexibly to problems and streamline the development process.

What feature flag infrastructure actually is

Feature flag infrastructure is a system for managing application functionality through switch-like flags that allow turning individual features on and off without releasing a new version.

In the context of LiveOps, a feature may already be integrated into the build, but not active for all users. The code is delivered into the application, but access to the new feature is regulated through configuration on the server or with the help of a remote config system. Product behavior is determined not only by the code that is loaded into the client application, but also by external settings.
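As a minimal sketch of this idea (the flag names, defaults, and the fake fetch function are all hypothetical, not any particular remote config SDK), the client can ship with the feature's code included but gate its visibility behind a remotely fetched flag:

```python
# Minimal feature-flag gate sketch; names and values are illustrative.

# Defaults shipped with the build: the safe fallback when the server
# cannot be reached.
DEFAULT_FLAGS = {"new_shop_ui": False}

def fetch_remote_config() -> dict:
    # In a real client this would be a call to a remote config service;
    # hardcoded here to keep the sketch self-contained.
    return {"new_shop_ui": True}

def is_enabled(flag: str, remote: dict) -> bool:
    # Prefer the server value; fall back to the shipped default so a
    # missing key never changes behavior unexpectedly.
    return remote.get(flag, DEFAULT_FLAGS.get(flag, False))

remote = fetch_remote_config()
if is_enabled("new_shop_ui", remote):
    print("render new shop UI")
else:
    print("render old shop UI")
```

The key property is that the `if` branch is decided by external configuration, so flipping the server-side value changes product behavior without touching the shipped code.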

Feature flags allow:

  • turning individual features on and off without the need to release a new version;
  • showing different versions of features to different groups of users;
  • isolating experiments from each other;
  • reacting quickly to problems, for example disabling problematic features without an urgent application update.

Thus, feature flags are both a tool for A/B tests and part of the system of product management in real time. How this system is arranged determines the safety and effectiveness of making changes in a live environment. The concrete implementation may differ, but the main principle is the following: the product should be arranged in such a way that features can be easily enabled, tested, scaled, or disabled in a controlled mode.
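One common way to show different versions to different user groups, and to keep experiments isolated from each other, is deterministic bucketing: hashing the user id together with the experiment name. A sketch (all names are hypothetical):

```python
import hashlib

def bucket(user_id: str, experiment: str, num_buckets: int = 100) -> int:
    # Hashing the experiment name together with the user id keeps
    # experiments independent: the same user lands in unrelated buckets
    # for different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def variant(user_id: str, experiment: str) -> str:
    # 50/50 split between control and test groups.
    return "test" if bucket(user_id, experiment) < 50 else "control"
```

Because the assignment is a pure function of the id and experiment name, the same user always sees the same variant across sessions, without any server-side assignment state.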

How feature flags change the development cycle

With the introduction of feature flags, the development process becomes shorter:

  • The team develops a feature behind a flag (that is, the feature ships in the build but is not visible to all users).
  • It rolls the feature out to a limited share of the audience.
  • It first checks for technical problems, and then looks at the metrics.
  • If the feature causes failures, it can simply be disabled.
  • If it negatively affects important indicators, it is likewise disabled or sent for rework.
  • With a positive result, the rollout (gradual introduction of the feature) is expanded.
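The limited-share rollout and the kill switch in the cycle above can be sketched with the same hash-based idea: a per-feature percentage that lives in server-side config (function names and the bucketing scheme are assumptions for illustration):

```python
import hashlib

def rollout_bucket(user_id: str, feature: str) -> int:
    # Stable per-user bucket in [0, 100) for this feature.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_in_rollout(user_id: str, feature: str, percent: int) -> bool:
    # `percent` comes from server config: raising it from 5 to 20 only
    # adds users (buckets are stable, so nobody loses the feature), and
    # setting it to 0 acts as an instant kill switch.
    return rollout_bucket(user_id, feature) < percent
```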

It is especially important that unsuccessful experiments stop being a serious problem. A bad idea no longer leads to a painful return to the old state of the system (rollback), but simply means that the feature does not need to be scaled to the whole audience.

Thus, speed and control over the development process stop contradicting each other. The team can work faster not because it became less careful, but because the system reduces the price of a mistake. It becomes easier to test hypotheses, isolate risks, and make decisions based on measurable results rather than guesses.

Why this changes the way teams are organized

When infrastructure for running controlled experiments appears, this also affects the organization of team work.

In large teams management often remains centralized: a product lead or creative director formulates the roadmap, approves details, resolves conflicts, and approves experiments. At a small scale such a model works well, but over time it becomes inconvenient: more and more people begin to depend on one person.

Teams in such a system are often responsible only for task execution. They may implement a feature well and deliver it to release on time, but that does not necessarily improve the product.

LiveOps gradually pushes toward another model of work organization, one in which teams focus on measurable results. For this, small cross-functional groups (pod teams) are formed that can take an idea through the whole path, from hypothesis to live experiment, without a long chain of approvals.

In such a model a pod gets responsibility for a certain metric or product area and freedom in choosing methods for improving these indicators. The team formulates a hypothesis, implements the feature behind a flag, rolls it out to a limited audience segment, analyzes the result, and decides whether the solution should be scaled further.

Feature flags make this model practical, allowing small teams to launch controlled experiments without risk for the stability of the whole product.

What LiveOps looks like with pods and feature flags

In the LiveOps model using pod teams and feature flags, the work process looks as follows:

  1. Defining a metric or problem area. The pod starts work with analyzing a certain metric or problematic area of the product.
  2. Formulating a hypothesis. The team formulates an assumption about what may improve the needed indicator.
  3. Building the change behind a feature flag. The change is implemented behind a feature flag and prepared for testing on a limited audience.
  4. Rolling out the change. The built change is rolled out to a limited audience.
  5. Analyzing technical problems. First the team checks whether there are technical failures after the introduction of the changes.
  6. Evaluating the product effect. Then the impact of the change on product metrics is analyzed. If the feature causes serious failures or negatively affects important indicators, it is disabled or sent for rework.
  7. Scaling the solution. With a positive effect, rollout (gradual introduction of the feature) is expanded.
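Steps 5 to 7 of the workflow above amount to a simple decision rule. A sketch with hypothetical thresholds (the 1% crash threshold and the doubling schedule are illustrative choices, not a prescription):

```python
def experiment_decision(crash_rate: float, metric_lift: float,
                        current_percent: int) -> tuple:
    # Step 5: technical check first; disable on serious failures.
    if crash_rate > 0.01:  # hypothetical 1% crash threshold
        return ("disable", 0)
    # Step 6: product effect; negative results go back for rework.
    if metric_lift < 0:
        return ("rework", 0)
    # Step 7: positive effect; expand the rollout gradually.
    return ("expand", min(100, current_percent * 2))

print(experiment_decision(0.001, 0.03, 10))  # → ('expand', 20)
```

The point of the ordering is that technical health gates the product evaluation: a feature that crashes clients is disabled regardless of how its metrics look.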

Such an approach allows the product to depend less on rare and risky releases. Instead of this, developers make smaller and more controlled decisions, which do not break the general rhythm of work and allow quick fixes when necessary. This contributes to increasing the flexibility and adaptability of development.

The use of feature flags makes it possible to separate the process of code delivery and showing the feature to the audience, test hypotheses in controlled conditions, and react quickly to problems. As a result, the work becomes more focused on real product effect, and not simply on completing tasks from a list.
