
In short, performance has improved but we still have some work ahead of us. Specifically, TestFlight should be much faster but TF SDK / TF Live improvements are still in progress.

Below is a list of completed improvements and upcoming items.

New Faster, Simpler SDK front ends

We are primarily a Python shop. Wanting to stick with what works and add some simplicity to our stack, we built a raw gevent-based server. These servers depend on a local redis instance that slaves off our redis cache master. The complete server role looks like stud -> nginx -> gevent -> local redis. TestFlight core updates the cache master, and redis takes care of synchronizing the front end caches. We are really pleased with this approach. It has decoupled the SDK from Web nicely, and it has given us the ability to burst to the cloud when necessary.
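To make the gevent -> local redis step concrete, here is a minimal sketch of a front end that answers reads from a local cache. It is not our production code: the key scheme (the URL path), the port numbers, and the names make_app and main are assumptions made up for illustration.

```python
import json


def make_app(cache):
    """Build a WSGI app that serves reads straight from `cache`.

    `cache` is anything with a get(key) method that returns bytes or None,
    for example a redis.Redis client pointed at the local replica.
    """
    def app(environ, start_response):
        # Assumption: the cache key is simply the request path.
        key = environ.get("PATH_INFO", "/").lstrip("/")
        value = cache.get(key) if key else None
        if value is None:
            body = json.dumps({"error": "not found"}).encode("utf-8")
            status = "404 Not Found"
        else:
            body = value
            status = "200 OK"
        start_response(status, [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app


def main():
    # Imported here so make_app can be exercised without gevent or redis installed.
    from gevent import monkey
    monkey.patch_all()  # make socket I/O (including the redis client) cooperative
    from gevent.pywsgi import WSGIServer
    import redis

    # Assumption: the local redis replica listens on the default port.
    local_cache = redis.Redis(host="127.0.0.1", port=6379)
    WSGIServer(("127.0.0.1", 8000), make_app(local_cache)).serve_forever()
```

Calling main() would start the server behind nginx. Because the app only ever reads from the local instance, a front end like this needs no connection to the core database, which is what makes it easy to add more of them.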

For developers interested in other details, here is a hit list of some of the less obvious things we ran into:

  • [Improvement] nginx did not handle our SSL traffic well; moving to Stud showed significant improvements.
  • [Improvement] One of our bare metal SDK boxes keeps up with 4 SDK VMs. Bare-metal ftw!
  • [Cool] “fab deploy.sdk:production,git tag” now pings a deploy server, which packages the git tag. Each SDK front end periodically checks in with the deploy server to get the latest package.
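The pull-based deploy in the last item can be sketched roughly as follows. This is an illustration only: poll_for_package, fetch_latest_tag, and install are hypothetical names, and how a front end actually talks to the deploy server is not shown here.

```python
import time


def poll_for_package(fetch_latest_tag, install, current_tag,
                     interval=30.0, max_checks=None, sleep=time.sleep):
    """Periodically ask the deploy server for the latest tag.

    fetch_latest_tag: callable returning the newest packaged tag (hypothetical).
    install: callable that downloads and activates a tag (hypothetical).
    max_checks: stop after this many checks; None means run forever.
    Returns the tag that is installed when polling stops.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        latest = fetch_latest_tag()
        # Only install when the deploy server reports something new.
        if latest and latest != current_tag:
            install(latest)
            current_tag = latest
        checks += 1
        if max_checks is None or checks < max_checks:
            sleep(interval)
    return current_tag
```

The appeal of pulling is that the deploy server does not need to know which front ends exist. A freshly started front end simply checks in and picks up the current package.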

Faster TF Core

TestFlight users are first and foremost distributing betas. Not only do we take pride in building tools developers love to use, we also strive to provide a seamless experience for your testers and stakeholders. In the last few weeks we have made strides in improving the reliability and performance of TestFlight core.

The addition of the new SDK front ends isolated SDK web traffic and database read traffic. This has had the biggest impact on the perceived performance of TF core. Here are a few of the other things we addressed:

  • [Fixed] It turns out that our continuous deploy process has been causing failed requests since we migrated our infrastructure. Formerly we had gunicorn operating on two ports behind nginx, so nginx masked this issue by kindly directing the request to the second port. We’ve resolved this.
  • [Improvement] Switched from meinheld to gevent as our gunicorn worker class. This improved performance by virtue of it having more consistent behavior and better logging so we could actually see the issues that come up.
  • [Fixed] Killed a few slow queries that brought us to our knees.
  • [Improvement] Changed our mysql replication strategy to reduce IO.
  • [Improvement] Decoupled our message queues for TF Core and TF SDK, isolating failures.
  • [Improvement] Decoupled data source dependent features (SDK Debugger, Activity Feed, TestFlight Live). With these decoupled, a failure in one stays isolated to that feature.
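As a rough illustration of why separate queues isolate failures, here is a small in-process sketch. It is not how our message queues are implemented: the Dispatcher class, the queue names, and the choice to bound each queue and drop messages when full are all assumptions made for the example.

```python
import queue


class Dispatcher:
    """Route messages to one bounded queue per subsystem (illustrative only)."""

    def __init__(self, maxsize=1000):
        # One independent queue per subsystem, so a backlog in one
        # cannot consume the capacity of the other.
        self.queues = {
            "core": queue.Queue(maxsize),
            "sdk": queue.Queue(maxsize),
        }
        self.dropped = {"core": 0, "sdk": 0}

    def publish(self, name, message):
        """Enqueue without blocking; return False if that queue is full."""
        try:
            self.queues[name].put_nowait(message)
            return True
        except queue.Full:
            # Assumption: shed load rather than let the backlog grow unbounded.
            self.dropped[name] += 1
            return False
```

In this sketch, a flood of SDK messages fills only the "sdk" queue, and publishing to "core" keeps working.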

No More Failed Uploads

Uploads fail for a variety of reasons. Mostly they fail because the uploaded IPA does not make it past our validation checks. But more recently they started to fail simply because it was taking way too long to upload and the client’s browser would give up, or because by the time the binary passed through nginx to our front ends, our front ends were overloaded dealing with other things.

The good news is that those problems are now behind us. So, failed uploads due to system reliability should no longer be an issue. If at any point you do have upload errors, please let us know.

Current Focus

With all of the effort and improvements on the SDK front ends, we have pushed the bottleneck back to the SDK workers and persistence. A new backend for both the data processing and persistence has been under way for the last couple of weeks and should get integrated this week. If you see some oddities with the site, please let us know. We are putting in the effort to make the transition seamless but very much appreciate the heads up in the event we’ve missed something.

Next Focus

Once we have the improvements to the SDK workers and persistence flying, we will turn our focus to TestFlight Live. We look forward to resolving the issues and appreciate your patience.

SDK Networking Bug!

There have been quite a few reports of the TestFlight SDK conflicting with AFNetworking (https://github.com/AFNetworking/AFNetworking/issues/307). If you are having this issue, please send a sample app that can replicate it to http://help.testflightapp.com. We’d like to resolve it immediately but so far have not been able to reproduce the issue.

The TestFlight Crew

P.S. We’re hiring! (https://testflightapp.com/jobs/)

When: Sunday, April 30 at 3PM PDT (Sunday, April 30 at 22:00:00 UTC)

Estimated downtime: 20 minutes

Continuing our efforts to bring you a faster, more stable TestFlight, we are migrating the primary TestFlight database to new bare metal hardware.

This system maintenance should result in less than 20 minutes of downtime. The procedure is to lock the current database to prevent writes, swap the host records and DNS, and rejoice.

This beefier setup will also give us additional capacity to begin the next round of improvements that we are making to the overall system architecture and infrastructure.

Thanks,

TestFlight Crew

We usually love Mondays, but today was an exception. We apologize for the downtime. It was unexpected. There was maintenance scheduled for today at 12 AM PST. This should have resulted in minimal service interruption. Unfortunately it turned into a 6-hour recovery of our primary database.

In short, Pacemaker SIGKILLed MySQL, which caused massive corruption (fortunately no data loss). Due to the extended length of downtime, the SDK message queues backed up to the point where we OOMed (resulting in SDK data loss). The team worked through the night to correct the issues. It’s become our priority to move towards architecture and infrastructure changes that will prevent this from happening again.

We have been heads down trying to improve the service but need to make sure we communicate with you asap when something goes wrong. We should have done a better job with this today as well.  

Thank you for continuing to help TestFlight grow. We can’t apologize enough for the issues of late; we are working as fast as we can to resolve them.

Today TestFlight moves to its new home. The new home is a bare metal infrastructure which brings you a faster, more scalable TestFlight. In a follow up post we will discuss some of the technical pieces and the decisions made; for now we just wanted to tell you guys that we are moving, and why.

Ghost Hunting

Cloud hosting has some fantastic benefits. We were specifically attached to the cost and the speed at which we could bring up new instances. What we were not attached to was the inconsistent performance and lack of predictability.

Tracking down bottlenecks in the infrastructure felt like hunting ghosts. There were times when the system locked up hard and the only explanation we walked away with was resource contention. We already consumed the majority of the shared resources available, so there was nowhere to go but down.

Hello Anchor

While trying to figure out an ideal load balancing situation for TestFlight, I stumbled on this article: http://www.anchor.com.au/blog/2009/10/load-balancing-at-github-why-ldirectord/. Turns out the Anchor team helped scale GitHub. After a few conversations with Anchor, it was clear their team was a phenomenal fit for TestFlight.

We would love to take credit for everything involved with this migration, but the reality is that Anchor brought a wealth of experience to the table. They took our stack, analyzed it, and came up with a plan of attack which involved some fantastic technologies only well-bearded individuals should touch.

The New Infrastructure

Once things have settled down we will post a follow up with some more technical details. This new infrastructure will not only help us scale as we grow, but it should provide immediate performance improvements. We hope that everyone is as excited as we are.

The TestFlight crew is growing! If you are interested in working with us, please apply!