In short, performance has improved but we still have some work ahead of us. Specifically, TestFlight should be much faster but TF SDK / TF Live improvements are still in progress.
Below is a list of completed improvements and upcoming items.
New Faster, Simpler SDK front ends
We are primarily a Python shop. Wanting to stick with what works and add some simplicity to our stack, we built a raw gevent-based server. These servers depend on a local Redis instance that replicates from our Redis cache master. The complete server role looks like stud -> nginx -> gevent -> local redis. TestFlight core updates the cache master and Redis takes care of synchronizing the front end caches. We are really pleased with this approach. It has decoupled the SDK from Web nicely and it has given us the ability to burst to the cloud when necessary.
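For the curious, here is a rough sketch of what the gevent -> local redis part of such a front end could look like. It is not our production code: the `sdk:` key scheme, the addresses, and the names `make_app` and `main` are made up for illustration. The only library calls assumed are `gevent.pywsgi.WSGIServer` and `redis.Redis(...).get`.

```python
import json


def make_app(cache):
    """Build a WSGI app that answers reads from `cache` only.

    `cache` is anything with a get(key) method returning bytes or None,
    for example a redis.Redis client pointed at the local replica.
    """

    def app(environ, start_response):
        # Hypothetical key scheme: the request path maps to a cache key.
        key = "sdk:" + environ.get("PATH_INFO", "/").strip("/")
        value = cache.get(key)
        if value is None:
            body = json.dumps({"error": "not found"}).encode("utf-8")
            status = "404 Not Found"
        else:
            body = value
            status = "200 OK"
        start_response(status, [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return app


def main():
    # Third-party imports live here so make_app stays importable
    # without gevent or redis installed.
    from gevent.pywsgi import WSGIServer
    import redis

    # Assumed addresses: a local Redis replica and a loopback listener
    # that nginx would proxy to.
    local_replica = redis.Redis(host="127.0.0.1", port=6379)
    WSGIServer(("127.0.0.1", 8000), make_app(local_replica)).serve_forever()
```

Calling `main()` starts the server. Because the app only ever reads from the local replica, a front end like this can be cloned onto new hosts without touching the core database.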
For developers interested in other details here is a hit list of some of the less obvious things we ran into:
- [Improvement] nginx did not handle our SSL traffic well; moving to Stud showed significant improvements.
- [Improvement] One of our bare-metal SDK boxes keeps up with 4 SDK VMs. Bare metal ftw!
- [Cool] “fab deploy.sdk:production,git tag” now pings a deploy server, which packages the git tag. Each SDK front end periodically checks in with the deploy server to get the latest package.
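The pull-based deploy in the last item can be sketched roughly as follows. This is a simplified illustration, not our deploy tooling: `check_in`, `poll_forever`, the two callables, and the 30-second interval are all hypothetical, and how the latest tag is fetched and installed is left to the caller.

```python
import time


def check_in(fetch_latest_tag, install_package, current_tag):
    """One check-in: install the latest package if it differs from ours.

    fetch_latest_tag() returns the tag the deploy server last packaged
    (or None if nothing is available); install_package(tag) installs it.
    Returns the tag this front end is running afterwards.
    """
    latest = fetch_latest_tag()
    if latest and latest != current_tag:
        install_package(latest)
        return latest
    return current_tag


def poll_forever(fetch_latest_tag, install_package, current_tag=None, interval=30.0):
    """Check in with the deploy server every `interval` seconds."""
    while True:
        current_tag = check_in(fetch_latest_tag, install_package, current_tag)
        time.sleep(interval)
```

The appeal of pulling rather than pushing is that the deploy server never needs to know which front ends exist, which matters when boxes come and go.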
Faster TF Core
TestFlight users are first and foremost distributing betas. Not only do we take pride in building tools developers love to use, we also strive to provide a seamless experience for your testers and stakeholders. In the last few weeks we have made strides in improving the reliability and performance of TestFlight core.
The addition of the new SDK front ends isolated SDK web traffic and database read traffic. This has had the biggest impact on the perceived performance of TF core. Here are a few of the other things we addressed:
- [Fixed] It turns out that, since we migrated our infrastructure, our continuous deploy process had been causing failed requests. Formerly we had gunicorn operating on two ports behind nginx, so nginx masked this issue by kindly directing the request to the second port. We’ve resolved this.
- [Improvement] Switched from meinheld to gevent as our gunicorn worker class. This improved performance by virtue of gevent's more consistent behavior and better logging, so we could actually see the issues that came up.
- [Fixed] Killed a few slow queries that brought us to our knees.
- [Improvement] Changed our mysql replication strategy to reduce IO.
- [Improvement] Decoupled the message queues for TF Core and TF SDK, isolating failures.
- [Improvement] Decoupled data source dependent features (SDK Debugger, Activity Feed, TestFlight Live). Decoupling these introduces some isolated failures.
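For anyone wanting to try the gevent worker class mentioned in the list above, a gunicorn config file is plain Python and might look something like this. The values are illustrative guesses, not our actual settings, and the gevent package has to be installed for the worker class to load.

```python
# gunicorn.conf.py: illustrative values, not TestFlight's actual settings.
import multiprocessing

bind = "127.0.0.1:8000"    # assumed address that nginx proxies to
worker_class = "gevent"     # use gevent workers instead of the meinheld worker
workers = multiprocessing.cpu_count() * 2 + 1  # starting point suggested in the gunicorn docs
accesslog = "-"             # "-" sends the access log to stdout
errorlog = "-"              # "-" sends the error log to stderr
loglevel = "info"
```

A file like this is passed to gunicorn with the `-c` flag, for example `gunicorn -c gunicorn.conf.py myapp.wsgi:application`, where the module path is a placeholder.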
No More Failed Uploads
Uploads fail for a variety of reasons. Mostly they fail because the uploaded IPA does not make it past our validation checks. But more recently they started to fail simply because the upload was taking way too long and the client's browser would give up, or because by the time the binary passed through nginx to our front ends, our front ends were overloaded dealing with other things.
The good news is that those problems are now behind us. So, failed uploads due to system reliability should no longer be an issue. If at any point you do have upload errors, please let us know.
With all of the effort and improvements on the SDK front ends we have pushed the bottleneck back to the SDK workers and persistence. A new backend for both the data processing and persistence has been under way for the last couple of weeks, and should get integrated this week. If you see some oddities with the site please let us know. We are putting in the effort to make the transition seamless but very much appreciate the heads up in the event we’ve missed something.
Once we have the improvements to the SDK workers and persistence flying we will turn our focus to TestFlight Live. We look forward to resolving the issues and appreciate your patience.
SDK Networking Bug!
There have been quite a few reports of the TestFlight SDK conflicting with AFNetworking (https://github.com/AFNetworking/AFNetworking/issues/307). If you are having this issue please send a sample app that reproduces it to http://help.testflightapp.com. We’d like to resolve it immediately but so far have not been able to reproduce the issue.
The TestFlight Crew
P.S. We’re hiring! (https://testflightapp.com/jobs/)