Degraded

Mastodon Degraded Performance

Jan 16 at 11:34pm MST
Affected services
mstdn.ca
Streaming API

Resolved
Jan 22 at 09:19pm MST

After a rather uneventful day, we are declaring this incident resolved.

Updated
Jan 22 at 03:42am MST

We were very successful tonight in getting the database under control.

It took a gargantuan effort, and we thank everyone in the Discord for all their help. This was a cross-instance effort, and we're so humbled by that.

@RichardNairn, @andrewdemarsh, @quaff, @ThaMunsta, and @controlc, among so many others, put their heads together with us.

As it turns out, we were missing a couple of indexes in the (now 364GB, previously 480GB) database, autovacuum wasn't cleaning up a few tables frequently enough, and our connection concurrency with the database was too high.
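
For the curious, here's a rough sketch of what that kind of diagnosis looks like. This is illustrative only, not our exact commands: the table name and autovacuum thresholds are placeholders, and the connection piece is tuned outside the database entirely, for example via Mastodon's DB_POOL setting or a pooler such as PgBouncer.

```python
# Illustrative sketch only (Python + psycopg2), not our exact commands.
import psycopg2

conn = psycopg2.connect("dbname=mastodon_production")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

# Tables with many sequential scans but few index scans are good candidates
# for a missing index.
cur.execute("""
    SELECT relname, seq_scan, idx_scan, n_live_tup
    FROM pg_stat_user_tables
    ORDER BY seq_scan DESC
    LIMIT 10;
""")
for table, seq_scans, idx_scans, live_rows in cur.fetchall():
    print(table, seq_scans, idx_scans, live_rows)
# Missing indexes are then added with CREATE INDEX CONCURRENTLY so the table
# stays usable while the index builds.

# Tighten per-table autovacuum so a busy table is cleaned up more often than
# the global default. "statuses" and the thresholds are placeholders.
cur.execute("""
    ALTER TABLE statuses SET (
        autovacuum_vacuum_scale_factor = 0.02,
        autovacuum_analyze_scale_factor = 0.01
    );
""")
```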

All of this came to a head last Friday, as if there was suddenly too much water for the waterfall to handle.

There are about 600,000 jobs left for the instance to process to catch up with the fediverse, so if your notifications or posts are two hours behind, that's why. That latency is decreasing over time.
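
For the technically inclined: that backlog lives in Sidekiq's Redis queues, so the easiest way to watch it shrink is to check the queue lengths. A minimal sketch, assuming a local Redis and Mastodon's usual queue names (adjust both to your setup):

```python
# Sidekiq keeps each queue as a Redis list named "queue:<name>", so the list
# length is the number of jobs still waiting to be processed.
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection details
for queue in ("default", "push", "pull", "ingress", "mailers"):
    pending = r.llen(f"queue:{queue}")
    print(queue, pending)
```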

We are hopeful that come the morning when our traffic comes back, we're still in great shape.

Updated
Jan 21 at 11:13pm MST

We've made great progress this evening, with web response times returning to normal. Third-party apps are also working again.

At this time, the instance has to catch up with what was missed while the database was locked up. It's estimated that we're about five hours behind. This catch-up work will continue through the night.

With hope, a final resolution update will be provided tomorrow.

Thank you, again, for your patience as we worked through this complex problem.

Updated
Jan 21 at 09:59pm MST

We are still working through this issue.

Earlier today, we discovered a stuck trigger in the database. Once removed, the database started accepting data again. However, there are a lot of jobs to catch up on, which the instance is doing now.
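
For those following along at home, this is roughly how such a trigger is found and dealt with. It's a sketch only: the actual trigger and table are specific to our database, and the names in the final (commented) statement are placeholders.

```python
# Sketch: list user-defined triggers and the tables they fire on.
import psycopg2

conn = psycopg2.connect("dbname=mastodon_production")  # placeholder DSN
cur = conn.cursor()
cur.execute("""
    SELECT tgname, relname, tgenabled
    FROM pg_trigger
    JOIN pg_class ON pg_class.oid = pg_trigger.tgrelid
    WHERE NOT tgisinternal;
""")
for trigger, table, enabled in cur.fetchall():
    print(trigger, table, enabled)

# Once identified, a misbehaving trigger can be switched off (placeholders):
# cur.execute("ALTER TABLE some_table DISABLE TRIGGER some_trigger;")
# conn.commit()
```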

We're continuing to look at ways to optimize the database.

Updated
Jan 21 at 02:33am MST

Thanks to all who joined the "all-hands" call this evening. We looked at logs and made some changes, and though we ran into a storage limit, the database is now recovering from that.

We're hopeful that once the database has recovered, we will see an improvement in response times.

Updated
Jan 20 at 01:57am MST

We attempted to cut over to the new database server and it didn't go as planned. We're looking at logs and will attempt again.

Updated
Jan 20 at 12:16am MST

The new database server is nearly ready: all data has been copied, but the server is still sorting and organizing that data (a PostgreSQL VACUUM).
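
In concrete terms, that cleanup step looks something like the sketch below (the connection string is a placeholder). VACUUM reclaims space and ANALYZE refreshes the planner's statistics after a large copy; VACUUM can't run inside a transaction, which is why autocommit is switched on first.

```python
# Sketch of the post-copy cleanup on the new server.
import psycopg2

conn = psycopg2.connect("dbname=mastodon_production")  # placeholder DSN
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()
cur.execute("VACUUM (ANALYZE);")  # reclaim space and refresh statistics
```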

We expect to test the switch over sometime overnight. Thank you all for your patience.

Updated
Jan 19 at 07:14am MST

All indications show that we have fully ingested the old server's data into the new one; however, we want to give the new server some time to stabilize.

We will attempt a cutover this evening at an unannounced time, so as to prevent a rush of traffic. The outcome of this operation will be posted here.

Note that this is a major operation and it's not being taken lightly.

Updated
Jan 19 at 06:59am MST

We're getting very close to resolving this - the new database server has received nearly all of the data from the old server.

We're giving it some breathing time to make sure that it has organized all this data and is ready for prime time.

If everything goes to plan, we will take a brief downtime this evening to switch over to the new database.

Thank you again for your patience.

Updated
Jan 19 at 02:43am MST

We're getting really close to what we hope is a resolution.

We've started a live stream to show where the transfer is: as long as the status panel in the top right shows more than 5Mb of transfer, we are still replicating. Once it flatlines, we will be ready to switch over to the new server.

https://www.youtube.com/live/SMpPe3wy1-k
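
If you'd rather read the numbers than watch the stream, the same progress can be pulled straight from the primary. A minimal sketch, assuming PostgreSQL streaming replication on version 10 or newer (connection details are placeholders):

```python
# How many bytes the new server still has to replay before it's caught up.
import psycopg2

conn = psycopg2.connect("dbname=mastodon_production")  # connect to the primary
cur = conn.cursor()
cur.execute("""
    SELECT application_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind
    FROM pg_stat_replication;
""")
for standby, bytes_behind in cur.fetchall():
    print(standby, bytes_behind)
```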

Thank you again for your patience.

Updated
Jan 19 at 01:41am MST

After the initial database import, the new database server is now working through making sure everything matches between the old server and the new one.

Once this is complete, we'll switch over, and the hope is that our latency issues will be resolved and we'll be snappier than ever.

The replication has been going for about three hours, and given that it has transferred 36GB so far, I feel we're about three hours away from it being done. So, with hope, we should be able to take a brief downtime, change configuration files, and switch over to the new database tomorrow afternoon.
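
(For the arithmetic-minded: 36GB over three hours is roughly 12GB per hour, and with the database at around 64-66GB per our earlier updates, that leaves under 30GB, or another two to three hours at the same rate.)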

Thanks all for your patience. I know this weekend has been difficult.

Updated
Jan 18 at 10:47pm MST

Initial ingest into the new database server is complete. We're now waiting for the new server to catch up with anything that needs to be replicated since our point-in-time data dump.

Updated
Jan 17 at 11:08pm MST

The initial ingest into the new database server is still underway. We're monitoring its progress closely and will share more information as it becomes available.

Updated
Jan 17 at 01:24pm MST

The initial database transfer is still being ingested by the new server.

We recognize that this is challenging for our community and thank you for your patience.

Updated
Jan 17 at 03:45am MST

We have set the gears in motion to move the PostgreSQL database away from DigitalOcean, which is where the latency originates.

This database is ~64GB and is currently being ingested into a new database server local to the rest of our resources.

This process may take some time, and we appreciate your patience.

Updated
Jan 17 at 02:57am MST

We've noted that latency has come down overnight; however, we are still working on putting a permanent fix in place. This requires processing over 66GB of data.

We will continue to update as more information arises. Thank you for your patience!

Created
Jan 16 at 11:34pm MST

We are aware of degraded performance when accessing the instance. We believe we have found the cause and are working to resolve it.

Thank you for your patience!