PSA: Chat servers are going to migrate to AWS EC2 servers

We’ve spent a considerable amount of time over the last half year preparing to migrate our chat system to AWS EC2 servers. We’re finally ready to start the transition (and identify any misconfigurations / new scaling concerns in the new environment).

As this will affect third-party developers significantly, I want to

a) make you aware
b) ask for suggestions to make this transition less painful for you

Our current rollout plans are as follows:

  • Migrate channels one-by-one on a whitelist basis, starting with partners / staff who opt in to the risks of using the new cluster

  • As our confidence increases that things are working correctly, slowly add more Twitch partners to the whitelist

  • Once all Twitch partners are on the whitelist, transition all channels to use the new servers

Currently, to determine which cluster and servers to connect to, you can use http://tmi.twitch.tv/servers?channel=brildum. You’ll notice that it returns that my channel’s cluster == “aws” and that the servers list no longer uses IPs, but DNS names: “irc.chat.twitch.tv” (for raw tcp) and “irc-ws.chat.twitch.tv” (for websockets).
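For bot authors, a minimal Python sketch of that lookup might look like the following. The JSON field names (`cluster`, `servers`), the "host:port" entry format, the fallback port, and the sample payload are all assumptions made for illustration, not a documented schema. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

SERVERS_ENDPOINT = "http://tmi.twitch.tv/servers"  # endpoint from the post above
DEFAULT_PORT = 6667  # assumption: conventional IRC port when none is given


def servers_url(channel):
    """Build the lookup URL for one channel."""
    return SERVERS_ENDPOINT + "?" + urlencode({"channel": channel})


def parse_servers_response(body):
    """Return (cluster, [(host, port), ...]) from a JSON response body.

    Assumes a JSON object with "cluster" and "servers" keys, where each
    server is either "host" or "host:port". This shape is an assumption.
    """
    data = json.loads(body)
    servers = []
    for entry in data.get("servers", []):
        host, sep, port = entry.rpartition(":")
        if not sep:  # no port in the entry, fall back to the default
            host, port = entry, DEFAULT_PORT
        servers.append((host, int(port)))
    return data.get("cluster"), servers


# Hypothetical payload, for illustration only.
sample = '{"cluster": "aws", "servers": ["irc.chat.twitch.tv:6667"]}'
cluster, servers = parse_servers_response(sample)
```

Fetching the body itself (for example with `urllib.request.urlopen(servers_url("brildum"))`) is left out so the sketch stays runnable offline.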

I recognize that for many of you, querying the servers list API on initialization of your bot may be a burden (particularly for very large bots). We could consider returning an invalid cluster message in response to JOIN if you’ve joined a channel on the wrong cluster.

Another thing you may have noticed is that I did not mention anything about event chat. Our hope is that moving to EC2 will allow us to maintain a single chat cluster for all channels. This is a hypothesis we’re planning to test before we migrate the very largest channels to the new servers, so hopefully that works out.

Other items of interest:

Does this affect group chat as well?

This does not affect group chat, I will edit my initial post to make that clear. Thanks.

What’s the time frame for this to start occurring?

There are a handful (~10) of staff accounts which are on the new cluster this weekend. We’re hoping to compile a list of issues, resolve them, and start reaching out to partners early next week to see if any are willing to opt-in to test the new servers.

I suspect this week it will happen fairly slowly, probably fewer than 50 channels in total and probably not all at once. The following week we’ll probably move significantly faster – but this all depends on what issues we find and how difficult they are to resolve.

If you move a channel, will that channel get a “chat is restarting” packet like normal when you guys restart chat?

That currently won’t happen – but one of my goals for this thread is to identify any changes we should make to help bots deal with this transition more easily.

FWIW, assuming our script works, we shouldn’t be transitioning a channel’s chat cluster while they are actively streaming, only while they are offline.

The people opting in will then simply not be able to use the bot. Having things on separate clusters just creates a ton of issues, especially when you have to query it on a per-channel basis, for hundreds of thousands of channels. And that’s not even taking into account the amount of code that must be in place to manage two different clusters. (It’s a ton.)

More importantly, what’s the timeframe here? How long will this be in place? And are you saying the partners will not be able to opt-out?

The goal will be to get to a single cluster as quickly as possible. We don’t want to distribute the load to the new servers too quickly, or we risk the stability of chat. I’d guess it’d take 2 weeks to migrate all channels.

At some point every channel is moving to these new servers (you can’t opt-out), though at that point we hope there is only a single cluster.

If we try to connect to the IP like we do now will we get anything in the response that indicates they’ve been moved or just a failed login?

You should refer to the link that @brildum posted to check which server you should connect to for your target channel(s). I imagine connecting to the new IP just results in no chat for the channels you join.

Yah, just seems like a waste to make an extra call right now when the majority of the channels aren’t being moved yet.

shrug my bots do the call at boot and pick one of the available servers… So makes no real difference to me (that said mine are generally single channel kids)

Yah, not really a big deal. Just hoping it all goes quickly.

That could possibly work, as polling every X minutes for each channel whether or not they’re on the AWS cluster is pretty much impossible.

Will a channel that is put on the AWS cluster ever go back to the old cluster though? If so, that’d be problematic.

Maybe a compromise? An IRC command that tells you the optimal IP to connect to?

Connect to a regular server

USER > SERVER: INQUIRE #channel (or w/e you want the command to be)
SERVER > USER: IP|DNS
APPLICATION: (Checks if IP/DNS is correct, if not switches connection.)

Here is a proposal:

:tmi.twitch.tv SERVERCHANGE #some_channel

is sent under 2 conditions:

  1. when a channel’s cluster is changed (this allows already-connected bots to transition)
  2. in response to a JOIN for a channel on the wrong cluster

In both cases, clients simply need to go through the reconnect flow:

  1. Fetch servers list from API
  2. Connect / Auth / Join
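
On the client side, the proposal above could be handled roughly as in this Python sketch. It assumes the message arrives exactly in the form shown (`:tmi.twitch.tv SERVERCHANGE #some_channel`). The `fetch_servers` and `reconnect` hooks are hypothetical callables that a bot would supply, not part of any existing library.

```python
def parse_serverchange(line):
    """Return the channel named in a SERVERCHANGE line, or None.

    Expects the proposed form ":tmi.twitch.tv SERVERCHANGE #some_channel".
    """
    parts = line.strip().split()
    if len(parts) >= 3 and parts[0].startswith(":") and parts[1] == "SERVERCHANGE":
        return parts[2]
    return None


def handle_line(line, fetch_servers, reconnect):
    """Run the reconnect flow when the server announces a cluster change.

    fetch_servers(name) -> list of servers   (step 1: servers list API)
    reconnect(channel, servers)              (step 2: connect / auth / join)
    Both callables are hypothetical hooks supplied by the bot.
    Returns True if the line was a SERVERCHANGE and the flow ran.
    """
    channel = parse_serverchange(line)
    if channel is None:
        return False
    servers = fetch_servers(channel.lstrip("#"))
    reconnect(channel, servers)
    return True


# Usage with stub hooks that only record what would happen.
calls = []
handled = handle_line(
    ":tmi.twitch.tv SERVERCHANGE #some_channel",
    fetch_servers=lambda name: ["irc.chat.twitch.tv"],
    reconnect=lambda channel, servers: calls.append((channel, servers)),
)
```

Because both trigger conditions send the same message, a bot needs only this one code path for already-connected channels and for a JOIN on the wrong cluster.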

Sounds perfect to me.

What is this, JTVIRC all over again? remembers the horror that was JTVIRC server REDIRECTs

Joking aside, as long as this is just temporary I would be fine with logic like this. Just include in the SERVERCHANGE command the cluster it’s swapping to (and maybe change the command to CLUSTERCHANGE, because I sure do hope we’re not going back to channels being on individual servers [a JTVIRC horror story in the making]). Ideally I could have all clusters open such that I could swap at whim between them without needing to fetch server lists per channel when given a cluster change command.
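
To illustrate the "all clusters open" idea, here is a rough Python sketch of a router that remembers which cluster each channel is on. It assumes a hypothetical message form `:tmi.twitch.tv CLUSTERCHANGE #channel <cluster>`, which is only the variant suggested in this post and not something the servers are known to send. The class name and the "main" default cluster name are placeholders.

```python
class ClusterRouter:
    """Track which cluster each channel lives on, one connection per cluster."""

    def __init__(self, default_cluster):
        self.default_cluster = default_cluster
        self.channel_cluster = {}  # channel -> cluster name

    def cluster_for(self, channel):
        """Cluster a channel is currently assigned to."""
        return self.channel_cluster.get(channel, self.default_cluster)

    def handle(self, line):
        """Apply a hypothetical ':tmi.twitch.tv CLUSTERCHANGE #channel cluster'.

        Returns (channel, old_cluster, new_cluster) so the caller can PART on
        the old connection and JOIN on the new one, or None for other lines.
        """
        parts = line.strip().split()
        if len(parts) != 4 or parts[1] != "CLUSTERCHANGE":
            return None
        channel, new_cluster = parts[2], parts[3]
        old_cluster = self.cluster_for(channel)
        self.channel_cluster[channel] = new_cluster
        return channel, old_cluster, new_cluster


router = ClusterRouter(default_cluster="main")  # "main" is a placeholder name
move = router.handle(":tmi.twitch.tv CLUSTERCHANGE #some_channel aws")
```

With the cluster name in the message, the bot can switch connections without a per-channel servers list lookup.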

Regarding mass move when testing is complete, please do so in small batches such that JOIN/PART IP rate limits are not a problem. Thanks.
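
On the bot side, rejoins can also be spread out so a burst of moved channels does not trip the rate limits. The Python sketch below is one way to do that. The batch size and pause are left as parameters because the actual limits are not stated in this thread, and `send_join` is a hypothetical hook that writes a JOIN for one channel.

```python
import time


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def join_in_batches(channels, send_join, batch_size, pause_seconds, sleep=time.sleep):
    """Send JOINs in batches, pausing between batches (not before the first)."""
    for index, batch in enumerate(batches(channels, batch_size)):
        if index:
            sleep(pause_seconds)
        for channel in batch:
            send_join(channel)
```

Passing `sleep` as a parameter keeps the function easy to exercise without real delays.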

Sounds good.

Personally I’d rather have an on/off-move for the majority of users, instead of slowly moving everyone over to what is a temporary system.