Inconsistent pagination cursor results

Hi all, I've started looking into the API results and I'm curious why the pagination cursors are inconsistent and don't make much sense across API endpoints. Details below.

Hitting /games/top?first=100 returns a pagination cursor that makes perfect sense:

response cursor: eyJzIjoxMDAsImQiOmZhbHNlLCJ0Ijp0cnVlfQ==

which decodes to: {"s":100,"d":false,"t":true}

Perfect. The starting point is 100, which is expected.

Hitting /streams?first=100 returns a pagination cursor that makes no sense and seemingly changes:

example response cursor: eyJiIjp7IkN1cnNvciI6ImV5SnpJam94TVRFNE1UQXVOVFU1TVRNNU56azVNaklzSW1RaU9tWmhiSE5sTENKMElqcDBjblZsZlE9PSJ9LCJhIjp7IkN1cnNvciI6ImV5SnpJam96TmpNd0xqUTJNRFUyT0RFM09EWXhOU3dpWkNJNlptRnNjMlVzSW5RaU9uUnlkV1Y5In19

which decodes to: {"b":{"Cursor":"eyJzIjoxMTE4MTAuNTU5MTM5Nzk5MjIsImQiOmZhbHNlLCJ0Ijp0cnVlfQ=="},"a":{"Cursor":"eyJzIjozNjMwLjQ2MDU2ODE3ODYxNSwiZCI6ZmFsc2UsInQiOnRydWV9"}}

which further decodes to: {"b":{"Cursor":"{\"s\":111810.55913979922,\"d\":false,\"t\":true}"},"a":{"Cursor":"{\"s\":3630.460568178615,\"d\":false,\"t\":true}"}}
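For reference, this is roughly how I decoded them, a minimal Python sketch using nothing but base64 and JSON (the /streams cursor just has a second layer: each inner "Cursor" value is itself base64-encoded JSON):

```python
import base64
import json

def decode_cursor(cursor: str):
    """Base64-decode a cursor string and parse the JSON inside (re-adding any stripped padding)."""
    padded = cursor + "=" * (-len(cursor) % 4)
    return json.loads(base64.b64decode(padded))

# /games/top cursor -> {'s': 100, 'd': False, 't': True}
print(decode_cursor("eyJzIjoxMDAsImQiOmZhbHNlLCJ0Ijp0cnVlfQ=="))

# /streams: the outer cursor decodes to {"b": {"Cursor": ...}, "a": {"Cursor": ...}},
# and each inner "Cursor" value decodes again, e.g. the "a" one from above:
print(decode_cursor("eyJzIjozNjMwLjQ2MDU2ODE3ODYxNSwiZCI6ZmFsc2UsInQiOnRydWV9"))
# -> {'s': 3630.460568178615, 'd': False, 't': True}
```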

Any explanation? Why are they different and what is with the seemingly random starting point in the decoded result?

You don't need to know the specifics of the cursor schema, as you shouldn't be editing cursors yourself anyway; they're only relevant to the API server itself, not to third-party devs.

There’s no guarantee of consistency in cursors, and they can and will change at any time and without warning. The only thing that is consistent is how you should use the cursor strings provided to paginate through results.

Which makes sense… the question is why. Is this specifically done to prevent calling the API multiple times asynchronously for a top 200 or whatever? Each call relies on the previous call instead of doing actual offset pagination, and that's the only reason I can think of.

The reasons for this type of pagination haven’t been made public.

The correct way to paginate in Helix is to wait for a request to finish and use the cursor you received for the next request, going only as far through the results as you need. Parallel requests (such as multiple requests with different offsets, as in Kraken) are not supported in Helix.
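As a rough illustration, a minimal Python sketch of that sequential flow (the client ID, token, and page limit are placeholders, not anything official):

```python
import requests

# Placeholder credentials - substitute your own client ID and app access token.
HEADERS = {"Client-ID": "YOUR_CLIENT_ID", "Authorization": "Bearer YOUR_APP_ACCESS_TOKEN"}

def get_streams(max_pages=3):
    """Page through Helix Get Streams sequentially: each request waits for the
    previous response and reuses the cursor it returned."""
    streams, cursor = [], None
    for _ in range(max_pages):                      # only go as far as you actually need
        params = {"first": 100}
        if cursor:
            params["after"] = cursor
        resp = requests.get("https://api.twitch.tv/helix/streams",
                            headers=HEADERS, params=params)
        resp.raise_for_status()
        body = resp.json()
        streams.extend(body["data"])
        cursor = body.get("pagination", {}).get("cursor")
        if not cursor:                              # no cursor means there are no more pages
            break
    return streams
```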

Thanks for the quick replies. I was really hoping to be able to call asynchronously. Any idea whether this might change in the future, or has a statement been made that this is purposely avoided going forward?

Being able to make requests asynchronously has been requested by myself and others, so Twitch are aware of what we'd like to be able to do and the use cases for it, but every indication from them is that they have no plans to change away from their current design choices for pagination in Helix.

Okay, thanks for the information, I appreciate it!

Just have a process running an infinite loop that paginates with the cursor and stores the list of unique user IDs that are online. Then have a separate process that takes that list of IDs and obtains the live data for each channel by asynchronously querying the /streams endpoint, filtering by user_id in chunks of 100.
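Roughly like this for the second process (a sketch only, with placeholder headers; shown sequentially for brevity, but the chunks are independent so you can fire them concurrently if your quota allows):

```python
import requests

HEADERS = {"Client-ID": "YOUR_CLIENT_ID", "Authorization": "Bearer YOUR_APP_ACCESS_TOKEN"}

def fetch_live_data(user_ids):
    """Look up known channels on Helix Get Streams, 100 user_id filters per request
    (the per-request maximum); only channels that are currently live come back."""
    live = []
    for i in range(0, len(user_ids), 100):
        chunk = user_ids[i:i + 100]
        resp = requests.get(
            "https://api.twitch.tv/helix/streams",
            headers=HEADERS,
            params=[("user_id", uid) for uid in chunk],
        )
        resp.raise_for_status()
        live.extend(resp.json()["data"])
    return live
```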

Obviously you need to make sure you have the needed quota to perform the asynchronous calls with Helix. Alternatively you can do them with Kraken for now. You should be able to pull the live data for all the channels in just a few seconds.

Also, you can have multiple cursors running in parallel to detect a streamer that goes live as soon as possible.

Or if you are looking at specific streamers, use Webhooks instead!

Due to the way pagination works in Helix, your method will result in missing many streamers and in repeated duplicates in the results, making it inefficient and inaccurate.

Helix is not designed for paginating all the way through the streams endpoint, and as such the accuracy of results diminishes significantly after the first 5-10% if you're attempting to get results for every streamer on the platform.

Exactly!

What alternative do you suggest for someone who wants to capture the live data of all the live streamers on Twitch and update that data, let’s say, every minute?

Obviously this system is far from ideal, but with the tools at our disposal I think that’s the best approach. Happy to hear better alternatives though!

There's no accurate way to do that, and certainly not every minute. The time it takes to make a request, wait for a response, use the cursor for the next request, and repeat until completion adds up; there's no avoiding that. Because of this, even if you make a lightweight app that instantly sends the next request each time it gets a new cursor, it'll take around 2 to 5 minutes, depending on the time of day, to go through all the Helix Get Streams results.
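A rough back-of-the-envelope check, assuming for the sake of argument ~100,000 concurrent live streams and ~200 ms per round trip (both numbers vary a lot by time of day):

```python
live_streams = 100_000      # assumed concurrent live channels (varies by time of day)
per_page = 100              # Helix maximum page size
round_trip_s = 0.2          # assumed request/response latency per page

pages = live_streams / per_page              # ~1,000 sequential requests
minutes = pages * round_trip_s / 60
print(minutes)                               # ~3.3 minutes, within the 2-5 minute range above
```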

Because the time to page through all results is greater (by a significant margin) than the cache lifetime, after the first 5 to 10% of results the streams will be in the sub-100-viewer range, where an increase or decrease of a single viewer can move a stream up or down potentially dozens of pages in the results. Also, as you get further into the results, the chances grow that one of the streams you've already gone past goes offline, or that a new stream goes live with more viewers than the point in the results you're at, either of which shifts all the results considerably.

This is something I've brought up a lot with some of the Twitch devs, as I've done Twitch analytics for ~5 years now, and Helix does hamper some things, but that is by Twitch's design. It's also worth remembering that the main use case for needing to page through all streams is analytics or research, which the Twitch Developer Agreement requires you to get permission from Twitch for and to enter into a separate agreement over, so it's not as if many devs have permission to be collecting mass data anyway.

You're correct. That's why I suggested having multiple cursors scrolling at the same time, indefinitely. Honestly, the only issue will be that it might take a few minutes to detect some of the low-tier streamers.
Also, once you have identified them you'll be able to refresh their data every minute (or at whatever granularity you want) until they go offline, without any issues at all.

Yes, of course. How you utilize the data is a different story.

Which will also lead to missing out on streams. The way you’re attempting to paginate through Helix is not the intended way of doing things.

I’m still waiting to hear a better solution for someone who aims to, for whatever reason, get a snapshot of all the live streamers on Twitch at a given time.

I’m currently seeing less than 1% difference between the total number of streamers I’m capturing with my method and the total number that Twitch reports.

Obviously some streamers take a few minutes to be identified, and some others with viewership close to 0 might go offline without being identified at all, but I can’t think of a better solution. I’m here to learn though so I’m honestly eager to hear other workarounds.

The best solution is not to do what Helix wasn't designed for, i.e. scraping the entire streams list. Trying to find workarounds isn't the best way to go about things, especially when you're messing around with re-using cursors to do parallel requests, which can and will break in the future, and with the growth of the platform your method will progressively get worse as the number of concurrent streams (and thus pages of results) grows.

If you want accurate data (within reason, due to caching) then you should limit your requests to just the streamers who have opted-in to your service, and just poll their data no faster than once per minute.
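Something like this, as a sketch (`fetch_live_data` and the opted-in ID list are stand-ins for however you track your own users):

```python
import time

POLL_INTERVAL_S = 60        # no faster than once per minute

def poll_opted_in(opted_in_ids, fetch_live_data):
    """Poll only the channels that opted in to your service, at most once per minute."""
    while True:
        started = time.monotonic()
        live = fetch_live_data(opted_in_ids)    # e.g. the chunked Get Streams lookup above
        # ... store or process `live` here ...
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, POLL_INTERVAL_S - elapsed))
```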

Yeah, you’re probably right. Thanks for the tips!
