

Disney wouldn’t be doing what every other multinational corporation engaged in AI training does, which is scraping any and all datasets they can get access to, regardless of propriety, since arguably ALL data is useful.
There are actually very few ‘big’ model trainers, or at least trainers worth anything.
OpenAI, Anthropic, xAI, and Google (and formerly Meta) are the big names to investors. You have Mistral in the EU, LG in Korea, the ‘Chinese Dragons’ like Alibaba and DeepSeek, a few enterprise niches like Palantir, Cohere, or AI21, Perplexity and such for search, and…
That’s it, mostly?
The vast, vast majority of corporations don’t even finetune. They just use APIs of others and say they’re making ‘AI.’ And you do have a few niches pursuing, say, TTS or imagegen, but the training sets for that are much more specialized.
…And actually, a lot of research and ‘new’ LLMs are largely mixes of public datasets (so no need to scrape), synthetically generated data, outputs of other LLMs, and/or more specifically formatted material. Take this one, which uses 5.5T completely synthetic tokens:
https://old.reddit.com/r/LocalLLaMA/comments/1p20zry/gigachat3702ba36bpreview/
That, and rumor on the street is the Chinese govt provides the Chinese trainers with a lot of data (since their outputs/quirks are so suspiciously similar).
Hence, ‘scraping the internet’ is not actually the trend folks think it is. On the contrary, Meta seems to have refuted the ‘quantity over quality’ data approach with how hard their Llama 4 models flopped vs. how well DeepSeek did. It’s not very efficient, training models is generally not profitable, and it’s done less than you think.
Point I’m making, along with just dumping my thinking, is that Disney is a special case.
Their focus is narrow: they want to generate TikTok-style images/videos of their characters, and only their characters. Not code, not long chats, not spam articles, just that. They have no financial incentive to ‘scrape the entire internet’ beyond the excellent archives that already exist; the only temptation is the ‘quick and dirty’ solution of using Sora instead of properly building something themselves.




























I don’t get the analogy… of course a bunch of students using different tools with different inputs will yield different results? But if they use the same model and input at zero temperature, they will, in fact, get the same results, just like any code.
Predictability has never been a strength of ML, of course.
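To make the zero-temperature point concrete, here’s a minimal sketch. The ‘model’ is just a hypothetical stand-in lookup function, not any real LLM, but it captures the mechanism: at temperature zero, decoding collapses to taking the argmax over next-token scores at every step, so the same model plus the same input always walks the same path.

```python
# Toy illustration of why temperature-0 (greedy) decoding is deterministic.
# next_token_scores() is a hypothetical stand-in for a model forward pass:
# the scores depend only on the context seen so far.

def next_token_scores(context):
    # Deterministic pseudo-scores per candidate token (stable within a process).
    return {tok: hash((context, tok)) % 1000 for tok in ["a", "b", "c", "</s>"]}

def greedy_decode(prompt, max_steps=10):
    out = prompt
    for _ in range(max_steps):
        scores = next_token_scores(out)
        tok = max(scores, key=scores.get)  # temperature -> 0: pure argmax
        if tok == "</s>":
            break
        out += tok
    return out

run1 = greedy_decode("hello ")
run2 = greedy_decode("hello ")
print(run1 == run2)  # prints True: same model, same input, same output
```

(In practice, real deployments reintroduce nondeterminism through nonzero temperature, batching, and floating-point quirks on GPUs, but that’s a deployment choice, not something inherent to the math.)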
…That’s not really what it’s for. It’s for finding exotic stars in a mass of astronomical data on a budget, or interpolating pixels in an image, or for identifying cat videos reasonably well. That’s still a useful tool. And the modern extension of getting a glorified autocomplete engine to press some buttons automatically is no different if structured and constrained appropriately.
The obvious problem, among many I see, is that these Tech Bros are selling underbaked… no, not even half-cooked agentic systems as sapient magic lamps. Not as niche tools for very specific bits of automation. Just look at the language Suleyman is using: