Well, without giving away the secret sauce, I can kinda outline what I'm doing (I have a SaaS built around the concept, and I'm kinda proud of it). Basically, Twitter's Streaming API gives you a random sample of the minute-to-minute tweets that amounts to approximately 1% of all tweets. This will net you ~3.4 mil (+/- 0.2 mil) tweets daily if you decide to aggregate them.
The official estimate for daily tweets is ~500 mil, but I've found that number to be slightly inflated, and the actual number varies greatly. Anyway, I scaled up the operation to net ~5% of all tweets (with plans to hit 10% soon). Those ~16 mil daily tweets don't even account for my favorite part of the operation: historical data! That's the stuff that's actually worth money. I'm pulling 3x as much historical data as streaming data.
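If you want to sanity-check those numbers, here's a rough back-of-the-envelope sketch. The inputs are the figures quoted above; the function names are just made up for illustration, and I'm assuming the 3x historical figure is measured against the 5% streaming volume.

```python
# Back-of-the-envelope check of the tweet volumes quoted above.
# All inputs are the rough figures from this post, not measurements.

SAMPLE_RATE = 0.01         # streaming sample is ~1% of all tweets
SAMPLED_PER_DAY = 3.4e6    # ~3.4 mil tweets/day from the 1% sample
SAMPLED_ERROR = 0.2e6      # +/- 0.2 mil
OFFICIAL_TOTAL = 500e6     # official daily estimate, ~500 mil


def implied_total(sampled_per_day, sample_rate):
    """Total daily tweets implied by a sample of known rate."""
    return sampled_per_day / sample_rate


def expected_volume(total_per_day, sample_rate):
    """Daily tweets you would collect at a given sample rate."""
    return total_per_day * sample_rate


total = implied_total(SAMPLED_PER_DAY, SAMPLE_RATE)
low = implied_total(SAMPLED_PER_DAY - SAMPLED_ERROR, SAMPLE_RATE)
high = implied_total(SAMPLED_PER_DAY + SAMPLED_ERROR, SAMPLE_RATE)

print(f"implied total:  {total / 1e6:.0f} mil/day "
      f"(range {low / 1e6:.0f} to {high / 1e6:.0f} mil)")
print(f"official total: {OFFICIAL_TOTAL / 1e6:.0f} mil/day")

# Volume at the 5% and 10% collection rates mentioned above.
for rate in (0.05, 0.10):
    print(f"{rate:.0%} of implied total: "
          f"{expected_volume(total, rate) / 1e6:.0f} mil/day")

# Assumption: "3x historical" is relative to the 5% streaming volume.
historical = 3 * expected_volume(total, 0.05)
print(f"historical at 3x: {historical / 1e6:.0f} mil/day")
```

The low end of the +/- 0.2 mil range is where the ~16 mil figure for 5% comes from (3.2 mil x 5); the midpoint gives about 17 mil.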
There are tons of tricks and workarounds you have to put in place to collect that much data without pissing Twitter off, so I wouldn't suggest going after it unless you have plenty of patience and programming chops. On the plus side, there's enough room in this industry for many more like me to undercut the big guys.
Last thing I'll say: you'll be amazed at what you can do with this data, too. There are strong correlations between frequency of hashtags/keywords and crime, death tolls, cryptocurrency prices, stocks, and who knows what else.
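To give a feel for what "correlation" means here, this is a minimal sketch of a Pearson correlation between a daily hashtag count and some other daily series. It is not my actual pipeline; the `pearson` helper is written out by hand for illustration, and the numbers are made-up placeholders, not real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Placeholder values invented purely for illustration (NOT real data):
# daily mentions of some hashtag, and a daily price for the same days.
hashtag_counts = [120, 150, 90, 200, 170]
prices = [10.0, 10.8, 9.5, 12.1, 11.2]

print(f"r = {pearson(hashtag_counts, prices):.2f}")
```

In practice you'd also want to shift one series by a day or more and recompute, since the interesting question is usually whether the tweets move first.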
I feel like I just wrote a novel. I get excited about data :)
Hope this clarifies.