Machine Learning, Artificial Intelligence, or whatever new term TechCrunch decides to circulate this week is something you should be highly skeptical of. It's not because it isn't real, or that it isn't highly valuable to some companies' successes. You should be skeptical of it because, as with any buzzword or hype, most of the claims are bullshit. The proof is in the pudding. It always has been.
Now separating the wheat from the chaff has always been a difficult task with technology-based startups. After all, most startups are pitching what they will become, not what they are. It's not as simple as sending your product to an independent reviewer and letting nature take its course. What makes Machine Learning particularly insidious, however, is that it's generally not any good upon delivery. In fact, it's part of the story that it gets better over time as more people interact with it. That's why the valuations for these startups are so high. If they somehow get some energy into the flywheel, all else held constant, the product will get more valuable over time.
So given this, how can you tell if a company actually understands how to build with algorithms? How can you tell if they're blowing smoke? First, let's consider how algorithms are fundamentally developed and used in products.
At the end of the day almost every algorithm can fit into the following framework:
- You have some model of the world.
- You get the world to interact with it.
- The world produces data about whether the model is achieving its goal or not.
- You learn from that data and produce a better model.
That's it. That's all machine learning is.
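That loop can be sketched in a few lines of Python. This is a toy illustration, not any particular company's system: the "world" here is an invented stand-in that rewards predictions close to a hidden target, and the model is a single number.

```python
import random

random.seed(0)
HIDDEN_TARGET = 7.0  # the "truth" the model is trying to learn (invented for illustration)

def interact_with_world(prediction):
    """The world produces data: feedback on how wrong the model was, plus noise."""
    return (HIDDEN_TARGET - prediction) + random.gauss(0, 0.1)

model = 0.0          # start with a bad model of the world
learning_rate = 0.5

for turn in range(50):                       # each iteration is one turn of the wheel
    feedback = interact_with_world(model)    # the world interacts and yields data
    model += learning_rate * feedback        # learn, producing a better model

print(round(model, 1))  # lands near HIDDEN_TARGET
```

Every real system is vastly more complicated, but the shape is the same: interact, collect feedback, update, repeat.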
Now given this framework, there are only three ways to improve the quality of the algorithm being produced:
- 💰Get more, or higher quality data
- 🏃‍♀️Iterate on your model more quickly
- 👨‍🎓Use more sophisticated math
If you're evaluating a company, or starting your own with data as your competitive advantage, you should be asking yourself: does this company have a strategic advantage in any of these areas?
More data is pretty obvious. If your product generates data 10x faster than your competitors', you can learn more quickly than they can, and so your product will be used more. In all, you'll accelerate ahead of the pack. This is why data-based companies tend to be winner-take-all. Until they don't, but that's for another post.
The more overlooked aspect, however, is the quality of the data. Some people call it the signal-to-noise ratio, but the underlying idea behind data quality comes down to what happens in the learning step of the framework above. Algorithms learn from examples of when they were wrong and when they were right. More can be learned from a strong positive example than from a weak one.
So, let's say you're a retail company: someone buying a product is a stronger signal than someone just viewing its product detail page. Similarly, if someone got something for free and returned it, that's a much stronger negative signal than if they didn't buy it in the first place.
If your product collects data quicker, or collects higher signal data, you're in a good spot.
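One common way to act on this is to weight training examples by signal strength. A minimal sketch, where the event names and weights are illustrative assumptions, not real values from any retailer:

```python
# Hypothetical event weights: stronger actions carry more signal.
SIGNAL_WEIGHTS = {
    "purchased": 3.0,            # strong positive: the customer paid for it
    "viewed": 0.5,               # weak positive: mild interest at best
    "not_purchased": -0.5,       # weak negative: maybe they just never saw it
    "returned_free_item": -3.0,  # strong negative: got it free and still rejected it
}

def item_score(events):
    """Aggregate a relevance score for an item from a user's events."""
    return sum(SIGNAL_WEIGHTS[e] for e in events)

print(item_score(["viewed", "purchased"]))           # 3.5
print(item_score(["viewed", "returned_free_item"]))  # -2.5
```

The same idea carries into model training proper, where high-signal events get larger sample weights than ambiguous ones.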
For context, my colleague at Stitch Fix, Chris Moody, frequently poses this thought experiment about Stitch Fix data:
Imagine if Spotify had to generate personalized playlists for you when you had only listened to 25 songs a year.
Our customers provide us with very high signal data. We don't need to collect a lot of it as a result.
Another question to ask yourself when you're meditating on this topic:
Can Amazon recreate my dataset purely because they're at a bigger scale?
For instance, if you're starting a news recommendation startup and you're only using click-through rates from your news aggregator, you're going to have a tough time winning. The New York Times gets 1000x your traffic and collects the same data.
This point basically comes down to how smart your data scientists are. Do you have proprietary methods or research?
For the most part, the math behind algorithms and data science departments is relatively open. The reason for this is that it doesn't really matter if you have the latest and greatest methods if you don't have the data to train on. Therefore, Facebook can release papers about how they rank the news feed because even with the most sophisticated math, no one can beat them on volume of data.
This is a great aspect of the industry. For the most part, research is open.
Given that, most companies would be categorized as doing applied machine learning. Are they able to take the latest research, collect the appropriate data, and implement it on their unique dataset with positive results? This requires smart people. So when analyzing these companies you must ask: do they have the talent to keep up with and apply the latest research?
Ultimately, the quality of the product is a function of how much you're able to learn on each turn of that wheel. If you can test more models in succession, you can compound your knowledge more quickly. This is why DeepMind can beat the best players at Go and why OpenAI can beat the best players at Dota 2. For those problems, reality is a simulated environment, so the system can self-play and speed up time. AlphaGo Zero played 4.9 million games against itself (turns of the wheel) within 3 days to learn how to become superhuman. Self-driving cars, by contrast, generally release new models once a month.
I realize this is an apples to oranges comparison, but that's kind of the point. Are you an apple or an orange?
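To make the simulation advantage concrete, here's a toy example. Because the environment is pure code, the agent can take a hundred thousand turns of the wheel in seconds, with no real-world data collection. The payoffs and exploration rate are invented for illustration; this is a bare-bones epsilon-greedy learner, not anyone's production system.

```python
import random

random.seed(0)
TRUE_PAYOFFS = [0.3, 0.5, 0.8]  # hidden win rates of three actions (assumed)

counts = [0, 0, 0]
wins = [0, 0, 0]

def estimate(a):
    """Estimated win rate of action a; optimistic (1.0) if never tried."""
    return wins[a] / counts[a] if counts[a] else 1.0

for episode in range(100_000):     # simulation makes each turn nearly free
    if random.random() < 0.1:      # explore 10% of the time
        action = random.randrange(3)
    else:                          # otherwise exploit the best estimate so far
        action = max(range(3), key=estimate)
    counts[action] += 1
    wins[action] += random.random() < TRUE_PAYOFFS[action]

best = max(range(3), key=estimate)
print(best)  # the agent discovers the highest-payoff action
```

A robot or a car learning the same thing has to pay real time and real money for every single episode, which is exactly why the wheel turns so much more slowly there.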
This is the fundamental nature of algorithms. Each algorithmic problem requires a unique strategic advantage in at least one of these three domains. One thing is certain: if a company claims its core competency is AI and it doesn't have a strategic advantage in one of these three domains, they're lying to you, or they just haven't lost yet.