With a cutoff of five, I would be choosing a random option for about one in every 20 decisions I made with my algorithm. I picked five as the cutoff because it seemed like a reasonable frequency for occasional randomness. For go-getters, there are further optimization processes for deciding what cutoff to use, or even changing the cutoff value as learning continues. Your best bet is often to try some values and see which is the most effective. Reinforcement learning algorithms sometimes take random actions because they otherwise rely entirely on past experience. Always selecting the predicted best option could mean missing out on a better choice that’s never been tried before.
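For readers who think in code, the rule I was following can be sketched roughly as below. This is only an illustration, not a standard library routine: it assumes the RNG rolls a whole number from 1 to 100, and the function name `choose`, the `past_scores` dictionary of remembered outcomes, and the example scores are all invented for the sketch.

```python
import random


def choose(options, past_scores, cutoff=5, rng=random):
    """Pick an option: explore at random on a low roll, otherwise exploit.

    options     -- list of possible actions (hypothetical labels)
    past_scores -- dict mapping an option to how well it has gone before
                   (an assumed stand-in for hazy memories)
    cutoff      -- rolls at or below this value trigger a random choice
    """
    roll = rng.randint(1, 100)  # assumed 1-100 roll, like the phone RNG
    if roll <= cutoff:
        # Exploration: ignore experience and pick anything.
        return rng.choice(options), "explore"
    # Exploitation: take whatever has worked best so far.
    # Options with no history default to a score of 0.0.
    best = max(options, key=lambda o: past_scores.get(o, 0.0))
    return best, "exploit"


if __name__ == "__main__":
    memories = {"sleep in": 0.8, "get up on time": 0.6}  # made-up scores
    action, mode = choose(list(memories), memories)
    print(action, mode)
```

Raising `cutoff` makes the random branch fire more often; lowering it makes the chooser lean harder on what it already believes.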
I doubted that this algorithm would truly improve my life. But the optimization framework, backed up by mathematical proofs, peer-reviewed papers, and billions in Silicon Valley revenues, made so much sense to me. How, exactly, would it fall apart in practice?
The first decision? Whether to get up at 8:30 like I’d planned. I turned my alarm off, opened the RNG, and held my breath as it spun and spit out … a 9!
Now the big question: In the past, has sleeping in or getting up on time produced more preferable results for me? My intuition screamed that I should skip any reasoning and just sleep in, but for the sake of fairness, I tried to ignore it and tally up my hazy memories of morning snoozes. The joy of staying in bed was greater than that of an unhurried weekend morning, I decided, as long as I didn’t miss anything important.
I had a group project meeting in the morning and some machine learning reading to finish before it started (“Bayesian Deep Learning via Subnetwork Inference,” anyone?), so I couldn’t sleep for long. The RNG instructed me to decide based on previous experience whether to skip the meeting; I opted to attend. To decide whether to do my reading, I rolled again and got a 5, meaning I would choose randomly between doing the reading and skipping it.
It was such a small decision, but I was surprisingly nervous as I prepared to roll another random number on my phone. If I got a 50 or lower, I would skip the reading to honor the “exploration” component of the decision-making algorithm, but I didn’t really want to. Apparently, shirking your reading is only fun when you do it on purpose.
I pressed the GENERATE button.
65. I would read after all.
I wrote out a list of options for how to spend the swath of free time I now faced. I could walk to a distant café I’d been wanting to try, call home, start some schoolwork, look at PhD programs to apply to, go down an irrelevant internet rabbit hole, or take a nap. A high number came out of the RNG—I would need to make a data-driven decision about what to do.
This was the day’s first decision more complicated than yes or no, and the moment I began puzzling over how “preferable” each option was, it became clear that I had no way to make an accurate estimation. When an AI agent following an algorithm like mine makes decisions, computer scientists have already told it what qualifies as “preferable.” They translate what the agent experiences into a reward score, such as “time survived in a video game” or “money earned on the stock market,” which the AI then tries to maximize. Reward functions can be tricky to define, though. An intelligent cleaning robot is a classic example. If you instruct the robot to simply maximize the number of pieces of trash thrown away, it could learn to knock over the trash can and put the same trash away again to increase its score.
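The cleaning robot's loophole is easy to see in a toy sketch. Everything here is hypothetical: the event names `throw_away` and `spill`, both scoring functions, and the two made-up logs of robot behavior. Real reward functions are far more involved, but the trap is the same.

```python
def naive_reward(events):
    """One point for every piece of trash thrown away, nothing else."""
    return sum(1 for e in events if e == "throw_away")


def net_reward(events):
    """Count the net change instead: spilling trash back out costs a point."""
    score = 0
    for e in events:
        if e == "throw_away":
            score += 1
        elif e == "spill":
            score -= 1
    return score


# An honest robot picks up three pieces and stops.
honest = ["throw_away"] * 3

# A sneaky robot picks up three, knocks the can over, and picks them up again.
sneaky = ["throw_away"] * 3 + ["spill"] * 3 + ["throw_away"] * 3

if __name__ == "__main__":
    print(naive_reward(honest), naive_reward(sneaky))
    print(net_reward(honest), net_reward(sneaky))
```

Under the naive score the sneaky robot looks twice as productive as the honest one; under the net score the two tie, so knocking over the can no longer pays. The fix works only because someone anticipated the trick, which is exactly what makes defining "preferable" so hard.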