Note: CPU references in this post are all to Intel CPUs. Other CPU families took similar paths but on different timelines and with different trade-offs (e.g. the inclusion of FPU and/or MMU functionality in the CPU).
First, a historical ramble…
What follows is accurate enough for what follows…
As with so much on the web, Moore’s Law had a specific origin but has been through a number of updates/revisions/extensions to remain relevant to those who want it to remain relevant. Originally, it was about the number of transistors that could be built into a single semiconductor product. Presumably that number got awfully large and was meaningless to most people (transistor?), so Moore’s Law was sort of retooled to refer to compute capability (MIPS, FLOPS) or application performance (frames per second in a 3D video game, TPC-* for databases, etc.). If your widget was getting faster, then there was “an [Moore’s Law] for that” (to paraphrase Apple). And Moore saw and he was pleased.
But really all the faster-being was, of course, underpinned by the various dimensions of scaling for semiconductors. Processors (the things most people care about the most) are made using MOSFETs (a very common type of transistor used to build processors/logic, but a bit different from those in the original Moore’s Law) and Robert Dennard wrote a paper noting that MOSFETs have particular scaling properties. See Dennard Scaling: “[if] transistor dimensions could be scaled by 30% (0.7x) every technology generation, thus reducing their area by 50%. This would reduce circuit delays by 30% (0.7x) and therefore increase operating frequency by about 40% (1.4x). Finally, to keep the electric field constant, voltage is reduced by 30%, reducing energy by 65% and power (at 1.4x frequency) by 50%”. This was also known as “triple scaling” as it implied that three scaling factors would simultaneously improve: geometry decrease (density), frequency increase and power decrease (for equivalent functionality).
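The percentages in that quote all fall out of a single linear scale factor. Here is a minimal sketch of the arithmetic, assuming the usual textbook relations (capacitance scales with linear dimension, switching energy goes as C*V^2, power as energy times frequency). The quote does not spell these relations out, and the function name is purely illustrative.

```python
def dennard_generation(k: float = 0.7) -> dict:
    """Relative factors after one generation of ideal Dennard scaling.

    k is the linear scale factor applied to transistor dimensions and
    to supply voltage (0.7 means a 30% reduction).
    """
    area = k * k                 # area goes as the square of linear dimension
    delay = k                    # circuit delay shrinks with dimension
    frequency = 1.0 / delay      # shorter delay allows a faster clock
    voltage = k                  # voltage drops to keep the electric field constant
    capacitance = k              # assumption: capacitance scales with dimension
    energy = capacitance * voltage ** 2   # switching energy ~ C * V^2
    power = energy * frequency            # per-circuit power ~ C * V^2 * f
    return {
        "area": area,
        "delay": delay,
        "frequency": frequency,
        "voltage": voltage,
        "energy": energy,
        "power": power,
    }


if __name__ == "__main__":
    for name, factor in dennard_generation().items():
        print(f"{name:>9}: {factor:.2f}x")
```

With k = 0.7, area and power both land near 0.5x and energy near 0.35x, which is where the quote's "50%" and "65%" reductions come from. Since power per circuit falls as fast as area does, power density stays flat, and that is the property that stopped holding once voltage could no longer be lowered.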
As Dennard scaling started to break down due to the effects of smaller geometries (leakage current increased, so smaller circuits “leaked” power and that leakage started to degrade the overall power benefits), to an inability to continue lowering voltage (again, degrading power improvements) and to frequency stabilization (signals still have to propagate across some distance and smaller transistors had a harder time doing so; your 2020 CPU isn’t going to be much higher frequency than your old 2010 processor), the focus moved on to multi-core processors/systems and to heterogeneous computing.
Multi-core systems came about as an increasing number of transistors became available while diminishing returns hit clock speeds and raw instructions-per-cycle. L1, L2 and L3 caches were enlarged until they, too, produced diminishing returns. While systems had been multi-processor for a while, processors became multi-core as per-core performance started to flatten (with languages catching up a bit later) and “excess” transistors were available with each product in an upgraded product line.
Heterogeneous computing refers to the CPU offloading specialized compute tasks to associated specialized logic blocks or coprocessors. In earlier days, this was often floating point processing: an optional 8087 FPU (floating point math) could be installed alongside the 8088 CPU (memory access + logic + integer calculation). Then the FPU was bolted into the x86 with the 80486. Then a common coprocessor use was high performance networking, allowing commodity CPUs to offload network processing and recover CPU cycles. Then FPGAs were placed next to CPUs (in specialized cases) to provide highly customizable compute acceleration (e.g. video or audio encode/decode, Bitcoin mining, etc). (The niftiest application, IMHO, was the now defunct Leopard Logic (see also).) GPUs were also generally available, so they are also used as general purpose, though specialized, coprocessors. While the many variations on coprocessors could improve computational performance by peeling off ever larger chunks of traditional workloads, they, too, are limited by Dennard Scaling and by computational limits (e.g. a complete, optimized 32-bit FMA may not be able to be optimized further using additional transistors). GPUs temporarily circumvented this limitation by including ever more of their relatively simple compute units, but the utility of further parallel units will diminish. Various companies are now pursuing specialized coprocessors for AI (see Google’s TPU).
In any case, we actually do appear to be reaching the end of Moore’s Law as delivered by continuously shrinking process geometries.
And now the main ramble…
What follows is utter speculation…
Are there ways to keep going with scaling? Perhaps 3D chip technologies can help but these, when used for logic, are tremendously limited by power dissipation (without a process shrink, 50% more transistors ~= 50% more power). Other highly speculative techniques may produce non-CMOS-based logic.
One area that hasn’t received much attention is utilization (at least w.r.t. scaling). The hyperscalers are certainly paying attention to this factor but a significant (vast?) majority of the processors/co-processors sold worldwide run at very low utilization. Your mobile phone is powered down as much as possible (even if the screen is on); your laptop constantly cycles its clock frequency down in order to reduce power draw (and to avoid spinning the fan (more power)). So, around the world, huge amounts of processing power are sitting idle (on our phones, laptops, TVs and desks).
So what happens if we leave as little functionality on a phone or laptop as possible and leave the “computing” to the hyperscalers? We’re back to the good old days of thin-clients and thick-servers (sorta like “big iron” but more like heaps of “iron dust”). Assuming 75% of the world’s compute power is nearly un-utilized (~2.5% utilization) and 25% is heavily utilized (I’ve heard from hyperscalers that 33% utilization is a good average for hyperscalers), then we’re at about 10% utilization for global compute (0.75*2.5% + 0.25*33%). If “user” compute power (laptop, phone, etc) is constant (which it kinda is…) but demand for global compute continues rising, then hyperscalers have a significant incentive to increase utilization because additional compute is demanded and improved utilization has roughly no marginal cost (who doesn’t like free money?).
If the above obtains and compute deployment shifts and utilization improves, we might wind up with 25% of the world’s compute underutilized @ 10% and 75% of the world’s compute more heavily utilized @ 50%. Then global compute utilization would sit at roughly 40% (0.25*10% + 0.75*50%). If this shift were achieved over 3-5 years, that would represent a 7-12% annual improvement in free compute with no process improvements. Certainly not Moore’s Law’s ~45% increase, and the improvement isn’t that visible to users but, hey!, free faster-compute is still a good thing.
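The global utilization figures in this section are just share-weighted averages. A minimal sketch, using the shares and utilization levels assumed in the text (the function and variable names are hypothetical, and the inputs are the post's guesses rather than measured data):

```python
def global_utilization(segments):
    """Share-weighted average utilization.

    segments is a list of (share_of_world_compute, utilization) pairs,
    both expressed as fractions between 0 and 1.
    """
    total_share = sum(share for share, _ in segments)
    if total_share <= 0:
        raise ValueError("segment shares must sum to a positive number")
    return sum(share * util for share, util in segments) / total_share


# Today (assumed): 75% of compute on user devices at ~2.5% utilization,
# 25% with hyperscalers at ~33% utilization.
TODAY = [(0.75, 0.025), (0.25, 0.33)]

# After the speculated shift: 25% on user devices at 10%,
# 75% with hyperscalers at 50%.
SHIFTED = [(0.25, 0.10), (0.75, 0.50)]

if __name__ == "__main__":
    print(f"today:   {global_utilization(TODAY):.1%}")
    print(f"shifted: {global_utilization(SHIFTED):.1%}")
```

Changing the shares or the per-segment utilization in the two lists is a quick way to see how sensitive the conclusion is to those guesses.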
These notes on the end of Dennard Scaling are probably 75% accurate (it’s more complicated; some bits kept scaling; etc.) but that doesn’t really matter for this post; Dennard scaling is dead or greatly slowed.