14. Advanced Concepts
Concepts and Terms
Process Technology
- Process node - Technology generation (7nm, 5nm, 3nm, 2nm, etc.)
- Node - Short for process node
- Scaling - Making features smaller
- Moore's Law - Observation that transistor count doubles every ~2 years
- Dennard scaling - Power density stays constant as transistors shrink (broke down ~2005)
Design & Architecture
- ASIC (Application-Specific Integrated Circuit) - Custom chip for specific use
- FPGA (Field-Programmable Gate Array) - Reconfigurable chip
- SoC (System-on-Chip) - Complete system integrated on one chip
- IP (Intellectual Property) - Reusable design blocks
- Standard cell - Pre-designed logic gate building block
- Place and route - Arranging standard cells and routing the wires between them in a design
Memory
- SRAM (Static RAM) - Fast memory using 6 transistors per bit
- DRAM (Dynamic RAM) - Dense memory using 1 transistor + capacitor per bit
- Cache - Fast memory close to processor
- HBM (High-Bandwidth Memory) - Stacked DRAM with wide interface
- Memory hierarchy - Different levels of memory (cache, RAM, storage)
- Memory wall - Bottleneck where memory speed limits compute
AI Accelerator Concepts
- Tensor Core - Specialized unit for matrix operations
- INT8 - 8-bit integer arithmetic (fast, lower precision)
- FP16 - 16-bit floating point (half precision)
- FP32 - 32-bit floating point (single precision)
- Mixed precision - Using different precisions for different operations
- Training - Teaching AI model (compute-intensive)
- Inference - Running trained model (less intensive)
- Sparsity - Many zeros in data; can skip computation
Speech Content
Process nodes, scaling, memory hierarchies, and AI accelerators. Let's explore these advanced semiconductor concepts that define the cutting edge of chip design and manufacturing.
Introduction to Core Concepts
We'll cover process technology nodes and Moore's Law, the design ecosystem of ASICs and FPGAs, memory architectures from SRAM to high bandwidth memory, and specialized AI accelerators with tensor cores and mixed precision arithmetic. These concepts represent where semiconductor economics, physics limits, and computational demands collide.
Process Technology Deep Dive
Let's start with process nodes. When you hear "5 nanometer" or "3 nanometer," you might think this refers to the actual size of transistor features. It doesn't anymore. Modern process node names are essentially marketing terms. A so-called 5 nanometer node actually has transistor gates around 24 nanometers long. The node name instead references equivalent transistor density compared to historical nodes where the numbers actually matched physical dimensions.
The physics of scaling has become extraordinarily challenging. As features shrink below 10 nanometers, quantum tunneling becomes significant. Electrons can tunnel through gate oxides that are less than 1 nanometer thick. This creates leakage current even when transistors are supposed to be off. Short channel effects mean the gate loses electrostatic control over the channel. Line edge roughness, tiny imperfections in pattern edges, causes massive variability when features are only dozens of atoms wide.
To combat these issues, the industry moved from planar transistors to FinFETs around the 14 nanometer node. FinFETs wrap the gate around three sides of a vertical silicon fin, providing better control. At 3 nanometers and beyond, we're seeing Gate-All-Around transistors, also called nanosheets or nanowires, where the gate completely surrounds the channel. Samsung and TSMC are manufacturing these now, in 2023 and 2024.
The key enabler for sub-7 nanometer nodes is extreme ultraviolet lithography, or EUV, operating at 13.5 nanometer wavelength. ASML in the Netherlands has a complete monopoly on EUV scanners, which cost around 150 million dollars each and represent some of the most complex machines ever built. High numerical aperture EUV, with 0.55 NA versus the current 0.33, is coming for 2 nanometer and beyond, enabling even finer features but requiring massive infrastructure investment.
Now let's talk about Moore's Law and Dennard scaling. Moore's Law, the observation that transistor density doubles roughly every 18 to 24 months, continues but at enormous cost. A leading-edge 3 nanometer fab costs 15 to 20 billion dollars. Mask sets at advanced nodes run 5 to 30 million dollars. Only three companies, TSMC, Samsung, and Intel, are even attempting to stay at the leading edge.
Dennard scaling is even more important to understand. It predicted that as transistors shrink, power density stays constant because you can reduce voltage proportionally. This broke down around 2005 because supply voltage could no longer scale down proportionally: threshold voltage limits and excessive leakage set a floor. This caused the power wall, forcing the industry into multi-core processors and what's called dark silicon, where parts of the chip must stay powered off to remain within thermal limits.
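To see why, here is a minimal Python sketch using the standard dynamic-power relation, power per unit area proportional to activity times C times V squared times f, with made-up normalized numbers: under ideal Dennard scaling the power density stays flat, but once voltage stops scaling it roughly doubles every generation.

```python
# Minimal sketch (normalized, made-up numbers): dynamic power density is
# roughly activity * C * V^2 * f per unit area. Under ideal Dennard scaling
# it stays flat; with voltage frozen it roughly doubles per generation.

def power_density(C, V, f, area, activity=0.1):
    """Switching power per unit area; leakage is ignored for simplicity."""
    return activity * C * V**2 * f / area

k = 0.7  # linear shrink per generation
Cd, Vd, fd, ad = 1.0, 1.0, 1.0, 1.0   # ideal Dennard scaling
Cw, Vw, fw, aw = 1.0, 1.0, 1.0, 1.0   # post-2005: voltage no longer scales

print("gen   Dennard   frozen-V")
for gen in range(1, 6):
    Cd, Vd, fd, ad = Cd * k, Vd * k, fd / k, ad * k**2
    Cw, Vw, fw, aw = Cw * k, Vw,     fw / k, aw * k**2
    print(f"{gen:>3}   {power_density(Cd, Vd, fd, ad):7.2f}"
          f"   {power_density(Cw, Vw, fw, aw):8.2f}")
# The Dennard column stays at 0.10; the frozen-voltage column roughly
# doubles every generation -- that growth is the power wall.
```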
For building a lunar semiconductor industry, process nodes present interesting tradeoffs. The moon's ultra-high vacuum is naturally cleaner than any Earth cleanroom, potentially improving material deposition and reducing contamination. However, lower gravity affects chemical vapor deposition flow dynamics. Cosmic radiation and solar particles require radiation-hardened designs. Silicon-on-insulator substrates help with this. The limited materials palette on the moon might necessitate simpler process flows, suggesting older nodes like 65 nanometer or novel architectures that optimize for available resources.
For competing with TSMC from the West, the challenge is stark. ASML's monopoly on EUV is actually somewhat beneficial since they're based in the Netherlands and subject to Western export controls. US companies Applied Materials and Lam Research dominate deposition and etch equipment. The opportunity lies in novel approaches: using AI for process optimization, building specialized facilities for trailing nodes like 28 nanometer which are extremely mature and profitable, or focusing on 3D integration to extend density gains without requiring the latest lithography. Recruiting talent means looking beyond traditional semiconductor engineers to AI PhDs who understand physics and optimization.
Design and Architecture Landscape
Moving to design and architecture, let's distinguish ASICs from FPGAs. An Application-Specific Integrated Circuit or ASIC is a custom chip designed for one specific function. Every transistor is placed deliberately through a process called place-and-route. This uses sophisticated algorithms, often simulated annealing or analytical methods, to position standard cells and route metal interconnects between them.
Standard cells are pre-designed logic gate building blocks: NAND gates, NOR gates, flip-flops, and more complex functions. These are designed once per process node and extensively characterized for timing and power consumption. Companies like ARM, Synopsys, and Cadence provide these libraries. The designer specifies functionality in a hardware description language, which gets synthesized into a netlist of standard cells, then placed and routed to create the final layout.
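As a toy illustration of the placement step, here is a small simulated-annealing placer in Python. The cell names, nets, grid size, and cooling schedule are all invented for the example; real tools handle millions of cells plus timing, congestion, and legalization.

```python
# Toy simulated-annealing placement: put each cell on a grid site so that
# the total Manhattan wirelength of the nets is small. Accept worsening
# swaps with probability exp(-delta/T), the Metropolis criterion.
import math, random

random.seed(0)
cells = ["and0", "or0", "xor0", "ff0", "ff1", "inv0"]          # invented cell names
nets = [("and0", "or0"), ("or0", "xor0"), ("xor0", "ff0"),     # 2-pin nets
        ("ff0", "ff1"), ("ff1", "inv0"), ("inv0", "and0")]
grid = [(x, y) for x in range(3) for y in range(3)]            # 3x3 placement sites

def wirelength(place):
    """Total Manhattan length of all nets for a cell -> site mapping."""
    return sum(abs(place[a][0] - place[b][0]) + abs(place[a][1] - place[b][1])
               for a, b in nets)

place = dict(zip(cells, random.sample(grid, len(cells))))      # random start

T = 5.0
while T > 0.01:
    for _ in range(100):
        a, b = random.sample(cells, 2)
        old = wirelength(place)
        place[a], place[b] = place[b], place[a]                # propose a swap
        delta = wirelength(place) - old
        if delta > 0 and random.random() > math.exp(-delta / T):
            place[a], place[b] = place[b], place[a]            # reject: undo
    T *= 0.9                                                   # cool down

print("final total wirelength:", wirelength(place))
```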
FPGAs, or Field-Programmable Gate Arrays, take the opposite approach. They're pre-fabricated chips with arrays of logic blocks, usually lookup tables or LUTs, and programmable interconnects controlled by SRAM cells. You program the chip after manufacturing to implement your desired function. This flexibility comes at a huge cost: FPGAs are typically 10 to 100 times less efficient in area, speed, and power compared to an ASIC implementing the same function. But you can reprogram them, and you don't need to spend millions on mask sets.
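To make the LUT idea concrete, here is a minimal Python model of a single 4-input lookup table: the "configuration" is just a 16-entry truth table held in SRAM, shown here programmed as a 4-input XOR (the function choice is arbitrary).

```python
# A k-input LUT is a 2^k-entry truth table; programming the FPGA means
# loading these bits. Below, a 4-input LUT is configured as a 4-input XOR.

def make_lut4(truth_table):
    """truth_table: 16 bits, indexed by the 4 inputs packed into a number."""
    assert len(truth_table) == 16
    def lut(a, b, c, d):
        index = (a << 3) | (b << 2) | (c << 1) | d
        return truth_table[index]
    return lut

# Configuration bits for XOR: output is 1 when an odd number of inputs are 1.
xor_bits = [bin(i).count("1") & 1 for i in range(16)]
lut_xor = make_lut4(xor_bits)

assert lut_xor(1, 0, 0, 0) == 1
assert lut_xor(1, 1, 0, 0) == 0
assert lut_xor(1, 1, 1, 0) == 1
```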
System-on-Chip or SoC designs integrate everything: CPU, GPU, neural processing unit, memory controllers, input-output blocks, all on one die. Modern smartphone SoCs like Apple's A-series or Qualcomm's Snapdragon exemplify this. These designs rely heavily on IP blocks, reusable intellectual property like ARM CPU cores, which are licensed for millions to hundreds of millions of dollars plus royalties.
The EDA industry, Electronic Design Automation, is dominated by Synopsys and Cadence with combined revenues around 10 billion dollars. There's significant opportunity in open-source approaches. SkyWater has open-sourced a 130 nanometer process design kit. RISC-V provides an open instruction set architecture enabling custom accelerators without licensing fees.
A fascinating development is using AI for chip design itself. Google's DeepMind demonstrated using reinforcement learning for floorplanning, reducing design time from months to hours. This is a massive opportunity: applying AI to optimize placement, routing, and timing closure.
For lunar and Western fabs, chiplet architectures become critical. Instead of one monolithic die, you design multiple smaller chiplets manufactured potentially at different nodes or even different fabs, then integrate them through advanced packaging. The Universal Chiplet Interconnect Express or UCIe standard is emerging to enable this. This allows specialization: manufacture compute chiplets at 3 nanometer, input-output chiplets at mature 28 nanometer, memory chiplets optimized for density. Cold welding in vacuum, where atomically clean metal surfaces bond at room temperature, could revolutionize die-to-die bonding on the moon without thermal stress.
Memory Architectures and the Memory Wall
Now let's dive deep into memory. SRAM, or Static RAM, is the fastest memory type. Each bit requires six transistors configured as cross-coupled inverters that hold state without refresh. It's expensive in terms of area, consuming around 100 F squared per bit, where F is the feature size. This is why cache sizes are limited. L1 caches might be 32 to 64 kilobytes, L2 caches 256 kilobytes to a few megabytes, L3 caches up to 32 to 128 megabytes on high-end processors.
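A quick back-of-the-envelope calculation in Python shows why cache capacity is expensive, using the roughly 100 F squared per bit figure. Note that it takes the node name literally as F, which, as discussed above, understates real dimensions.

```python
# Back-of-the-envelope cache area from the ~100 F^2 per bit rule of thumb.
# F is taken literally from the node name here, which understates real
# dimensions; real macros also add decoders, sense amps, and redundancy.

F_nm = 5                                  # "5 nanometer" node name used as F
cell_area_nm2 = 100 * F_nm**2             # ~100 F^2 per SRAM bit
capacity_bytes = 32 * 1024                # a 32 KB L1 cache
bits = capacity_bytes * 8

array_area_mm2 = bits * cell_area_nm2 * 1e-12   # 1 mm^2 = 1e12 nm^2
print(f"{capacity_bytes // 1024} KB of bit cells: ~{array_area_mm2:.4f} mm^2")
```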
SRAM scaling faces serious challenges. As transistors shrink and vary more due to atomic-scale randomness, the minimum voltage for stable operation increases, limiting power benefits. Alternative SRAM cells like 8T or 10T designs improve stability but consume more area.
DRAM, Dynamic RAM, achieves much higher density using just one transistor and one capacitor per bit, around 6 F squared per bit. The capacitor holds a charge representing the bit value, but it leaks, requiring refresh every 64 milliseconds. Modern DRAM uses trench or stacked capacitors with high-k dielectrics like zirconium oxide and aluminum oxide to maintain sufficient capacitance as dimensions shrink. Reading is destructive, so every read requires writing the value back.
DRAM scaling is hitting fundamental limits. Capacitance is proportional to area divided by dielectric thickness. As you shrink area, maintaining adequate charge becomes harder. Leakage through thinner access transistors increases. Current DDR5 operates at 6,400 megatransfers per second, but latency hasn't improved much.
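Here is a rough Python estimate of DRAM retention from these quantities; the leakage current and sense margin are illustrative assumptions, chosen only to land in the vicinity of the 64 millisecond refresh interval.

```python
# Rough retention estimate: a storage capacitor of a few tens of
# femtofarads leaks charge until the sense amplifier can no longer resolve
# the bit. Leakage current and voltage margin below are assumed values.

C_storage = 25e-15       # farads (~25 fF storage capacitor)
V_margin = 0.3           # volts of charge loss the sense amp tolerates (assumed)
I_leak = 100e-15         # amps of leakage (assumed)

retention_s = C_storage * V_margin / I_leak    # t = C * dV / I
print(f"retention ~ {retention_s * 1e3:.0f} ms; the refresh interval is 64 ms")
```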
The memory hierarchy is designed around these tradeoffs: CPU registers accessible in one cycle, L1 cache in about 4 cycles, L2 in about 12 cycles, L3 in about 40 cycles, main DRAM in about 200 cycles, SSD storage in about 100,000 cycles. This creates the memory wall bottleneck. Compute performance historically grew 59 percent per year from 1986 to 2004, while DRAM latency improved just 1.1 times per decade.
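The memory wall can be summarized with a single average-memory-access-time calculation. The hit rates below are assumed, but the cycle counts are the ones quoted above.

```python
# Average memory access time (AMAT) using the cycle counts quoted above and
# assumed hit rates. Latencies are treated as end-to-end from the request.

levels = [            # (name, latency in cycles, hit rate at that level)
    ("L1", 4, 0.95),
    ("L2", 12, 0.80),
    ("L3", 40, 0.60),
    ("DRAM", 200, 1.00),
]

amat = 0.0
p_reach = 1.0                      # probability a request gets this far
for name, latency, hit in levels:
    amat += p_reach * hit * latency
    p_reach *= (1.0 - hit)

print(f"AMAT ~ {amat:.1f} cycles")
# With these numbers the ~0.4% of accesses that fall through to DRAM cost
# roughly as much total latency as all L2 and L3 hits combined.
```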
High Bandwidth Memory or HBM addresses bandwidth constraints through radical packaging innovation. HBM stacks 8 to 16 DRAM dies vertically using through-silicon vias, or TSVs, which are vertical conductors about 5 microns in diameter on 50 micron pitch passing through the silicon. This creates a 1,024 bit wide interface per stack. HBM3 delivers over 800 gigabytes per second per stack.
Manufacturing HBM is extremely challenging. Dies must be thinned to 40 microns without breaking. Each die must be known-good before stacking since yield multiplies across the stack. Micro-bump bonding with precise alignment is required. This makes HBM expensive, but it's essential for AI accelerators that are memory-bandwidth limited.
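The headline bandwidth figure falls out of a one-line calculation: a very wide interface running at a modest per-pin rate. The 6.4 gigabit per second per-pin number is the commonly quoted HBM3 rate and should be treated as an assumption here.

```python
# Where the ~800 GB/s per HBM3 stack comes from: width times per-pin rate.

bus_width_bits = 1024          # bits per stack (the wide TSV-fed interface)
pin_rate_gbps = 6.4            # gigabits per second per pin (assumed HBM3 rate)

bandwidth_GBps = bus_width_bits * pin_rate_gbps / 8
print(f"~{bandwidth_GBps:.0f} GB/s per stack")   # ~819 GB/s
```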
The memory industry is dominated by Samsung, SK Hynix, and Micron. It's highly cyclical and capital-intensive. HBM supply is currently constrained with gross margins around 20 percent versus 40 percent for high-end DRAM. There's huge opportunity in emerging non-volatile memories like magnetoresistive RAM or MRAM, and ferroelectric RAM or FeRAM, which could replace SRAM for low-power applications. Compute-in-memory architectures that perform operations directly in the memory array are also promising.
For lunar applications, DRAM presents challenges. The precise capacitor dielectric deposition requires atomic layer deposition with tight control. However, vacuum processing eliminates native oxide before dielectric deposition, potentially improving interface quality. Radiation causes bit flips, requiring more error-correcting code overhead. HBM assembly in vacuum could enable novel bonding without coefficient of thermal expansion mismatch concerns from atmospheric pressure changes.
For Western fabs, US DRAM capabilities have largely atrophied except for Micron. The big opportunity is in emerging memories where the game hasn't been decided. Intel and Micron's 3D XPoint was discontinued but the technology remains viable. Spin-transfer torque MRAM companies like Everspin are growing. Startups exploring compute-in-memory include Mythic with analog computing and Syntiant with analog neural networks. AI can help here too: using generative models to explore novel SRAM cell topologies, optimizing the tradeoff between area, stability, and power.
AI Accelerator Architecture and Arithmetic
Finally, let's explore AI accelerators. The key building block is the tensor core or matrix engine: specialized hardware performing matrix multiply-accumulate operations. The computation is D equals A times B plus C, where A, B, C, and D are matrices. NVIDIA's tensor cores, introduced with the Volta architecture, perform 4 by 4 by 4 matrix operations per cycle. Google's TPU uses systolic arrays: two-dimensional grids of multiply-accumulate units where data flows through the grid, maximizing reuse and minimizing memory access.
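Functionally, what a tensor core computes can be sketched in a few lines of NumPy: the matrix product is decomposed into fixed-size tiles, with low-precision inputs accumulated at higher precision. This is a behavioral model only, not a description of the hardware datapath.

```python
# D = A @ B + C computed as many small tile-level multiply-accumulates,
# with FP16 inputs and FP32 accumulation.
import numpy as np

def tiled_matmul(A, B, C, tile=4):
    """D = A @ B + C in tile x tile x tile steps; FP16 in, FP32 accumulate."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    D = C.astype(np.float32).copy()
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                a = A[i:i+tile, k:k+tile].astype(np.float32)
                b = B[k:k+tile, j:j+tile].astype(np.float32)
                D[i:i+tile, j:j+tile] += a @ b        # one "tensor core" step
    return D

A = np.random.randn(8, 8).astype(np.float16)
B = np.random.randn(8, 8).astype(np.float16)
C = np.zeros((8, 8), dtype=np.float32)
assert np.allclose(tiled_matmul(A, B, C),
                   A.astype(np.float32) @ B.astype(np.float32) + C, atol=1e-3)
```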
Understanding numeric formats is crucial. FP32, 32-bit floating point, has 1 sign bit, 8 exponent bits, and 23 mantissa bits. This is standard training precision. FP16, 16-bit floating point, has 5 exponent bits and 10 mantissa bits, offering twice the density and faster computation, but with narrow range from about 6 times 10 to the minus 5 up to 65,504.
Google introduced BF16, or BFloat16, with 8 exponent bits and 7 mantissa bits. This preserves FP32's range while still being 16 bits, making it excellent for training. INT8, 8-bit integers with just 256 possible values, offers 4 times FP32 density and has dominated inference since 2018. New FP8 formats with different exponent and mantissa distributions are emerging.
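A small NumPy experiment makes the tradeoffs visible. Since NumPy has no native bfloat16, BF16 is emulated here by truncating the low 16 bits of an FP32 value.

```python
# FP16 keeps more mantissa bits (precision) but has a narrow range;
# BF16 keeps FP32's exponent (range) but fewer mantissa bits.
import numpy as np

def to_bf16(x):
    """Truncate FP32 to bfloat16 precision (keep sign, 8 exp, 7 mantissa bits)."""
    a = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = a.view(np.uint32) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)

x = 1.0 / 3.0
print("fp32:", np.float32(x))                     # ~0.33333334
print("fp16:", np.float16(x))                     # ~0.3333 (10 mantissa bits)
print("bf16:", to_bf16(x)[0])                     # ~0.332  (only 7 mantissa bits)
print("fp16 max:", np.finfo(np.float16).max)      # 65504 -- narrow range
print("bf16 still finite at 1e38:", to_bf16(1e38)[0])  # keeps fp32-like range
```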
Mixed precision training uses FP16 or BF16 for forward and backward passes but maintains FP32 master weights to prevent accumulation errors. This gives most of the speedup with minimal accuracy loss.
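Here is a minimal sketch of why the FP32 master copy matters. The learning rate and gradient values are arbitrary, chosen so the per-step update sits below FP16 resolution near weights of magnitude one; real frameworks also add loss scaling and overflow checks.

```python
# Mixed precision: compute in FP16, but accumulate weight updates in FP32.
import numpy as np

lr = 1e-3
grad = 1e-2                                  # pretend per-step gradient
steps = 1000

# FP32 master weights, updated from FP16-produced gradients.
master = np.ones(4, dtype=np.float32)
for _ in range(steps):
    g16 = np.float16(grad)                   # gradient as produced in FP16
    master -= lr * np.float32(g16)           # accumulate the update in FP32
print("fp32 master after training:", master[0])   # ~0.99, updates preserved

# Same update applied directly to FP16 weights: each step of ~1e-5 is far
# below the FP16 spacing near 1.0 (~5e-4), so it rounds away entirely.
w16 = np.ones(4, dtype=np.float16)
for _ in range(steps):
    w16 -= np.float16(lr) * np.float16(grad)
print("fp16-only after training:   ", w16[0])     # still 1.0
```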
Training and inference have different requirements. Training needs backpropagation, which is compute and memory intensive, requires higher precision to prevent gradient underflow, and benefits from large batch sizes. Inference is a single forward pass, latency-sensitive, and benefits enormously from quantization. Training dominates accelerator revenue, with NVIDIA's H100 selling for 25,000 to 40,000 dollars, but inference volume is much higher when including edge devices.
Sparsity exploitation is increasingly important. Neural networks often have many zero values, especially after ReLU activation, which zeros out negative values. Structured sparsity using block-sparse matrices enables skipping computation on zero blocks. Unstructured sparsity requires metadata to track locations. NVIDIA's A100 supports 2:4 sparsity, meaning 2 zeros per 4 elements, with 2 times throughput. The challenges are the memory bandwidth cost of sparse formats and the overhead of managing metadata, and the benefits diminish below 50 percent sparsity.
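As a rough illustration, here is 2:4 structured pruning in NumPy: in every group of four weights, keep the two largest magnitudes and zero the rest, so hardware can skip half the multiplies at a small metadata cost.

```python
# Enforce a 2:4 sparsity pattern: exactly two nonzeros in every group of 4.
import numpy as np

def prune_2_of_4(w):
    """Zero the two smallest-magnitude entries in every group of 4 weights."""
    w = np.asarray(w, dtype=np.float32)
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # 2 smallest per group
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)

w = np.array([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01])
print(prune_2_of_4(w))
# -> [ 0.9   0.    0.   -0.7   0.    0.3  -0.25  0.  ]  (2 nonzeros per 4)
```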
NVIDIA dominates training accelerators, largely due to CUDA's software moat. AMD's MI300 is competitive but ecosystem lock-in is powerful. Startups like Cerebras with wafer-scale engines, Graphcore's Intelligence Processing Unit, and SambaNova have struggled against NVIDIA's integrated hardware-software offering. Inference is more fragmented: ARM CPUs, Qualcomm neural processing units, Google's Edge TPU, and Apple's Neural Engine all compete.
Novel architectures offer potential leapfrogs. Analog in-memory computing uses resistive crossbars where resistance encodes weights and Ohm's law performs matrix-vector multiplication. The energy is constant regardless of matrix size, but precision and accuracy are limited by analog noise. Photonic interconnects offer massive bandwidth and energy efficiency for die-to-die communication. Superconducting logic using rapid single-flux-quantum pulses operates at 4 Kelvin and offers roughly 100 times energy efficiency, though cryogenic overhead is substantial.
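Functionally, an analog crossbar computes a matrix-vector product by physics: conductances encode the weights, input voltages encode the activations, and the column currents sum the products. The sketch below adds a made-up noise term to show why precision is limited.

```python
# Functional model of an analog crossbar: I = G @ V by Ohm's and
# Kirchhoff's laws. The noise level is an arbitrary assumption.
import numpy as np

rng = np.random.default_rng(1)
G = rng.uniform(0.0, 1.0, size=(8, 16))    # conductances (weights), arbitrary units
V = rng.uniform(0.0, 1.0, size=16)         # input voltages (activations)

I_ideal = G @ V                            # what the array computes "for free"
I_noisy = I_ideal + rng.normal(0.0, 0.02 * I_ideal.mean(), size=I_ideal.shape)

rel_err = np.abs(I_noisy - I_ideal) / I_ideal
print(f"mean relative error from analog noise: {rel_err.mean():.1%}")
# A couple of percent of error is only ~5-6 bits of effective precision.
```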
AI can be used to design better AI hardware, a recursive improvement. Reinforcement learning optimizes scheduling and mapping of neural networks to hardware. Neural architecture search can co-optimize the model and hardware together. Learned quantization schemes find optimal bit-width allocations.
On the moon, vacuum enables novel cooling approaches. You can radiate directly to the lunar night sky at 4 Kelvin with no convection losses. Josephson junctions for superconducting logic are easier without needing cryogenic containment. Photonic integration benefits from vacuum with no atmospheric absorption. Power constraints favor inference over training. Radiation-hardened designs with error correction and redundancy are essential.
For Western fabs competing with TSMC, the US has strength in design with NVIDIA, AMD, Apple, and numerous startups. The opportunity is in co-packaged optics with companies like Intel and Ayar Labs, chiplet architectures mixing logic nodes for compute with mature nodes for input-output, and in-memory computing that bypasses the von Neumann bottleneck. Vacuum packaging enables copper interconnects without oxidation concerns, higher voltage operation without dielectric breakdown from air, and potentially simpler thermal management.
Robotics and automation can dramatically improve AI accelerator development. Automated neural architecture search requires rapid iteration, benefiting from small-lot manufacturing with robotic wafer handling and automated test and characterization. AI-driven yield learning using defect pattern recognition and process parameter optimization accelerates development. Digital twins of fabs allow virtual experimentation before committing to expensive silicon.
Historical ideas worth revisiting include wafer-scale integration, which failed in the past due to defects but is now viable with defect tolerance, as Cerebras demonstrates. Optical computing attempted in the 1980s was limited by materials, but photonic integration is now mature. Analog neural networks from the 1980s suffered from precision and programmability issues, but better memristors and resistive RAM may enable them. Reversible computing with adiabatic logic could reduce energy dissipation toward the Landauer limit of kT ln 2 per bit operation. Neuromorphic computing with spiking neural networks is event-driven and efficient, but the software ecosystem remains immature.
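For scale, the Landauer limit mentioned above evaluates to a tiny number. The comparison against roughly a picojoule per digital multiply-accumulate is an order-of-magnitude assumption, included only to show the theoretical headroom.

```python
# Minimum energy to erase one bit: E = k * T * ln(2).
import math

k_B = 1.380649e-23            # Boltzmann constant, J/K

for T in (300.0, 4.0):        # room temperature and cryogenic operation
    e_min = k_B * T * math.log(2)
    print(f"T = {T:5.0f} K : Landauer limit ~ {e_min:.2e} J per bit")
# ~2.9e-21 J at 300 K and ~3.8e-23 J at 4 K, versus roughly 1e-12 J for a
# typical digital multiply-accumulate today (order-of-magnitude assumption).
```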
Review of Core Concepts
We've covered process nodes as marketing terms disconnected from physical dimensions, Moore's Law continuing through extreme innovation cost, Dennard scaling breakdown creating the power wall, ASIC design using standard cells and place-and-route, FPGAs offering flexibility at efficiency cost, SoC integration enabled by IP licensing, SRAM's six-transistor speed with area cost, DRAM's one-transistor-one-capacitor density requiring refresh, HBM's stacked architecture with through-silicon vias, the memory wall bottleneck between compute and memory speed, tensor cores performing matrix operations, mixed precision arithmetic balancing speed and accuracy, training versus inference tradeoffs, sparsity exploitation skipping zero computations, analog in-memory computing for energy efficiency, and numerous opportunities for novel architectures and lunar or Western manufacturing advantages.
Technical Overview
Process Technology
Process Node & Scaling: Process nodes (7nm, 5nm, 3nm, 2nm, etc.) nominally refer to feature sizes but are now largely marketing terms disconnected from actual physical dimensions. Modern "5nm" transistor gates are ~24nm long; the number references equivalent transistor density relative to historical nodes. True scaling involves reducing gate length, pitch (distance between features), and contact dimensions. Physics challenges include: quantum tunneling through thin gate oxides (<1nm), subthreshold leakage, short-channel effects, line-edge roughness impact on variability, and RC delay dominance as wire dimensions shrink.
Advanced nodes use FinFETs (14nm+) and Gate-All-Around FETs/nanosheets (3nm+) for better electrostatic control. Future: complementary FETs (stacked NMOS/PMOS), 2D materials (MoS2, graphene), negative capacitance FETs. EUV lithography (13.5nm wavelength) enables sub-10nm patterning; high-NA EUV (0.55 vs 0.33) targets 2nm and beyond.
Moore's Law & Dennard Scaling: Moore's Law (doubling transistor density every ~18-24 months) continues through extreme innovation but at higher cost. Dennard scaling ended ~2005 when voltage couldn't scale proportionally (leakage, threshold voltage limits), causing power density to increase - the "power wall" forcing multi-core architectures and dark silicon (parts of chip unused to stay within thermal limits).
Industry Economics: Leading-edge nodes cost $15-20B per fab (TSMC 3nm). Mask sets cost $5-30M at advanced nodes. Only TSMC, Samsung, Intel pursue leading edge. Opportunities: trailing-node specialization (28nm is extremely mature and profitable), 3D integration to extend density gains without lithographic scaling, novel materials/devices.
Moon Considerations: UHV enables ultra-clean processing, potentially better material deposition. Lower gravity affects chemical vapor deposition flow dynamics, liquid handling. Radiation hardening requirements for cosmic rays/solar particles - SOI (silicon-on-insulator) helps. Limited materials necessitate simpler process flows - older nodes (65nm+) or novel architectures optimizing for available resources.
Western Fab Competition: ASML (Netherlands) monopoly on EUV; Applied Materials, Lam Research (US) strong in deposition/etch. Training talent: leverage AI PhDs' understanding of physics/optimization for process development. AI opportunities: reinforcement learning for process optimization, generative models for novel device structures, automated defect classification. Rapid experimentation with small-lot pilot lines.
Design & Architecture
ASIC vs FPGA: ASICs require full custom layout via place-and-route algorithms (simulated annealing, analytical methods) that position standard cells and route metal interconnects. Standard cells (NAND, NOR, flip-flops) designed once per process node, characterized for timing/power. Modern tools: Cadence, Synopsys. FPGA uses pre-fabricated logic blocks (LUTs) and programmable interconnects (SRAM-controlled switches or antifuses), ~10-100x less efficient but reconfigurable.
SoC & IP: Modern SoCs integrate CPU, GPU, NPU, memory controllers, I/O. IP blocks (ARM cores, SerDes, memory controllers) licensed for $1M-$100M+ plus royalties. Standard cell libraries critical - sizing variations (high-performance vs low-power) provide design flexibility. Physical design challenges: clock distribution (skew <10ps), power delivery (IR drop), signal integrity at multi-GHz.
Industry: Major EDA vendors: Synopsys, Cadence (combined ~$10B revenue). ARM dominates mobile CPU IP. Opportunity: open-source PDKs (SkyWater 130nm), RISC-V open ISA enabling custom accelerators. AI for place-and-route: Google's DeepMind reduced chip design time from months to hours using RL for floorplanning.
Moon/Western Fab: Chiplet architectures critical - design once, manufacture at different nodes/locations. UCIe (Universal Chiplet Interconnect Express) standard emerging. Open-source EDA tools maturing (OpenROAD). Western advantage in design tools/talent (Silicon Valley, Austin, Boston). Cold welding in vacuum enables die-to-die bonding without thermal stress.
Memory
SRAM: 6T (6-transistor) cell stores bit via cross-coupled inverters. Fast (sub-ns access), expensive area-wise (~100F² per bit where F=feature size). Used for caches (L1/L2/L3). Scaling challenges: Vmin instability from device variation (requires statistical modeling), increasing leakage. 8T/10T cells improve read/write stability at cost of area.
DRAM: 1T1C (one transistor, one capacitor) per bit. Capacitor holds ~10-25 fF, must refresh every 64ms (destructive read). Area ~6F² per bit. Modern DRAM uses trench or stacked capacitors (high-k dielectrics: ZrO2-Al2O3). Challenges: capacitance maintenance as scaling reduces volume, leakage through thinner access transistors. DDR5 at 6400 MT/s.
HBM: Stacked DRAM dies (8-16 high) with through-silicon vias (TSVs, ~5μm diameter, 50μm pitch) providing 1024-bit interface per stack. HBM3: 819 GB/s per stack. Manufacturing: die thinning to 40μm, micro-bump bonding, known-good-die testing critical (yield multiplication across stack). Expensive but essential for bandwidth-hungry AI accelerators.
Memory Hierarchy & Wall: CPU register (1 cycle) → L1 cache (~4 cycles, ~32KB) → L2 (~12 cycles, ~256KB) → L3 (~40 cycles, ~32MB) → DRAM (~200 cycles, GB-TB) → SSD (~100K cycles). Memory wall: compute performance grew 59%/year (1986-2004) while DRAM latency improved 1.1x/10 years. Solutions: larger caches, prefetching, HBM, near-memory compute.
Industry: Samsung, SK Hynix, Micron dominate DRAM/NAND. DRAM highly cyclical, capital-intensive. HBM supply constrained (~20% gross margins vs ~40% for high-end DRAM). Opportunity: SRAM alternatives (MRAM, FeRAM for non-volatile embedded), compute-in-memory architectures (analog crossbars for matrix multiply).
Moon: DRAM capacitor dielectrics need precise process control (atomic layer deposition). In vacuum, eliminating native oxide before dielectric deposition improves interface quality. Radiation causes bit flips - ECC overhead higher. HBM assembly in vacuum could enable novel bonding approaches without CTE mismatch concerns from atmosphere.
Western Fab: US DRAM capabilities atrophied (Micron primarily domestic). Huge opportunity in emerging memories: Intel/Micron 3D XPoint (discontinued but technology viable), spin-transfer torque MRAM (Everspin, Applied Materials tooling). Compute-in-memory startups: Mythic (analog), Syntiant (analog neural networks). AI for SRAM cell design: generative models exploring 7T/8T/9T topologies optimizing area-stability-power tradeoffs.
AI Accelerator Concepts
Tensor Cores & Matrix Engines: Specialized units performing D=A×B+C where A,B,C,D are matrices. NVIDIA Tensor Cores (Volta+): 4×4×4 matrix multiply-accumulate per cycle. Systolic arrays (Google TPU): 2D grid of MACs with data flowing through, maximizing data reuse. Implementation: deeply pipelined multipliers, large accumulator trees, optimized for specific precisions.
Numeric Formats: FP32 (1 sign, 8 exponent, 23 mantissa bits): standard training precision. FP16 (5 exp, 10 mantissa): 2x density, faster, but narrow range (6e-5 to 65504). BF16 (8 exp, 7 mantissa): Google's solution preserving FP32 range. INT8: 256 values, 4x FP32 density, dominated inference by 2018. FP8 formats (E4M3, E5M2) emerging. Mixed precision: FP16/BF16 forward/backward pass with FP32 master weights.
Training vs Inference: Training requires backpropagation (compute & memory intensive), high precision (accumulation errors), large batch sizes. Inference: single forward pass, latency-sensitive, benefits from quantization. Training dominates AI accelerator revenue (H100 80GB at $25-40K) but inference volume higher (edge devices).
Sparsity: Structured sparsity (block-sparse matrices) enables skipping zero blocks. Unstructured sparsity requires metadata overhead. NVIDIA A100 supports 2:4 sparsity (2 zeros per 4 elements) with 2x throughput. Activation sparsity (ReLU creates zeros) exploitable with gating. Challenges: memory bandwidth for sparse formats, diminishing returns below 50% sparsity.
Industry & Opportunities: NVIDIA dominates training (CUDA moat). AMD MI300 competitive. Startups (Cerebras wafer-scale, Graphcore IPU, SambaNova) struggle against NVIDIA ecosystem. Inference more fragmented: ARM CPUs, Qualcomm NPUs, Google Edge TPU, Apple Neural Engine.
Novel architectures: analog in-memory compute (resistive crossbars for matrix-vector multiply at constant energy regardless of matrix size, but limited precision/accuracy), photonic interconnects (bandwidth, energy efficiency for die-to-die communication), superconducting logic (RSFQ - single-flux-quantum pulses, operates 4K, ~100x energy efficiency but cryogenic overhead).
AI for AI Hardware: Reinforcement learning for scheduling/mapping neural networks to hardware. Neural architecture search co-optimizing model and hardware. Learned quantization schemes. Auto-tuning libraries (TVM, MLIR compilers).
Moon: Vacuum operation enables novel cooling: direct radiation to lunar night sky (4K), no convection. Josephson junctions for superconducting logic easier without cryogenic containment concerns. Photonic integration benefits from vacuum (no absorption). Power constraints favor inference over training. Radiation-hardened designs: error correction, redundancy.
Western Fab: US strength in design (NVIDIA, AMD, Apple, startups). TSMC manufactures most. Opportunity: co-packaged optics (Intel, Ayar Labs), chiplets for mixing logic nodes (5nm compute) with mature nodes (28nm I/O), in-memory compute (bypass von Neumann bottleneck). Vacuum packaging: enables copper interconnects without oxidation, higher voltage operation (no dielectric breakdown from air), simpler thermal management.
Automation & Robotics: Automated neural architecture search requires rapid iteration - small-lot manufacturing with robotic wafer handling, automated test/characterization. AI-driven yield learning: defect pattern recognition, process parameter optimization. Digital twins of fabs for virtual experimentation before committing to silicon.
Historical & Novel Ideas: Wafer-scale integration (failed due to defects, now viable with Cerebras using defect tolerance). Optical computing (attempted 1980s, limited by materials - now mature photonic integration). Analog neural networks (1980s limited by precision/programmability - now with better memristors/ReRAM). Reversible computing (adiabatic logic reducing energy dissipation toward Landauer limit of kT ln2 per bit). Neuromorphic computing (spiking neural networks, event-driven, but software ecosystem immature). Quantum annealing for optimization problems (D-Wave, limited problem classes). Cryogenic CMOS (77K operation reduces leakage, increases mobility - relevant for co-packaging with quantum/superconducting systems).