Machine Learning Series: 2. Energy Efficiency in AI Accelerators

Kapil Bansal
4 min read · Aug 8, 2024


As an example, Nvidia’s H100 processors are projected to sell 1.5 million units in 2023 and 2 million units in 2024. With an average TDP of 500 W per unit and an additional 20% for cooling, that fleet would consume roughly 20 terawatt-hours of electricity per year, enough to power more than one million homes. At an electricity cost of $0.10 per kWh, the annual bill would be around $2 billion.
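
For readers who want to sanity-check those numbers, here is a quick back-of-envelope sketch using the same assumptions (unit counts, 500 W TDP, 20% cooling overhead, $0.10/kWh, 24x7 operation):

```python
# Back-of-envelope check of the H100 fleet estimate above.
# Assumptions (from the article): 1.5M + 2M units, 500 W average TDP,
# +20% for cooling, $0.10 per kWh, running around the clock.

units = 1_500_000 + 2_000_000          # H100s shipped across 2023 + 2024
tdp_w = 500                            # average TDP per unit (W)
cooling = 1.20                         # +20% for cooling
hours_per_year = 24 * 365

power_w = units * tdp_w * cooling
energy_kwh = power_w / 1000 * hours_per_year
energy_twh = energy_kwh / 1e9
cost_usd = energy_kwh * 0.10

print(f"Annual energy: {energy_twh:.1f} TWh")      # ~18.4 TWh, i.e. roughly 20 TWh
print(f"Annual cost:   ${cost_usd / 1e9:.1f} B")   # ~$1.8B, i.e. roughly $2 billion
```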

The AI chip market is projected to grow at a CAGR of 30% until 2030. Elon Musk recently predicted that, following the current silicon shortage, we might face shortages in voltage transformers and electricity within a year or two.

While Total Cost of Ownership (TCO) is currently dominated by chip costs, energy costs are expected to become a larger share in the future. AI performance per dollar matters to cloud service providers, and energy efficiency is already critical for battery-powered devices.

This article dives into the various energy optimization options.

Energy Efficiency Through Process Nodes: One aspect of hardware energy efficiency comes from the choice of process node. The table below gives a simple first-order impact assessment.

Figure 1: Scaling across process nodes

Energy consumption for matrix multiplication in a 5 nm AI chip is about 79% lower than in a 14 nm chip. For perspective, one-fifth the power turns a $1 billion electricity bill into roughly $200 million, and it also enables new use cases and applications on battery-powered devices. However, iso-performance cost increases because lower process nodes are more complex and expensive, and rising power density is a separate challenge that only gets worse as process nodes shrink.

Operation-wise Energy Consumption: The table below compares the energy consumed by common operations performed on a vector inside the silicon (Int8 data type, 128 elements).

Figure 2: Energy-consuming operations inside AI silicon

Architecture optimization aims to minimize net energy consumption, calculated as:

Energy = Σ_i (frequency_i × energy_per_operation_i), where frequency_i is the number of times operation i occurs.

For instance, using HBM instead of DDR memory can save about one-third of the corresponding power while providing performance gains from HBM’s higher throughput. Implementing on-chip SRAM instead of DDR or HBM can significantly increase TOPS/Watt and performance. Any architecture that minimizes energy wasted in the last four rows of the table can achieve good TOPS/W. Optimizations can be hardware-only, hardware-software co-design, or software-only, such as quantization, which shrinks data-type sizes and the storage needed for model parameters without sacrificing accuracy. That topic deserves a separate article in itself.
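
To make the energy formula concrete, here is a minimal sketch of that sum. The per-operation energies and the counts are placeholders for illustration, not the values from Figure 2:

```python
# Total energy = sum over operation types of (count of that operation) x
# (energy per operation). The pJ values and counts below are assumed.

ENERGY_PJ = {                 # assumed pJ per operation on a 128-wide Int8 vector
    "mac_vector":   80,
    "sram_read":   200,
    "hbm_read":   2000,
    "ddr_read":   6000,
}

op_counts = {                 # hypothetical counts for one inference
    "mac_vector": 5_000_000,
    "sram_read":  2_000_000,
    "hbm_read":     100_000,
    "ddr_read":           0,  # an HBM-only design avoids DDR traffic entirely
}

total_pj = sum(op_counts[op] * ENERGY_PJ[op] for op in op_counts)
print(f"Energy per inference: {total_pj / 1e6:.1f} uJ")
# Swapping memory tiers or cutting counts (sparsity, reuse, fusion) shrinks
# this sum directly, which is what the following sections aim to do.
```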

The table above also highlights why an inference chip with a low TOPS requirement can be more efficient than one with a high TOPS requirement, and why it makes sense to build different AI chips for different use-case segments.

Matrix Multiplication Optimization: Matrix multiplication is a primary power consumer; in a well-optimized architecture, a higher percentage of total power goes into matrix multiplication itself. There are various implementation choices that trade off power and silicon area. Techniques such as specialized circuits and in-SRAM compute can drastically reduce the energy of this operation. This topic will be covered in more detail in a separate follow-up article.

Exploiting Sparsity: Sparsity in data structures can be used to eliminate unnecessary multiplications and additions. For example, if 30% of the elements in a matrix are zeros, a 30% reduction in MAC operations can save power and increase throughput. Detailed implementation options for efficient sparsity will be covered in a separate article.
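
As a rough illustration, the sketch below skips MAC operations wherever the weight is zero; the 30% sparsity level and the random data are purely illustrative:

```python
# Skip multiplies and adds for zero weights: fewer MACs means less energy.
import random

def sparse_dot(weights, activations):
    macs = 0
    acc = 0
    for w, a in zip(weights, activations):
        if w == 0:
            continue           # zero weight: no multiply, no add, no energy
        acc += w * a
        macs += 1
    return acc, macs

n = 1024
weights = [0 if random.random() < 0.30 else random.randint(-128, 127) for _ in range(n)]
acts    = [random.randint(-128, 127) for _ in range(n)]

_, macs = sparse_dot(weights, acts)
print(f"MACs executed: {macs} of {n} (~30% skipped)")
```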

Minimizing Temporary Data: This is an interesting optimization that falls under hardware-software co-design. Larger computation chunks generate less temporary data, saving the energy that would otherwise be spent storing and retrieving that data.
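
A small illustration of the idea, with assumed array sizes: computing y = relu(x*s + b) in small steps materializes two temporaries that travel to memory and back, while one fused pass would keep them on-chip:

```python
import numpy as np

# Assumed workload: 1M float32 activations scaled, shifted, then passed through ReLU.
x = np.random.randn(1 << 20).astype(np.float32)
s, b = np.float32(0.5), np.float32(0.1)

# Small-chunk (unfused) version: each step materializes an intermediate array
# that must be written to memory and read back by the next step.
t1 = x * s                      # temporary #1
t2 = t1 + b                     # temporary #2
y  = np.maximum(t2, 0.0)

extra_bytes = 2 * 2 * x.size * x.itemsize      # 2 temporaries, each written + read
print(f"Intermediate traffic: {extra_bytes / 2**20:.0f} MiB")   # ~16 MiB

# A fused kernel (one large computation chunk on the hardware) keeps t1 and t2
# in registers or local storage, so that traffic and its energy disappear.
```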

Chiplets and Large Chips: Especially for training or inference of very large models, one chunk of computation (e.g., one decoder layer during inference) can be large yet still fit efficiently within a single chip. Using large chips or chiplets avoids the power lost to chip-to-chip transfers. Compare a few terabytes passing through two interfaces, one at 10 pJ per bit and the other at 1 pJ per bit; the difference highlights the efficiency gains of this approach.
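
A quick sketch of that comparison, taking “a few terabytes” to mean 4 TB purely for illustration:

```python
# Energy to move data across a chip-to-chip interface: bits x energy per bit.

terabytes = 4                           # assumed value for "a few terabytes"
bits = terabytes * 1e12 * 8

for pj_per_bit in (10, 1):
    joules = bits * pj_per_bit * 1e-12
    print(f"{pj_per_bit:>2} pJ/bit -> {joules:.0f} J to move {terabytes} TB")
# 10 pJ/bit costs 320 J vs. 32 J at 1 pJ/bit: a 10x gap on every transfer.
```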

Batch Processing: Matrix multiplication power drops when one input stays constant across consecutive operations. If the weights (one of the inputs) can remain unchanged for several back-to-back matrix multiplications, power consumption improves noticeably; it also avoids reloading the weights from local SRAM every cycle, saving further power. The cost is higher latency for an individual inference, although overall throughput in inferences per second stays the same. Many cloud applications can tolerate higher per-inference latency, for example background processing of pictures.
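
Here is a minimal sketch of the weight-reuse effect; the SRAM read energy and tile size are assumptions for illustration, not values from the article’s table:

```python
# Weight reuse across back-to-back matmuls: fetch the weight tile from SRAM
# once per batch instead of once per operation.

SRAM_READ_PJ_PER_BYTE = 5.0      # assumed SRAM read energy, not from Figure 2
weight_tile_bytes = 128 * 128    # one Int8 weight tile (assumed size)

def weight_fetch_energy_uj(batch, reuse):
    fetches = 1 if reuse else batch
    return fetches * weight_tile_bytes * SRAM_READ_PJ_PER_BYTE * 1e-6

for batch in (1, 8, 32):
    print(f"batch={batch:>2}: "
          f"{weight_fetch_energy_uj(batch, False):.2f} uJ without reuse, "
          f"{weight_fetch_energy_uj(batch, True):.2f} uJ with reuse")
```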

Sequencing and Data Type Encoding: Minimizing the number of input bit toggles across consecutive matrix multiplications saves power. Sequencing operations to reduce toggles, and choosing a data-type encoding that exploits the non-flat probability distribution of weights, can further enhance efficiency.
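
A small sketch of the toggle argument: dynamic power on a bus or multiplier input tracks how many bits flip between consecutive operands, so an ordering that keeps consecutive values similar saves power. The weight values and the simple sort-based reordering below are illustrative only:

```python
# Count bit flips (toggles) between consecutive Int8 operands.

def toggles(seq, bits=8):
    mask = (1 << bits) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(seq, seq[1:]))

weights = [0x0F, 0xF0, 0x0E, 0xF1, 0x0D, 0xF3]   # arbitrary example values
reordered = sorted(weights)                       # one simple toggle-reducing order

print("original order:", toggles(weights))        # 36 bit flips
print("sorted order:  ", toggles(reordered))      # 13 bit flips
```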

Comparing Architectures for Energy Efficiency: Metrics like Joules per inference and inferences per dollar are useful but may not reveal all optimization opportunities. A detailed model comparing actual operation frequencies against the theoretical minimum for a given model can be more insightful.

Note: While this article focuses on hardware-only or co-design energy optimizations, significant savings also come from software optimizations like algorithm tuning and model refinement. Additionally, analyzing TCO contributors and dominant use cases in the cloud will offer a broader perspective. Stay tuned for future articles on these topics.
