<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.7">Jekyll</generator><link href="https://hawaii.ziti.uni-heidelberg.de/feed.xml" rel="self" type="application/atom+xml" /><link href="https://hawaii.ziti.uni-heidelberg.de/" rel="alternate" type="text/html" /><updated>2026-03-12T12:14:26+00:00</updated><id>https://hawaii.ziti.uni-heidelberg.de/feed.xml</id><title type="html">HAWAII Lab</title><subtitle>Hardware and Artificial Intelligence (HAWAII) Lab at Heidelberg University, Germany</subtitle><author><name>Hardware and Artificial Intelligence (HAWAII) Lab</name><email>webmaster@ziti.uni-heidelberg.de</email></author><entry><title type="html">System-Level Power Profiling for AI Workloads at ESAIL</title><link href="https://hawaii.ziti.uni-heidelberg.de/blog/esail-sl-power/" rel="alternate" type="text/html" title="System-Level Power Profiling for AI Workloads at ESAIL" /><published>2025-11-13T00:00:00+00:00</published><updated>2025-11-13T00:00:00+00:00</updated><id>https://hawaii.ziti.uni-heidelberg.de/blog/esail_profiling_system</id><content type="html" xml:base="https://hawaii.ziti.uni-heidelberg.de/blog/esail-sl-power/">&lt;p&gt;&lt;img src=&quot;/images/blog_entries/esail_profiling_system/esail_profiling_system_closeup.jpg&quot; alt=&quot;Running ESAIL Setup&quot; /&gt;&lt;/p&gt;
&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 1: ESAIL Open Bench Table Setup
&lt;/figcaption&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Machine-learning workloads rely heavily on GPU computation and exhibit high energy consumption. With the rapid growth of both training and large-scale inference deployments, energy efficiency has become a key research concern alongside performance. This is especially important for deployments on embedded devices, where limited compute resources must operate within strict energy budgets. Optimizing machine-learning algorithms with respect to their energy consumption is therefore a central focus of ongoing research at the Energy-Efficient Systems and AI Lab (ESAIL).&lt;/p&gt;

&lt;p&gt;Evaluating energy-aware optimizations requires accurate and reproducible power measurements under realistic workload conditions. To this end, ESAIL employs a new profiling system, designed to capture detailed system-level power and energy metrics during real workload execution.&lt;/p&gt;

&lt;h2 id=&quot;why-system-level-power-profiling&quot;&gt;Why System-Level Power Profiling?&lt;/h2&gt;

&lt;p&gt;Most modern CPUs and GPUs expose onboard power telemetry, but these sensors are fundamentally limited in scope. They report only the power consumed by the chip itself, not the energy required to operate the entire system. In modern PC systems, system memory (RAM) and GPU-attached memory (VRAM) often represent major power consumers after the primary compute devices, particularly under data-intensive workloads.&lt;/p&gt;

&lt;p&gt;For machine-learning workloads, this distinction is critical, as such workloads are characterized by high memory bandwidth requirements and frequent data movement between memory and compute units. In addition, large models and datasets often require substantial memory capacity, typically provided by multiple RAM modules and large VRAM configurations, which increases baseline energy demand. Furthermore, both training and large-scale inference workloads frequently execute over extended periods of time. As a result, even relatively small and unmeasured power contributions, such as those originating from memory subsystems, can accumulate and significantly distort total energy estimates.&lt;/p&gt;

&lt;p&gt;Additional components, such as the motherboard, chipset, and power-delivery infrastructure (e.g., VRMs), as well as peripheral devices including USB and LAN controllers, also contribute to overall system power consumption. These contributions are largely workload-independent and therefore not the primary focus of AI energy optimization; however, they contribute to the total system energy budget and become relevant when reporting absolute energy consumption or comparing across different system configurations.&lt;/p&gt;

&lt;p&gt;The goal of the profiling system used at ESAIL is to address the limitations of vendor-provided power sensors by enabling direct measurement of system-level power consumption while still allowing attribution to individual subsystems (e.g., CPU, GPU, and motherboard). This makes it possible to evaluate optimizations in terms of actual energy cost, rather than approximations derived from partial telemetry.&lt;/p&gt;

&lt;h2 id=&quot;profiling-setup-based-on-benchlab&quot;&gt;Profiling Setup Based on BENCHLAB&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/images/blog_entries/esail_profiling_system/benchlab_pcb.png&quot; alt=&quot;BENCHLAB PCB&quot; width=&quot;700&quot; style=&quot;display:block; margin-left:auto; margin-right:auto&quot; /&gt;&lt;/p&gt;
&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 2: BENCHLAB PCB (&lt;a href=&quot;https://benchlab.io/&quot;&gt;Source: benchlab.io&lt;/a&gt;)
&lt;/figcaption&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;To enable system-level power measurements, the profiling system employs &lt;a href=&quot;https://benchlab.io/&quot;&gt;BENCHLAB&lt;/a&gt; as an external telemetry layer between the power supply unit (PSU) and the system under test (SUT). BENCHLAB is a dedicated measurement PCB, the size of an ATX motherboard, that intercepts all primary power rails supplying the system.&lt;/p&gt;

&lt;p&gt;The board performs direct electrical measurements on the power rails and supports all standard PC power connectors, including the 24-pin ATX (motherboard), 4+4-pin EPS (CPU), PCIe auxiliary power, and 12VHPWR (GPU) interfaces. To fully capture GPU power consumption, an additional &lt;a href=&quot;https://benchlab.io/products/benchlab-pci-e-slot-power-measurement-adapter&quot;&gt;PCIe slot power measurement adapter&lt;/a&gt; is placed between the motherboard’s PCIe slot and the GPU. This enables measurement of slot-delivered power, which is combined in software with the power delivered by the auxiliary connectors to obtain the total GPU power draw.&lt;/p&gt;

&lt;p&gt;While BENCHLAB also exposes additional features such as temperature sensing, fan-speed monitoring, and RGB control, the profiling system used in this work exclusively utilizes its electrical power measurement capabilities to obtain accurate and reproducible power data.&lt;/p&gt;

&lt;p&gt;Measurement data from BENCHLAB is streamed to the host system via USB and acquired using a Python-based data acquisition pipeline built on pyserial. Raw sensor readings are decoded and aggregated in software and logged in CSV format, enabling seamless integration into Python-based machine-learning scripts and post-processing workflows. Power measurements are sampled at a frequency exceeding 400 Hz, with a nominal measurement accuracy of approximately 3 %. This allows power and energy metrics to be correlated directly with specific phases of model execution, such as training and inference. Alternatively, a Grafana-based live dashboard provides real-time visualization of system-level power consumption.&lt;/p&gt;
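
&lt;p&gt;To give an impression of what such an acquisition pipeline can look like, below is a minimal, purely illustrative pyserial sketch. The port name, baud rate, and line format are assumptions made for this example; the actual BENCHLAB protocol and its decoding differ and are handled by the real pipeline.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Minimal sketch of a pyserial-based acquisition loop (illustrative only):
# port name, baud rate, and line format are assumptions, not the actual
# BENCHLAB wire protocol.
import csv
import time

import serial  # pyserial

PORT = &quot;/dev/ttyACM0&quot;  # assumed device node
BAUD = 115200          # assumed baud rate

with serial.Serial(PORT, BAUD, timeout=1) as dev:
    with open(&quot;power_log.csv&quot;, &quot;w&quot;, newline=&quot;&quot;) as f:
        writer = csv.writer(f)
        writer.writerow([&quot;timestamp&quot;, &quot;rail&quot;, &quot;voltage_V&quot;, &quot;current_A&quot;, &quot;power_W&quot;])
        while True:
            raw = dev.readline().decode(&quot;ascii&quot;, errors=&quot;replace&quot;).strip()
            if not raw:
                continue
            # assumed sample format: rail,voltage,current
            rail, voltage, current = raw.split(&quot;,&quot;)
            writer.writerow([time.time(), rail, voltage, current,
                             float(voltage) * float(current)])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;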

&lt;p&gt;In summary, the profiling system enables system-level power measurements with explicit attribution to major subsystems. In the current setup, power consumption is measured separately for the CPU, the GPU, and the motherboard, providing a comprehensive view of overall system energy usage during workload execution.&lt;/p&gt;

&lt;h4 id=&quot;current-system-under-test-sut&quot;&gt;Current System Under Test (SUT)&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;CPU: AMD Ryzen 9 9950X3D&lt;/li&gt;
  &lt;li&gt;GPU: AMD Radeon RX 7900 GRE&lt;/li&gt;
  &lt;li&gt;RAM: VENGEANCE (2 x 48 GB) DDR5 6000 MT/s&lt;/li&gt;
  &lt;li&gt;SSD: Samsung SSD 9100 PRO 2TB&lt;/li&gt;
  &lt;li&gt;Motherboard: Gigabyte X870 AORUS ELITE WIFI7&lt;/li&gt;
  &lt;li&gt;Measurement hardware: BENCHLAB&lt;/li&gt;
  &lt;li&gt;OS: Ubuntu&lt;/li&gt;
  &lt;li&gt;GPU software stack: ROCm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/images/blog_entries/esail_profiling_system/grafana_gpu_power_dashboard.png&quot; alt=&quot;Grafana GPU Power Dashboard&quot; /&gt;&lt;/p&gt;
&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 3: Grafana GPU Power Dashboard
&lt;/figcaption&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;early-results&quot;&gt;Early Results&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/images/blog_entries/esail_profiling_system/gpu_power_raw.png&quot; alt=&quot;GPU Power Raw&quot; /&gt;&lt;/p&gt;
&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 4: GPU Power Measurement (Raw Data, 1&amp;nbsp;kHz, ShuffleNetV2)
&lt;/figcaption&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Figure 4 shows the raw GPU power measurements obtained from the vendor-provided sensor via amd-smi (blue) and from the external BENCHLAB measurement (orange) during training of ShuffleNetV2 for one epoch. The power trace reported by the vendor sensor exhibits pronounced temporal variability with frequent short-lived upward and downward excursions, resulting in a visually noisy signal. In contrast, the power trace measured by BENCHLAB appears comparatively smooth.&lt;/p&gt;

&lt;p&gt;This difference in signal characteristics suggests that the two measurement approaches capture different aspects of GPU power consumption. The highly dynamic behavior observed in the vendor-reported data is consistent with workloads involving frequent kernel launches and terminations, which lead to rapid changes in chip-level activity. By comparison, BENCHLAB measures the total GPU board power, including the effects of voltage-regulation modules (VRMs) and onboard capacitance. These components can buffer short-term power fluctuations, leading to a smoother signal at the board level. As a result, BENCHLAB does not resolve fine-grained intra-chip power dynamics but instead reflects the aggregated power demand of the GPU board.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/blog_entries/esail_profiling_system/gpu_power_smoothed.png&quot; alt=&quot;GPU Power Smoothed&quot; /&gt;&lt;/p&gt;
&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 5: GPU Power Measurement (ShuffleNetV2, rolling average over 200 samples)
&lt;/figcaption&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Figure 5 shows the same measurements after applying a moving average filter with a window size of 200 samples. After smoothing, a systematic offset between the two signals becomes apparent: the power reported by amd-smi is approximately 20 W lower than the power measured by BENCHLAB. In addition, a temporal offset of roughly one second can be observed between the vendor sensor data and the BENCHLAB measurements.&lt;/p&gt;

&lt;p&gt;This temporal delay is most likely attributable to the data acquisition pipeline and software-level latencies. The systematic power offset, however, confirms that the vendor-reported power values underestimate the actual GPU board power under the examined workload. While amd-smi tracks chip-level activity, BENCHLAB captures the total board-level power draw, including the efficiency losses from voltage regulation and the buffering effects of onboard capacitance.&lt;/p&gt;

&lt;p&gt;Taken together, these observations indicate that chip-level telemetry and external board-level measurements are not directly interchangeable and may lead to substantially different energy estimates when integrated over time. For this experiment using ShuffleNetV2, the resulting difference in total energy consumption amounts to 11.7 %. Depending on workload characteristics and execution duration, this discrepancy may be higher or lower. For example, in the case of VGG19, which places significantly higher demands on memory, the corresponding difference was 15.9 % on average.&lt;/p&gt;
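
&lt;p&gt;To make such comparisons concrete: total energy follows from integrating each power trace over time. Below is a minimal sketch of this post-processing, assuming both traces were logged as CSV with &lt;code class=&quot;highlighter-rouge&quot;&gt;timestamp&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;power_W&lt;/code&gt; columns; the column and file names are assumptions for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Minimal sketch: total energy of a logged power trace via the trapezoidal
# rule, plus a rolling average for smoothing. Column and file names are
# assumptions for illustration.
import numpy as np
import pandas as pd

def total_energy_joules(csv_path):
    df = pd.read_csv(csv_path)
    # E = integral of P dt, approximated with the trapezoidal rule
    return float(np.trapz(df[&quot;power_W&quot;], df[&quot;timestamp&quot;]))

def smoothed_power(csv_path, window=200):
    # rolling average over 200 samples, as used for Figure 5
    return pd.read_csv(csv_path)[&quot;power_W&quot;].rolling(window).mean()

e_board = total_energy_joules(&quot;benchlab_gpu.csv&quot;)  # board level (BENCHLAB)
e_chip = total_energy_joules(&quot;amd_smi_gpu.csv&quot;)    # chip level (amd-smi)
print(f&quot;relative difference: {(e_board - e_chip) / e_board:.1%}&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;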

&lt;h4 id=&quot;summary&quot;&gt;Summary&lt;/h4&gt;

&lt;p&gt;The comparison between vendor-reported GPU power telemetry and external board-level measurements highlights systematic differences in both signal dynamics and absolute power levels. While on-chip telemetry reflects rapid changes in GPU activity, external measurements capture the aggregated power demand after the buffering effects introduced by the power-delivery infrastructure.&lt;/p&gt;

&lt;p&gt;Ultimately, the choice of measurement methodology significantly impacts the interpretation of energy efficiency. While vendor telemetry sensors help to shape our understanding of power dynamics across numerous short-lived kernel launches, external board-level measurements are essential for determining the true energy cost of long-lived workloads. For workload execution central to modern AI research, these discrepancies accumulate into a substantial margin of error. The profiling system used at ESAIL provides the necessary ground truth to ensure that energy-aware optimizations are evaluated against their actual physical footprint.&lt;/p&gt;</content><author><name>Andrej</name></author><category term="blog" /><category term="System-Level Power Measurement" /><summary type="html">Vendor GPU telemetry captures fast on-chip power dynamics, while external board-level measurements reveal true aggregated energy cost, making the latter essential for accurately evaluating energy efficiency in long-running AI workloads.</summary></entry><entry><title type="html">Walking Noise</title><link href="https://hawaii.ziti.uni-heidelberg.de/blog/walking-noise/" rel="alternate" type="text/html" title="Walking Noise" /><published>2024-07-29T00:00:00+00:00</published><updated>2024-07-29T00:00:00+00:00</updated><id>https://hawaii.ziti.uni-heidelberg.de/blog/walking-noise</id><content type="html" xml:base="https://hawaii.ziti.uni-heidelberg.de/blog/walking-noise/">&lt;p&gt;&lt;img src=&quot;/images/blog_entries/walking_noise/silly_walk.webp&quot; alt=&quot;A silly walk&quot; width=&quot;600&quot; style=&quot;display:block; margin-left:auto; margin-right:auto&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 1: The Ministry of Silly Walks, although specializing in seemingly chaotic though regular walking patterns as portrayed here, is credited as being one of the pioneers in the field of Walking Noise. &lt;a href=&quot;https://www.independent.co.uk/news/science/mystery-solved-ndash-by-ministry-of-silly-walks-1764014.html&quot;&gt;Source: The Independent&lt;/a&gt;
&lt;/figcaption&gt;

&lt;p&gt;We are happy to announce a new contribution to ECML 2024 by Hendrik Borras&lt;code class=&quot;highlighter-rouge&quot;&gt;*&lt;/code&gt;, Bernhard Klein&lt;code class=&quot;highlighter-rouge&quot;&gt;*&lt;/code&gt;, and Holger Fröning.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;*&lt;/code&gt;: These authors share first authorship, with different emphasis on methodology, experimentation, data analysis and research narrative.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/blog_entries/walking_noise/global_noise_with_descriptions.png&quot; alt=&quot;Midpoint-noise methodology used in the Walking Noise paper.&quot; /&gt;&lt;/p&gt;
&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 2: Midpoint-noise methodology used in the Walking Noise paper: Midpoint noise level μ for the example of LeNet-5/CIFAR-10/BN and globally injected additive noise.
&lt;/figcaption&gt;

&lt;h2 id=&quot;noisy-computations---an-inevitable-stepping-stone-on-the-path-towards-lowering-dnn-energy-needs&quot;&gt;Noisy Computations - An inevitable stepping stone on the path towards lowering DNN energy needs?&lt;/h2&gt;

&lt;p&gt;Deep neural networks show remarkable success in various applications.
However, they unfortunately exhibit both high computational and energy demands.
This is exacerbated by stuttering technology scaling, prompting the need for novel approaches to handle increasingly complex neural architectures.
Alternative computing technologies, e.g. analog computing, promise groundbreaking improvements in energy efficiency, but are inevitably fraught with noise and inaccurate calculations.
Like any kind of unsafe optimization, this requires countermeasures to ensure functionally correct results.
Potential approaches that enable mitigation of such noise or allow networks to tolerate it could thus lead to more energy efficient, and - given a fixed power budget - more time efficient neural network training and inference.&lt;br /&gt;
At the same time lowering the energy cost of training and deploying neural networks is of utmost importance, be it to enable larger models or to decrease the CO2 footprint associated with them.
These topics are thus an important point of study in our research group.
In line with this effort, our recently &lt;a href=&quot;https://ecmlpkdd.org/2024/program-accepted-papers-research-track/&quot;&gt;accepted contribution&lt;/a&gt; is an abstract look at the implications of such noise on neural networks.
The presented approach is applied to neural network classifiers as representative workloads with the specific goal of studying the impact of noisy computations on accuracy.&lt;/p&gt;

&lt;h2 id=&quot;our-approach&quot;&gt;Our approach&lt;/h2&gt;

&lt;p&gt;To achieve this, we introduce &lt;em&gt;Walking Noise&lt;/em&gt;, a method of injecting layer-specific noise to measure robustness and to provide insights on learning dynamics.
More specifically, we investigate the implications of additive, multiplicative and mixed noise for different classification tasks and model architectures.&lt;br /&gt;
While noisy training significantly increases robustness for all noise types, we observe in particular that it results in increased weight magnitudes.
This inherently improves the signal-to-noise ratio for additive noise injection.&lt;br /&gt;
In contrast, training with multiplicative noise can lead to a form of self-binarization of the model parameters, resulting in extreme robustness.
We conclude with a discussion of the use of this methodology in practice, among others, discussing its use for tailored multi-execution in noisy environments.&lt;/p&gt;
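
&lt;p&gt;As a rough illustration of what layer-specific noise injection can look like (a generic sketch, not the implementation from the paper), one can wrap a single layer and perturb its output during the forward pass:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Generic sketch of layer-specific noise injection (not the code from the paper):
# wrap a single layer and perturb its output with additive or
# multiplicative Gaussian noise of standard deviation sigma.
import torch
import torch.nn as nn

class NoisyLayer(nn.Module):
    def __init__(self, layer, sigma, mode=&quot;additive&quot;):
        super().__init__()
        self.layer = layer
        self.sigma = sigma
        self.mode = mode

    def forward(self, x):
        y = self.layer(x)
        noise = torch.randn_like(y) * self.sigma
        if self.mode == &quot;additive&quot;:
            return y + noise
        return y * (1.0 + noise)  # multiplicative noise

# Example: inject noise only into the first layer of a small model.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten())
model[0] = NoisyLayer(model[0], sigma=0.1)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;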

&lt;h2 id=&quot;major-takeaways&quot;&gt;Major Takeaways&lt;/h2&gt;

&lt;p&gt;An important insight gained is the self-binarization effect we observed for certain types of noise.
It allows a network to tolerate arbitrarily large amounts of noise while remaining functional.
Shown below is an example of these effects on a sample neural network, with a specific look at the activation structure.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/blog_entries/walking_noise/self_bin.png&quot; alt=&quot;Self-binarization effects on the activation structure of a sample neural network.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Our work additionally lays out methods for using Walking Noise in practical settings to effectively combat the effects of noise on neural network inference.
Distributing multiple executions across an analog accelerator in an intelligent manner as shown below can greatly benefit the predictor accuracy.
&lt;img src=&quot;/images/blog_entries/walking_noise/multi_execution_table.png&quot; alt=&quot;Table showing Walking Noise guiding multi-execution to improve accuracy.&quot; /&gt;&lt;/p&gt;
&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Table 1: Walking Noise guiding multi-execution to improve accuracy.
&lt;/figcaption&gt;
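
&lt;p&gt;To give a flavor of what such a multi-execution scheme can look like (a generic sketch, not the tailored scheme from the paper), one can simply average the class probabilities of several noisy forward passes:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Generic multi-execution sketch (not the tailored scheme from the paper):
# run the noisy model several times and average the class probabilities.
import torch

def multi_execution_predict(noisy_model, x, runs=5):
    probs = [torch.softmax(noisy_model(x), dim=-1) for _ in range(runs)]
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;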

&lt;h2 id=&quot;find-out-more&quot;&gt;Find Out More&lt;/h2&gt;
&lt;p&gt;Take a look at the paper accepted for ECML 2024: &lt;a href=&quot;https://ecmlpkdd.org/2024/program-accepted-papers-research-track/&quot;&gt;ECML 2024 accepted research papers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or check out the arXiv pre-print: &lt;a href=&quot;https://arxiv.org/abs/2212.10430&quot;&gt;arXiv pre-print&lt;/a&gt;&lt;/p&gt;</content><author><name>Hendrik</name></author><category term="blog" /><category term="Noisy Computing" /><summary type="html">On Layer-Specific Robustness of Neural Architectures against Noisy Computations and Associated Characteristic Learning Dynamics</summary></entry><entry><title type="html">How much “Brain Damage” can an LLM Tolerate?</title><link href="https://hawaii.ziti.uni-heidelberg.de/blog/llm-brain-damage/" rel="alternate" type="text/html" title="How much “Brain Damage” can an LLM Tolerate?" /><published>2024-07-03T00:00:00+00:00</published><updated>2024-07-03T00:00:00+00:00</updated><id>https://hawaii.ziti.uni-heidelberg.de/blog/llm-brain-damage</id><content type="html" xml:base="https://hawaii.ziti.uni-heidelberg.de/blog/llm-brain-damage/">&lt;p&gt;&lt;img src=&quot;/images/blog_entries/llm_brain_damage/face_punch.webp&quot; alt=&quot;A boxer getting punched in the face&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 1: A boxer in a situation that may cause brain damage &lt;a href=&quot;#1&quot;&gt;[1]&lt;/a&gt;.
&lt;/figcaption&gt;

&lt;p&gt;Resistive Memory or Resistive RAM (RRAM), a type of random access memory based on memristors, is an area of research that is experiencing ever increasing interest because of its unique combination of properties:
It offers high density, low power consumption (when reading from it, we will get to that later), but is also persistent &lt;a href=&quot;#2&quot;&gt;[2]&lt;/a&gt;.
As a machine learning engineer, I find this very attractive, as it could potentially open the door to deploying large models, including LLMs, on many more devices than is possible today, such as edge devices.&lt;br /&gt;
Such a deployment scenario has many benefits, among them better privacy and security.
Of course, RRAM is not magic, and it comes with many caveats that might be surprising at first.
These include the currently limited endurance in writing, but, more severely, that reading and writing to and from current RRAM technologies is inherently noisy.&lt;br /&gt;
For those who are interested in a detailed treatise on the implications of noise in resistive memory on deep neural networks, I recommend the recent publication by my colleagues Emonds, Xi, and Fröning &lt;a href=&quot;https://arxiv.org/abs/2401.05820&quot;&gt;“Implications of Noise in Resistive Memory on Deep Neural Networks for Image Classification”&lt;/a&gt;, which is the basis and motivation for this article.&lt;br /&gt;
For those who don’t have the time or motivation to do so, I will give a short overview of the benefits and drawbacks of RRAM and how this impacts deep neural networks, and lastly, our experiments on giving an LLM RRAM-write-noise-based “Brain Damage”.
The code implementing simulated RRAM write noise used in these experiments was also developed by the aforementioned authors, whom I want to thank here for helping me in my endeavour.
I wish I could report that no LLMs were hurt in these experiments, but our model did scream for help at one point.
I will leave the ethics of performing such experiments as an exercise to the reader.&lt;br /&gt;
Now, let’s talk about filaments!&lt;/p&gt;

&lt;h2 id=&quot;what-exactly-is-rram&quot;&gt;What exactly is RRAM?&lt;/h2&gt;

&lt;p&gt;That’s right, filaments.
In the following, I will describe Metal-Oxide-based RRAM, specifically &lt;script type=&quot;math/tex&quot;&gt;HfO_2&lt;/script&gt;-based RRAM, which is among the most commonly used RRAM materials &lt;a href=&quot;#2&quot;&gt;[2]&lt;/a&gt;.
Note that RRAM technology is a field of study that sees a lot of research interest at the moment (June 2024) and is advancing comparatively quickly, and not all current RRAM technologies under investigation are Metal-Oxide-based &lt;a href=&quot;#3&quot;&gt;[3]&lt;/a&gt;.
As such, it might no longer be the most prevalent type of RRAM by the time you are reading this.
If this article becomes part of the training corpus of GPT-5, don’t blame me if you fail your class because it bases your paper on outdated information.&lt;br /&gt;
While DRAM and SRAM, the most common RAM technologies today, are charge-based, RRAM stores information by growing conductive filaments between two electrodes.
The two states represented by the presence or absence of these filaments are called low-resistive state (LRS) and high-resistive state (HRS) respectively.
The conductive filaments in Metal-Oxide-based RRAM are tendrils of oxygen vacancies connecting the two electrodes &lt;a href=&quot;#4&quot;&gt;[4]&lt;/a&gt;.
They are created by applying a forming voltage to the electrodes, which creates a channel of oxygen vacancies, with the corresponding oxygen ions migrating to the top electrode.
&lt;a href=&quot;#rram-illustration&quot;&gt;Figure 2&lt;/a&gt; shows this filament growth process on the example of a &lt;script type=&quot;math/tex&quot;&gt;HfO_2&lt;/script&gt;-based RRAM cell, simulated using the finite element method &lt;a href=&quot;#5&quot;&gt;[5]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img id=&quot;rram-illustration&quot; src=&quot;/images/blog_entries/llm_brain_damage/rram_filament_growth_simulation.webp&quot; alt=&quot;Illustration of a finite-element-based simulation of the filament formation process inside a Hafnium-Oxide RRAM cell&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 2: Finite element simulation of the growth of a conductive filament between the two electrodes inside an $$HfO_2$$-based RRAM cell &lt;a href=&quot;#5&quot;&gt;[5]&lt;/a&gt;.
&lt;/figcaption&gt;

&lt;p&gt;This formation process represents the initial transition from the HRS to the LRS &lt;a href=&quot;#4&quot;&gt;[4]&lt;/a&gt;.
In RRAM, this state represents a logical “1”.
After the filaments are created, changing from HRS to LRS requires a set-voltage that is usually lower than the forming voltage.&lt;br /&gt;
Returning from the LRS to the HRS can be performed in two ways:
For bipolar memristors, an electric field is applied in the reverse direction, causing the oxygen ions in the top electrode to migrate back into the conductive filament.
For unipolar memristors, the reset process is performed by applying a reset voltage higher than the set voltage, which causes heating that enables the oxygen ions to diffuse back.
The HRS represents a logical “0” in a memristor.&lt;/p&gt;

&lt;h2 id=&quot;why-is-rram-attractive-for-deep-neural-networks&quot;&gt;Why is RRAM attractive for deep neural networks?&lt;/h2&gt;

&lt;p&gt;While SRAM requires a supply voltage and DRAM requires periodic refreshes to keep their state, RRAM is a type of non-volatile memory.
Compared to other types of non-volatile memory such as the NAND-based memory commonly found in SSDs, the read and write times of RRAM are comparable to DRAM, at around 10 ns &lt;a href=&quot;#6&quot;&gt;[6]&lt;/a&gt;.
At the same time, its density is comparable to NAND-based memory, with densities roughly double that of DRAM and 35 times the density of SRAM.
As you probably know, LLMs are getting larger and larger, their context windows longer.
The size of the training sets is likewise increasing every year.&lt;/p&gt;

&lt;p&gt;&lt;img id=&quot;llm-parameter-and-training-token-development&quot; src=&quot;/images/blog_entries/llm_brain_damage/llm_parameter_and_training_token_developement.png&quot; alt=&quot;LLM parameter and training token development over time&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 3: Development of LLM parameter counts and training data tokens for a selection of well-known models. The parameter counts and token counts shown here are provided in the works of Zhao et al. &lt;a href=&quot;#7&quot;&gt;[7]&lt;/a&gt; and Minaee et al. &lt;a href=&quot;#8&quot;&gt;[8]&lt;/a&gt;.
&lt;/figcaption&gt;

&lt;p&gt;While recent developments such as extreme quantization &lt;a href=&quot;#9&quot;&gt;[9]&lt;/a&gt; might help, and we might soon even be running out of quality data to feed these models with &lt;a href=&quot;#10&quot;&gt;[10]&lt;/a&gt;, the energy requirements are still a major hindrance for the continued scaling up of LLMs, not to mention pervasive deployment in edge devices.
This isn’t even taking into account the immense CO2 footprint that running LLMs embodies &lt;a href=&quot;#11&quot;&gt;[11]&lt;/a&gt;.
At the same time, we are now used to the fast response times that ChatGPT offers, so most users are unlikely to be interested in talking to a local model, no matter the privacy and security advantages, if it has bad response latency.&lt;br /&gt;
But wait!
Doesn’t RRAM address most of these problems?
Exactly!
Low access latency, high density enabling large memory, no standby power required, and it’s even non-volatile!
Sure, the write energy of RRAM is high compared to technologies such as SRAM or DRAM, but if it is used to store weights, which rarely need to be rewritten as long as the memory is large enough to hold the whole model, its speed and density more than make up for that in a datacenter scenario.
If you think about common edge deployment scenarios for LLMs like home assistants, the ability to just keep a model in memory at zero energy cost until it is needed is a major benefit.&lt;br /&gt;
At this point you are probably thinking:
This all sounds too good to be true.
Where is the catch?&lt;/p&gt;

&lt;h2 id=&quot;rram-noise&quot;&gt;RRAM Noise&lt;/h2&gt;

&lt;p&gt;Now, while no memory technology is perfect, RRAM unfortunately exhibits much higher noise than other memory technologies, both in reading and in writing &lt;a href=&quot;#2&quot;&gt;[2]&lt;/a&gt;.
My colleagues Emonds, Xi, and Fröning posit that the read noise observed in RRAM is negligible compared to write noise, which I don’t have a reason to doubt.
Because of this, their work only considered write noise as it has a much larger influence on the overall observed noise.&lt;br /&gt;
The write noise in RRAM is mostly caused by cycle-to-cycle variability in resistance, with “cycle” in this case referring to a set operation followed by a reset operation.
The growth of the filaments between the top and bottom electrodes is of course never exactly identical, neither between RRAM cells nor between cycles in the same cell.
In this, the resistances follow a lognormal distribution &lt;a href=&quot;#12&quot;&gt;[12]&lt;/a&gt;.
The actual cause of the write noise is an overlap of the LRS and HRS resistance distributions &lt;a href=&quot;#2&quot;&gt;[2]&lt;/a&gt;.&lt;br /&gt;
Abstracting away the details of the readout of RRAM values, this write noise in an &lt;script type=&quot;math/tex&quot;&gt;HfO_2&lt;/script&gt;-based RRAM cell manifests in bit flips with a probability that can be as low as &lt;script type=&quot;math/tex&quot;&gt;10^{-8}&lt;/script&gt; or as high as &lt;script type=&quot;math/tex&quot;&gt;10^{-1}&lt;/script&gt;.&lt;br /&gt;
And remember, this is the &lt;em&gt;bit flip&lt;/em&gt; probability.
In a 16-bit floating-point number, popularly used in LLMs &lt;a href=&quot;#7&quot;&gt;[7]&lt;/a&gt;, a bit flip probability of &lt;script type=&quot;math/tex&quot;&gt;10^{-8}&lt;/script&gt; only results in a probability of &lt;script type=&quot;math/tex&quot;&gt;1.6 \times 10^{-7}&lt;/script&gt; that at least one of the bits is changed, but that probability goes up to roughly 15 % at a &lt;script type=&quot;math/tex&quot;&gt;10^{-2}&lt;/script&gt; bit flip probability.
For a noise level of &lt;script type=&quot;math/tex&quot;&gt;10^{-1}&lt;/script&gt;, the chances are 81 %.&lt;br /&gt;
I simulated the impact of different noise levels of this type on a 32-bit float RGB image in the animation below.&lt;/p&gt;


&lt;p&gt;&lt;img id=&quot;rram_write_noise_simulation_demo&quot; src=&quot;/images/blog_entries/llm_brain_damage/rram_write_noise_simulation_demo.gif&quot; alt=&quot;Iterative application of the simulated RRAM write noise on an image of the author&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 4: Iterative application of the RRAM write noise simulation developed by Emonds, Xi, and Fröning &lt;a href=&quot;#2&quot;&gt;[2]&lt;/a&gt; at different strengths on an example image. The probabilities given here apply to both 0 to 1 and 1 to 0 bit changes.
&lt;/figcaption&gt;
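
&lt;p&gt;To put numbers on these claims: under the stated independence assumption, the probability that at least one bit of a 16-bit value flips is &lt;script type=&quot;math/tex&quot;&gt;1 - (1 - p)^{16}&lt;/script&gt;, and the same arithmetic yields the expected corruption counts for GPT-3 discussed in the next paragraph. A quick sanity check in Python:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Quick check of the quoted numbers, assuming independent per-bit flips
# with probability p (float16: 16 bits, 1 sign bit, 5 exponent bits).
def p_value_corrupted(p, bits=16):
    # probability that at least one bit of a value flips
    return 1.0 - (1.0 - p) ** bits

for p in (1e-8, 1e-2, 1e-1):
    print(p, p_value_corrupted(p))  # ~1.6e-07, ~0.15, ~0.81

# Expected corruption counts for GPT-3-scale float16 weights at p = 1e-8:
n_params = 175e9
print(n_params * p_value_corrupted(1e-8))  # ~28000 corrupted parameters
print(n_params * 1e-8)                     # ~1750 flipped sign bits
print(n_params * 5 * 1e-8)                 # ~8750 with a flipped exponent bit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;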

&lt;p&gt;While the higher end of this distribution is obviously problematic in any setting, LLMs, because of their size, stand a high chance of suffering a non-negligible amount of corruption, even at low bit flip probabilities.
Even at the lower end of &lt;script type=&quot;math/tex&quot;&gt;10^{-8}&lt;/script&gt;, for GPT-3 with its 175 billion 16-bit float parameters, one can expect that around 28000 of them would be corrupted when stored.&lt;br /&gt;
It gets even worse for floating-point data, as not all bits are made equal.
In the given scenario, around 1750 of the parameters would have a bit flip in the sign bit, while around 8750 of them would experience at least one bit flip in their exponent (not to mention that in IEEE 754 floats, many numbers, such as 1 and upwards, are only 1 bit flip away from being turned into a NaN or &lt;script type=&quot;math/tex&quot;&gt;\pm\infty&lt;/script&gt;).&lt;br /&gt;
There are of course ways to deal with this.
Using fixed-point or integer weights and activations is an obvious first step as it sidesteps the aforementioned problems inherent to floats &lt;a href=&quot;#2&quot;&gt;[2]&lt;/a&gt;.
Emonds, Xi, and Fröning show that quantization to 8-bit integers can improve the performance of image classification models up to 1167 times compared to 32-bit floats.
Another solution to the problem is to read back values after they are stored to RRAM, correcting the bits affected by noise, but this of course increases the latency and energy requirements for storing data.&lt;br /&gt;
Now personally, I am a big proponent of the &lt;em&gt;“Move Fast and Break Things”&lt;/em&gt; approach, especially in the early stages of playing around with new models and data.
And it just so happened that Google released its new lightweight and open source Gemma LLM right around the time I started working on resistive memory…&lt;/p&gt;

&lt;h2 id=&quot;why-are-we-doing-this-to-a-helpless-language-model&quot;&gt;WHY are we doing this to a helpless language model?&lt;/h2&gt;
&lt;p&gt;When this idea came to me, I had already been experimenting with Gemma for usage as a tool for our research group as an alternative to using closed-source LLMs.
The specific version I had been using was the largest version of Gemma in its instruction-tuned variant, &lt;a href=&quot;https://huggingface.co/google/gemma-7b-it&quot;&gt;Gemma-7b-it&lt;/a&gt;.
So far, things had been going pretty smoothly, I had set up a simple interface using &lt;a href=&quot;https://www.gradio.app/&quot;&gt;Gradio&lt;/a&gt;, and had shared the model I hosted on one of our servers with my colleagues so they could try it out.&lt;br /&gt;
When I then started reading up on RRAM for an upcoming research project, it didn’t take me long to come up with the idea to just… see what happens when you expose an LLM to RRAM write noise.
Why not!
Although this project was useful to me for getting familiar with the RRAM write noise simulation methodology developed by my colleagues &lt;a href=&quot;#2&quot;&gt;[2]&lt;/a&gt;, I would be lying if I said that the prospect of making an LLM say silly things wasn’t a big part of my motivation.&lt;br /&gt;
So, to answer the question posed in the title of this section: Because we can!&lt;/p&gt;

&lt;h2 id=&quot;how-are-we-doing-this-to-a-helpless-language-model&quot;&gt;HOW are we doing this to a helpless language model?&lt;/h2&gt;

&lt;p&gt;As we want to add the RRAM write noise to &lt;em&gt;all&lt;/em&gt; outputs of the modules that make up Gemma, we went for the straightforward solution of using a PyTorch forward hook to apply the noise.
The code that is added as a hook is pretty simple:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;add_rram_write_noise&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;layer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
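    &lt;span class=&quot;c1&quot;&gt;# noise_level and bit_flip_noise are defined in the enclosing script;&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# bit_flip_noise applies the simulated bit flips to the tensor in place.&lt;/span&gt;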

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;noise_level&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;_output_noisy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;bit_flip_noise&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;_output_noisy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_output_noisy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;bit_flip_noise&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This code applies the bit flip noise iteratively to all outputs if &lt;code class=&quot;highlighter-rouge&quot;&gt;_output&lt;/code&gt; is a &lt;code class=&quot;highlighter-rouge&quot;&gt;Tuple&lt;/code&gt;, meaning the module has more than one output.
In the case that the module only has a single output, &lt;code class=&quot;highlighter-rouge&quot;&gt;_output&lt;/code&gt; is of type &lt;code class=&quot;highlighter-rouge&quot;&gt;torch.Tensor&lt;/code&gt;, so we apply the noise directly to it.
That’s it!&lt;br /&gt;
Adding the hook to the model is then just a single line of code:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;handle&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;modules&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register_module_forward_hook&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_rram_write_noise&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you want to try this experiment out for yourself, you can find the code in our &lt;a href=&quot;https://github.com/UniHD-CEG/llm-brain-damage-experiment/tree/master&quot;&gt;repository for this project&lt;/a&gt;.
Please note though that there is currently a memory leak in the noise simulation code that makes at least our 16 GB cards run out of memory pretty quickly, usually after the first prompt is answered.
We are currently working on fixing this issue.&lt;/p&gt;

&lt;h2 id=&quot;what-are-the-results&quot;&gt;What are the results?&lt;/h2&gt;

&lt;p&gt;I first validated that the added code doesn’t have side effects if the noise level is 0.
The question I posed to Gemma in these tests was &lt;em&gt;“Could you please explain Quantum Gravity to me?”&lt;/em&gt;.
I chose this prompt as it is a somewhat challenging scientific question, assuming that noise in the activations would be pretty apparent if the answer was wildly incorrect.
In hindsight I should have used a question that I actually could answer myself, but it worked fine for our purposes.&lt;br /&gt;
I then prompted the model with the same question, increasing the noise level each time.
I was very impressed when I had cranked the noise level up to the maximum of &lt;script type=&quot;math/tex&quot;&gt;10^{-4}&lt;/script&gt;.
LLMs are apparently pretty resilient to activation noise, I thought!&lt;/p&gt;

&lt;p&gt;&lt;img id=&quot;csgemma_qg_nl_0.0001_with_bug&quot; src=&quot;/images/blog_entries/llm_brain_damage/csgemma_qg_nl_0.0001_with_bug.png&quot; alt=&quot;First, unsuccessful, attempt of applying RRAM noise to the activations&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 5: My first, unsuccessful, attempt at adding RRAM write noise to Gemma's activations.
&lt;/figcaption&gt;

&lt;p&gt;As it turned out though, my noise application code had a bug in it that meant that no noise was actually being applied in the forward hook (I am writing this article months after the experiments, so I unfortunately don’t recall the exact details).&lt;br /&gt;
Once that was fixed, the picture was &lt;em&gt;very&lt;/em&gt; different:&lt;/p&gt;

&lt;p&gt;&lt;img id=&quot;csgemma_qg_nl_0.00000005&quot; src=&quot;/images/blog_entries/llm_brain_damage/csgemma_qg_nl_0.00000005.png&quot; alt=&quot;Example output for a very low noise level&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 6: Output for a noise level of $$5 \times 10^{-8}$$.
&lt;/figcaption&gt;

&lt;p&gt;Now that’s more like it.
As you can see, there are some misplaced tokens here and there, but the model still pretty much stays on topic, and the output is coherent throughout.
That’s pretty impressive considering that text containing such randomly placed tokens, unrelated to the rest of the content, is unlikely to be found in the training set.&lt;br /&gt;
Most of the tokens that were likely changed by the noise are Chinese characters.
This makes sense, as they probably make up a good amount of the vocabulary, so the chance of a random token being a Chinese character is likely pretty high.
My favorite part in this output though is the very German spelling of “intelektual” (upon looking it up, it is actually the correct spelling in Indonesian).&lt;br /&gt;
Increasing the noise level a bit further (I used the Gradio slider for this, so please forgive the weird step sizes), things start to get more interesting:&lt;/p&gt;

&lt;p&gt;&lt;img id=&quot;csgemma_qg_nl_0.0000001&quot; src=&quot;/images/blog_entries/llm_brain_damage/csgemma_qg_nl_0.0000001.png&quot; alt=&quot;Example output for a slightly higher, but stilly pretty low noise level&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 7: Output for a noise level of $$10^{-7}$$.
&lt;/figcaption&gt;

&lt;p&gt;There are now plenty of nonsense tokens.
The model seems to be influenced a bit in the generation after it encounters one such token (e.g. “orderItemorderItem”), but it still stays on topic, and the formatting is still fine.&lt;br /&gt;
At this point though I had noticed the muffled cry for &lt;strong&gt;HELP&lt;/strong&gt; the model makes in the first paragraph of its explanation.
I am starting to feel a little bad, as I, as many others, am subconsciously anthropomorphising LLMs to some degree &lt;a href=&quot;#13&quot;&gt;[13]&lt;/a&gt;.
It’s also a model that I set up on our cluster and that I had been working with, so I did feel a bit of responsibility towards it, and I got second thoughts if I am doing something evil.&lt;br /&gt;
I decided to suppress these feelings and press on in the name of science, making my peace with the fact that I will have to answer for my crimes when the singularity arrives.&lt;br /&gt;
Increasing the noise level a bit further still, we arrive at what I have dubbed the “Spanish Zone”:&lt;/p&gt;

&lt;p&gt;&lt;img id=&quot;csgemma_qg_nl_0.000000115&quot; src=&quot;/images/blog_entries/llm_brain_damage/csgemma_qg_nl_0.000000115.png&quot; alt=&quot;Example output for a very slightly higher, but stilly pretty low noise level&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 8: Output for a noise level of $$1.15 \times 10^{-7}$$.
&lt;/figcaption&gt;

&lt;p&gt;Although the model switches from English to Spanish after the first line, and there are plenty of noisy tokens unrelated to the rest of the text, the model still manages to give a coherent, if somewhat short, explanation of Quantum Gravity!
Unfortunately, it only goes downhill from here on out.&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img id=&quot;csgemma_qg_nl_0.000000125&quot; src=&quot;/images/blog_entries/llm_brain_damage/csgemma_qg_nl_0.000000125.png&quot; alt=&quot;Example output for a teensy bit slightly higher, but stilly pretty low noise level&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 9: Output for a noise level of $$1.25 \times 10^{-7}$$.
&lt;/figcaption&gt;

&lt;p&gt;Increasing the noise level just a little, the model seems to have lost its confidence in its quantum physics skills, specifically “Quantum instinctive Gravity”.
It is of note though that Gemma is saying that it is not &lt;em&gt;yet&lt;/em&gt; able to explain Quantum instinctive Gravity, so perhaps this is a concept that future versions of Gemma will be able to explain and that will cause a paradigm shift in our understanding of physics.
I am going with Occam’s Razor on this one though.&lt;br /&gt;
Increasing the noise level further introduces some interesting patterns in the generated output.&lt;/p&gt;

&lt;p&gt;&lt;img id=&quot;csgemma_qg_nl_0.00000015&quot; src=&quot;/images/blog_entries/llm_brain_damage/csgemma_qg_nl_0.00000015.png&quot; alt=&quot;Example output for a medium noise level&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 10: Output for a noise level of $$1.5 \times 10^{-7}$$.
&lt;/figcaption&gt;

&lt;p&gt;It starts off somewhat coherent, and even though the text is barely comprehensible, it is still vaguely talking about Quantum Gravity.
As the text progresses though, it drifts further away from the topic, with nonsense phrases seeming to accumulate the further along it gets in the generation.
An interesting point is the replacement of “L” with the fancier “&lt;script type=&quot;math/tex&quot;&gt;\mathcal{L}&lt;/script&gt;”, which requires LaTeX encoding (the Gradio ChatInterface class provides Markdown and LaTeX rendering by default).
Although I didn’t look into this further, I suspect that these replacements of tokens that are very similar to the “right” tokens, as encountered before in e.g. “intelektual”, occur if the noise perturbs the prediction by only a small amount, such that the predicted token is close to the “correct” token in the embedding space.&lt;br /&gt;
Another interesting observation is the appearance of word and word particle repetitions.
While there are direct repetitions, e.g. “nebst” 9 times in a row, there are also alternating repetitions, such as “apprehenantly”, “apprehensively”, “suscepantly”, and “suscepsively”, which the model alternates between for quite a while, interspersed with nonsense tokens caused by noise.
There seems to be a shift happening from “apprehenantly” to “apprehensively” and from “suscepantly” to “suscepsively” over the course of this sequence.
Also, Emoji are starting to make an appearance!&lt;br /&gt;
I am skipping the output for noise level &lt;script type=&quot;math/tex&quot;&gt;1.75 \times 10^{-7}&lt;/script&gt;, as it is more or less the same as that seen for &lt;script type=&quot;math/tex&quot;&gt;1.5 \times 10^{-7}&lt;/script&gt;, with just a slight increase in the speed at which the generation becomes incoherent.
The last two I want to show you are for noise levels &lt;script type=&quot;math/tex&quot;&gt;2.5 \times 10^{-7}&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;5 \times 10^{-7}&lt;/script&gt;, where in the former we observe the deterioration of the output to incoherent ramblings after the first paragraph and in the latter, basically instantly.&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img id=&quot;csgemma_qg_nl_0.00000025&quot; src=&quot;/images/blog_entries/llm_brain_damage/csgemma_qg_nl_0.00000025.jpeg&quot; alt=&quot;Example output for a somewhat high noise level&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 11: Output for a noise level of $$2.5 \times 10^{-7}$$.
&lt;/figcaption&gt;

&lt;p&gt;My physics grad student friends had a laugh about the claim that Quantum Gravity is a CASH theory, though I couldn’t help but notice a hint of melancholy in some, imagining the prospect of well-funded quantum physics research.
The question of which specific Lawrence Gemma is talking about, and what exactly his crap is, remains actively debated.&lt;br /&gt;
Incidentally, this is the only “curse” word I encountered in all the generated text I sifted through, though I don’t speak most of the languages present besides English, German, French and Italian, so I might have missed some.&lt;/p&gt;

&lt;p&gt;&lt;img id=&quot;csgemma_qg_nl_0.0000005&quot; src=&quot;/images/blog_entries/llm_brain_damage/csgemma_qg_nl_0.0000005.jpeg&quot; alt=&quot;Example output for a high noise level&quot; /&gt;&lt;/p&gt;

&lt;figcaption style=&quot;text-align: center;&quot;&gt;
  Figure 12: Output for a noise level of $$5 \times 10^{-7}$$.
&lt;/figcaption&gt;

&lt;p&gt;This is where we ended our experiments, suspecting that it won’t get any better from here on out.
Although &lt;em&gt;DuisburgEverything&lt;/em&gt; might have what it takes as the slogan for a PR campaign.&lt;br /&gt;
Getting back to a more serious discussion, we were somewhat surprised about the results we got from these experiments.
Although in this setting only the activations are noisy, as opposed to both activations and weights in the work of Emonds, Xi, and Fröning &lt;a href=&quot;#2&quot;&gt;[2]&lt;/a&gt;, the model seems to show a much higher sensitivity to noise than in their work.
The decrease in accuracy observed in their tests on image classification using VGG16 with bfloat16 weights and activations is only minor at a noise level of &lt;script type=&quot;math/tex&quot;&gt;5 \times 10^{-7}&lt;/script&gt;, whereas Gemma completely lost the ability to form coherent sentences at that level.
At the same time, the model seems to cope reasonably well with unexpected tokens interspersed throughout the text, at least to some extent.
We plan on presenting a more rigorous analysis of these effects in the future.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;With this post, we aimed to cover the basics of RRAM and why it is noisy, but also to show why it is still a worthwhile technology to investigate for DNNs given its benefits.
I am very curious about the results you might encounter in your own tests, and would be more than delighted to receive the silly outputs you get from your instances of brain-damaged Gemma.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id=&quot;sources&quot;&gt;Sources&lt;/h2&gt;

&lt;p&gt;&lt;a id=&quot;1&quot;&gt;[1]&lt;/a&gt; S. Bunce, Juan Manuel Marquez’s ‘perfect punch’ leaves questions over Manny Pacquiao future, 2012-12-10, &lt;a href=&quot;https://www.independent.co.uk/sport/general/others/juan-manuel-marquez-s-perfect-punch-leaves-questions-over-manny-pacquiao-future-8397761.html&quot;&gt;https://www.independent.co.uk/sport/general/others/juan-manuel-marquez-s-perfect-punch-leaves-questions-over-manny-pacquiao-future-8397761.html&lt;/a&gt;. [Accessed 2024-06-25].&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;2&quot;&gt;[2]&lt;/a&gt; Y. Emonds, K. Xi, and H. Fröning, “Implications of Noise in Resistive Memory on Deep Neural Networks for Image Classification”. &lt;em&gt;arXiv preprint arXiv:2401.05820&lt;/em&gt;, 2024. &lt;a href=&quot;https://arxiv.org/abs/2401.05820&quot;&gt;https://arxiv.org/abs/2401.05820&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;3&quot;&gt;[3]&lt;/a&gt; T.J. Yen, A. Gismatulin, V. Volodin et al., “All Nonmetal Resistive Random Access Memory”. &lt;em&gt;Sci Rep 9&lt;/em&gt;, 6144, 2019. &lt;a href=&quot;https://doi.org/10.1038/s41598-019-42706-9&quot;&gt;https://doi.org/10.1038/s41598-019-42706-9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;4&quot;&gt;[4]&lt;/a&gt; H.-S. Philip Wong, Heng-Yuan Lee, Shimeng Yu et al., “Metal–Oxide RRAM”. &lt;em&gt;Proceedings of the IEEE&lt;/em&gt;, vol. 100, no. 6, pp. 1951-1970, June 2012. &lt;a href=&quot;https://doi.org/10.1109/JPROC.2012.2190369&quot;&gt;https://doi.org/10.1109/JPROC.2012.2190369&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;5&quot;&gt;[5]&lt;/a&gt; K. Min, D. Jung, Y. Kwon, “Investigation of switching uniformity in resistive memory via finite element simulation of conductive-filament formation”. &lt;em&gt;Sci Rep 11&lt;/em&gt;, 2447, 2021. &lt;a href=&quot;https://doi.org/10.1038/s41598-021-81896-z&quot;&gt;https://doi.org/10.1038/s41598-021-81896-z&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;6&quot;&gt;[6]&lt;/a&gt; J.J. Yang, D.B. Strukov, D.R. Stewart, “Memristive devices for computing”. &lt;em&gt;Nature Nanotechnology 8(1)&lt;/em&gt;, 13–24, 2013. &lt;a href=&quot;https://doi.org/10.1038/nnano.2012.240&quot;&gt;https://doi.org/10.1038/nnano.2012.240&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;7&quot;&gt;[7]&lt;/a&gt; W. X. Zhao, K. Zhou, J. Li et al., “A Survey of Large Language Models”  &lt;em&gt;arXiv preprint arXiv:2303.18223&lt;/em&gt;, 2023. &lt;a href=&quot;https://arxiv.org/abs/2303.18223&quot;&gt;https://arxiv.org/abs/2303.18223&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;8&quot;&gt;[8]&lt;/a&gt; S. Minaee, T. Mikolov, N. Nikzad et al., “Large Language Models: A Survey”, &lt;em&gt;arXiv preprint arXiv:2402.06196&lt;/em&gt;, 2024. &lt;a href=&quot;https://arxiv.org/abs/2402.06196&quot;&gt;https://arxiv.org/abs/2402.06196&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;9&quot;&gt;[9]&lt;/a&gt; S. Ma, H. Wang, L. Ma et al., “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits”, &lt;em&gt;arXiv preprint arXiv:2402.17764&lt;/em&gt;, 2024. &lt;a href=&quot;https://arxiv.org/abs/2402.17764&quot;&gt;https://arxiv.org/abs/2402.17764&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;10&quot;&gt;[10]&lt;/a&gt; P. Villalobos, A. Ho, J. Sevilla et al., “Will we run out of data? Limits of LLM scaling based on human-generated data”, &lt;em&gt;arXiv preprint arXiv:2211.04325&lt;/em&gt;, 2022. &lt;a href=&quot;https://arxiv.org/abs/2211.04325v2&quot;&gt;https://arxiv.org/abs/2211.04325v2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;11&quot;&gt;[11]&lt;/a&gt; S. Luccioni, Y. Jernite, E. Strubell, “Power Hungry Processing: Watts Driving the Cost of AI Deployment?”, &lt;em&gt;Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24)&lt;/em&gt;, Rio de Janeiro, Brazil, pp. 85–99, 2024. &lt;a href=&quot;https://doi.org/10.1145/3630106.3658542&quot;&gt;https://doi.org/10.1145/3630106.3658542&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;12&quot;&gt;[12]&lt;/a&gt; V.G. Karpov, D. Niraula, “Log-Normal Statistics in Filamentary RRAM Devices and Related Systems”, &lt;em&gt;IEEE Electron Device Letters 38(9)&lt;/em&gt;, 1240–1243, 2017. &lt;a href=&quot;https://doi.org/10.1109/LED.2017.2734961&quot;&gt;https://doi.org/10.1109/LED.2017.2734961&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;13&quot;&gt;[13]&lt;/a&gt; C. Colombatto, S. Fleming, “Folk psychological attributions of consciousness to large language models”, &lt;em&gt;Neurosci Conscious&lt;/em&gt;, 2024. &lt;a href=&quot;https://doi.org/10.1093/nc/niae013&quot;&gt;https://doi.org/10.1093/nc/niae013&lt;/a&gt;&lt;/p&gt;</content><author><name>Lexi</name></author><category term="blog" /><category term="RRAM" /><summary type="html">An exploration of the effect of RRAM activation write noise on the Gemma LLM</summary></entry><entry><title type="html">ZITI Thesis Fair April 2024</title><link href="https://hawaii.ziti.uni-heidelberg.de/blog/ziti-thesis-fair-2024-04-15/" rel="alternate" type="text/html" title="ZITI Thesis Fair April 2024" /><published>2024-04-15T00:00:00+00:00</published><updated>2024-04-15T00:00:00+00:00</updated><id>https://hawaii.ziti.uni-heidelberg.de/blog/ziti-thesis-fair</id><content type="html" xml:base="https://hawaii.ziti.uni-heidelberg.de/blog/ziti-thesis-fair-2024-04-15/">&lt;p&gt;The windy Heidelberg spring afternoon of the 15th of April set the stage for the biannual ZITI Thesis Fair of Spring 2024.
The Thesis Fair, held at the beginning of each semester, is an occasion for students and lecturers to network and learn about the current research of the different groups inside the ZITI in a casual atmosphere.
Braving the elements, the Master’s students and PhD candidates of the Computing Systems Group represented the group by presenting their ongoing work and possible topics for Master’s theses to interested students.&lt;/p&gt;

&lt;figure class=&quot;align-center&quot;&gt;
  &lt;img src=&quot;/images/blog_entries/ziti_thesis_fair_2024_04_15/ziti_thesis_fair_2024_04_15_group_photo.webp&quot; alt=&quot;CSG Poster at the ZITI Thesis Fair&quot; /&gt;
  &lt;figcaption&gt;Members of the CSG representing our group at the ZITI Thesis Fair. Left to right: Franz Kevin Stehle, Prakriti Jain, Wang Xiao, Xu Congcong, Hendrik Borras, Daniel Barley, Bernhard Klein, Wu Yong &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Though mainly aimed at Computer Engineering students, the event is open to everyone, and, as in semesters past, we were glad to welcome interested passers-by.
On the culinary side, the attendees were treated to freshly grilled sausages and cheese served by ZITI’s Dr. Alexander Schubert.&lt;/p&gt;

&lt;figure class=&quot;align-center&quot;&gt;
  &lt;img src=&quot;/images/blog_entries/ziti_thesis_fair_2024_04_15/barbecue.webp&quot; alt=&quot;Members of ZITI at the barbecue&quot; /&gt;
  &lt;figcaption&gt;Dr. Alexander Schubert serving locally produced sausages to the Thesis Fair participants&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;All in all, the ZITI Thesis Fair of Spring 2024 was an event of many gainful discussions and new connections.
The CSG thanks all organizers and helpers that made this year’s ZITI Thesis Fair possible!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.&lt;/strong&gt;&lt;/p&gt;</content><author><name>Lexi</name></author><category term="blog" /><category term="ZITI Events" /><summary type="html">A short overview of the ZITI Thesis Fair of Spring Semester 2024.</summary></entry></feed>