April 18, 2024

A case for distributed computing for LLMs in Fintech

The previous year, 2023, was clearly a standout year for advances in AI. It has long been held that making the most of AI requires a strong investment in infrastructure and support, and this has never been clearer than in the past year with the advent of generative AI. Most traditional AI systems prior to generative AI ran reasonably well on a handful of GPUs and modest RAM. That changed after the release of GPT-3 by OpenAI and the subsequent release of a large number of open-source models. These large language models are big in every way: they need massive computing resources in the form of high-performance GPUs and huge amounts of memory. The financial services sector in particular is widely seen as a main beneficiary of this technology. The resources this sector spends on analyzing and processing data, particularly textual data, can be greatly reduced through LLMs. In fact, it is open-source LLMs that have found their greatest use in this sector, for several reasons.

(a) Criticality of data and its security: Much of the data in the financial sector is confidential. It must be secured and kept from public access; a leak of this data can cause serious problems for the company. This argues for in-house or open-source solutions rather than proprietary ones, especially for critical and sensitive use cases.

(b) Customization of LLMs: Most use cases in this sector require customizing LLM models with very specific data sets, which vary from company to company, to produce the right answers.

It is quite evident that the applicability of open-source LLMs in the financial sector is increasing, but at the same time there are many challenges in even a basic LLM deployment. The large amount of computing power and memory required is expensive and difficult to maintain. Take the recent milestone of the BigScience project's BLOOM, a model with 176 billion parameters supporting 46 natural languages and 13 programming languages. While the public release of these 100+ billion parameter models has made them accessible in principle, the associated memory and computational costs remain high. In particular, models like OPT-175B and BLOOM-176B require more than 350 GB of accelerator memory for inference and even more for fine-tuning. Consequently, practical use of such LLMs often requires multiple high-end GPUs or multi-node clusters, whose high cost puts them out of reach for many researchers and practitioners.

This justifies trying a completely different perspective altogether: as they say, thinking outside the box.

Client–Server Approach

This motivates distributed computing for LLMs as one possible solution. It also makes sense because we already rely on distributed systems elsewhere, such as cloud and edge computing. The idea is to let multiple users collaborate over the Internet to run inference and fine-tune large language models. Participants in such a distributed network can take the role of server, client, or both. A server hosts a subset of model layers, typically consecutive Transformer blocks, and handles client requests. Clients, in turn, form a chain of consecutive servers to run inference through the entire model. Beyond inference, fine-tuning can be performed with parameter-efficient methods, such as adapters, or by training entire layers. Trained submodules can be shared in a model hub, where others can use them for inference or further training. Models with over 100 billion parameters can run efficiently in this collaborative environment with the help of optimizations such as dynamic quantization, prioritizing low-latency connections, and load balancing between servers. Let's look at this in a little more detail.
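To make the chain-building step concrete, here is a minimal sketch of how a client might assemble a chain of servers that together cover all Transformer blocks. The announcement format and names are hypothetical; a real system would discover servers through a distributed hash table rather than a local list.

```python
# Hypothetical sketch: greedily pick servers whose hosted block ranges
# [start, end) cover layers 0..n_blocks of the model without gaps.

def build_chain(announcements, n_blocks):
    chain, covered = [], 0
    while covered < n_blocks:
        # among servers whose range contains the next uncovered block,
        # pick the one whose range reaches furthest
        candidates = [a for a in announcements if a["start"] <= covered < a["end"]]
        if not candidates:
            raise RuntimeError(f"no server hosts block {covered}")
        best = max(candidates, key=lambda a: a["end"])
        chain.append(best["peer_id"])
        covered = best["end"]
    return chain

servers = [
    {"peer_id": "srv-A", "start": 0,  "end": 30},
    {"peer_id": "srv-B", "start": 20, "end": 50},
    {"peer_id": "srv-C", "start": 50, "end": 70},
]
print(build_chain(servers, 70))  # ['srv-A', 'srv-B', 'srv-C']
```

A production client would also weigh latency and load when choosing among overlapping candidates, not just range length.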

Design and Technical summary

Practical applications of large language models can be broadly classified into two main scenarios: inference and parameter-efficient adaptation to downstream tasks. I will attempt to outline the design of a distributed network, clarifying how it handles both scenarios and facilitates the seamless exchange of trained adapters between users of the system.

  • Billion-scale model inference: In the token-generation process, the client stores the model's token embeddings locally; these typically constitute a small fraction of the total parameter count and fit comfortably in the RAM of most modern laptops, servers, and workstations. The client relies on servers to execute the Transformer blocks, with each server hosting several consecutive blocks, the number determined by its available GPU memory. Before each inference session, the client establishes a chain of servers that together span all layers of the model. During the session, the client uses its local embedding layer to look up embedding vectors for the prefix tokens, sends these vectors to the servers, and receives updated representations. After obtaining the outputs of the final block, the client computes the probabilities of the next token and repeats the process. Servers retain the attention keys and values from previous client inputs for subsequent inference steps, and clients store past inputs for each server so that a failed or offline server can be replaced quickly.
  • Training for downstream tasks: While large language models (LLMs) excel at many problems with simple prompt engineering, achieving optimal results often requires training. Traditional fine-tuning strategies, which update all model parameters for the downstream task, become impractical for very large models due to their extensive hardware requirements. For example, fine-tuning BLOOM-176B would require almost 3 TB of GPU memory to accommodate the model, gradients, and optimizer states. To address this challenge, the NLP community has devised parameter-efficient fine-tuning methods that keep most of the pre-trained model's parameters frozen. Some approaches select a subset of existing parameters to train, while others augment the model with additional trainable weights. Despite their lower memory requirements, these parameter-efficient approaches often compete favorably with full fine-tuning and can outperform it in low-data scenarios.
  • Distributed fine-tuning: The fundamental idea behind fine-tuning in a distributed network is that clients own the trainable parameters, while servers host the original pre-trained layers. Servers can run backpropagation through their layers and return gradients with respect to the activations, but they do not update the server-side parameters. This allows clients to run different training tasks simultaneously on the same set of servers without interference.
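The inference loop described above can be sketched end to end. The classes below are toy stand-ins, not a real API: `RemoteBlockChain` plays the role of the server chain (with per-server caches of past inputs), and the "Transformer block" is a trivial arithmetic update, since the point is the control flow, not the math.

```python
# Sketch of the client-side generation loop: local embedding lookup,
# hidden states streamed through a chain of remote blocks, and local
# next-token selection. All components here are hypothetical stand-ins.

class RemoteBlockChain:
    """Stand-in for a chain of servers hosting consecutive blocks;
    each keeps its own cache of past inputs between steps."""
    def __init__(self, n_blocks):
        self.caches = [[] for _ in range(n_blocks)]

    def forward(self, hidden):
        for cache in self.caches:
            cache.append(hidden)                      # server retains past inputs
            hidden = [h * 0.5 + 1.0 for h in hidden]  # toy "Transformer block"
        return hidden

def generate(prompt_ids, embed, chain, lm_head, n_new_tokens):
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        hidden = embed(ids[-1])            # local embedding lookup
        hidden = chain.forward(hidden)     # remote Transformer blocks
        logits = lm_head(hidden)           # local output projection
        ids.append(max(range(len(logits)), key=logits.__getitem__))  # greedy pick
    return ids

embed = lambda tok: [float(tok)]                        # toy 1-dim "embedding"
lm_head = lambda h: [-abs(h[0] - v) for v in range(4)]  # toy logits over 4 tokens
out = generate([3], embed, RemoteBlockChain(2), lm_head, 3)
print(out)  # → [3, 2, 2, 2]
```

In a real deployment `chain.forward` would be a network round-trip, and the caches would hold attention keys and values rather than raw inputs.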
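The distributed fine-tuning split can also be illustrated with toy scalars. The key property is that the server backpropagates through its frozen layer and returns only the gradient with respect to the activations, while the client updates a small adapter it owns locally. All names and numbers below are illustrative assumptions.

```python
# Hypothetical sketch: server holds a frozen layer and returns input
# gradients; client owns and trains a tiny adapter (here, a bias term).

class FrozenServerLayer:
    """y = w * x with w fixed; backward returns dL/dx, never updates w."""
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        return self.w * x
    def backward(self, grad_y):
        return self.w * grad_y          # gradient w.r.t. the input only

class ClientAdapter:
    """Client-owned trainable bias: h = x + b."""
    def __init__(self):
        self.b = 0.0
    def forward(self, x):
        return x + self.b
    def step(self, grad_out, lr=0.1):
        self.b -= lr * grad_out         # dL/db equals dL/d(adapter output)

server, adapter = FrozenServerLayer(w=2.0), ClientAdapter()
x, target = 1.0, 6.0
for _ in range(50):
    h = adapter.forward(x)              # client-side adapter
    y = server.forward(h)               # frozen remote layer
    grad_y = 2 * (y - target)           # dL/dy for L = (y - target)^2
    grad_h = server.backward(grad_y)    # server returns activation gradient
    adapter.step(grad_h)                # only the client's parameter moves
# adapter.b converges so that 2 * (x + b) ≈ 6, i.e. b ≈ 2
```

Because the server never applies the gradients to its own weights, many clients with different objectives can share the same frozen layers concurrently.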

Internal structure and optimizations

Performance considerations are paramount for distributed inference, and involve three key aspects: compute speed (a 5-year-old gaming GPU versus a new data-center GPU), communication latency due to node distance (intercontinental versus local), and communication bandwidth (10 Mbit/s versus 10 Gbit/s). While even a consumer GPU like the GeForce RTX 3070 can run a full BLOOM-176B inference step in less than a second, the challenge lies in GPU memory limitations, which demand efficient solutions. One way to address this is to employ quantization to optimize parameter storage and dynamic server prioritization to improve communication speed.

  • Consumer GPU usage: Assuming each server has at least 16 GB of CPU RAM and 8 GB of GPU memory, the main goal is to minimize the model's memory footprint so that each device can accommodate more Transformer blocks. BLOOM's 176B parameters require 352 GB of GPU memory at 16-bit precision; we can reduce this by compressing hidden states with dynamic blockwise quantization and by storing weights at 8-bit precision using mixed-precision matrix decomposition. This substantially reduces the required number of nodes, effectively cutting latency in half and lowering the probability of failure.
  • Compressing communication buffers: We can apply dynamic blockwise quantization to hidden states before pipeline-parallel communication, halving bandwidth requirements without compromising generation quality.
  • Model weight compression: Using 8-bit mixed-precision matrix decomposition for matrix multiplications reduces the memory footprint by approximately half without sacrificing quality.
  • Collaborating over the Internet: To ensure reliable inference and training while nodes join, leave, or fail, we can use the hivemind library for decentralized training, together with custom fault-tolerant protocols for servers and clients.
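To show what dynamic blockwise quantization does to a communication buffer, here is a self-contained sketch: the tensor is split into small blocks, and each block is stored as one float scale plus 8-bit integer codes, roughly halving bandwidth versus 16-bit floats. The block size and rounding scheme are illustrative choices, not the exact scheme a production library uses.

```python
# Illustrative dynamic blockwise 8-bit quantization of a hidden-state
# vector: one absmax-derived scale per block, int8 codes per element.

def quantize_blockwise(values, block_size=4):
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = max(abs(v) for v in block) / 127 or 1.0  # absmax scaling
        codes = [round(v / scale) for v in block]        # int8 range [-127, 127]
        blocks.append((scale, codes))
    return blocks

def dequantize_blockwise(blocks):
    out = []
    for scale, codes in blocks:
        out.extend(c * scale for c in codes)
    return out

hidden = [0.81, -1.62, 0.24, 3.05, -0.02, 0.40]
restored = dequantize_blockwise(quantize_blockwise(hidden))
# `restored` approximates `hidden` within a small per-block rounding error
```

Because the scale adapts per block, a single outlier only degrades precision within its own block rather than across the whole tensor, which is why this works well on activation distributions with rare large values.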

Democratization and privacy concerns

We can take inspiration from blockchain to address the potential imbalance between peers that supply GPU resources (servers) and those that use these servers for inference or fine-tuning. An incentive system could be implemented: peers running servers could earn special points, redeemable for high-priority inference and fine-tuning or for other rewards. This approach aims to encourage active participation and keep the network balanced. A recognized limitation of the current approach is privacy: peers serving the initial layers of the model could use their inputs to recover the client's input tokens. Users handling sensitive data are therefore advised to restrict their clients to trusted servers or to set up their own isolated swarm. Privacy-enhancing technologies, such as secure multi-party computation or privacy-preserving hardware from NVIDIA, could also be explored.


My goal in this blog is to present my vision of distributed computing for AI, explain why it is necessary, and give a brief technical overview of a possible approach to implementing it. I am open to discussing new ideas for implementing this. Given the massive adoption of AI expected in the financial sector in the coming years, we have to start thinking about how to make optimal use of current resources before creating new ones. The other goal is to democratize access to large language models, enabling a broader range of applications, studies, and research questions that were previously challenging or cost-prohibitive.
