Fine-tune a large language model (LLM) for multi-turn conversations and run it on a Text Generation Inference (TGI) server (2024)

This blog post is about the initial fine-tuning process for a large language model (LLM) for multi-turn conversations and about running the fine-tuned model on a Text Generation Inference (TGI) server on an IBM Cloud Virtual Server Instance. It covers the entire process from training the model until the model is ready to be tested.

LLMs have become popular for multi-turn conversations since the introduction of ChatGPT’s interactive chat experience.

The blog post briefly covers several key topics, including knowing the use case, information resources, and the tasks involved when moving from fine-tuning to running the model, up to the deployment of the fine-tuned model in an enterprise environment using watsonx.ai.

Table of contents

  1. Example for multi-turn
  2. Resources
  3. Topics related to fine-tuning an LLM
  4. Knowing our use case
  5. Knowing why we need to fine-tune a model
  6. Knowing the type of model we need to address our use case
  7. Knowing the golden ground truth for valid, invalid, out-of-topic conversation flows for the use case
  8. Knowing the data output format we wish our fine-tuning to produce
  9. Knowing the data format for the training data and testing data we want to use for the fine-tuning to achieve our needed output format
  10. Preparing the train/test data using synthetic data generation if we don’t have enough data
  11. Preparing and running the setup of a Virtual Server Instance on IBM Cloud with GPUs
  12. Selecting the libraries we want to use for the training and running the training
    1. Prepare the fine-tuning by the installation of the needed libraries
    2. Implement the fine-tuning
    3. Run the fine-tuning
  13. Run the fine-tuned model on a Text Generation Inference (TGI) server
  14. Implementing or using existing evaluation/testing frameworks to test/evaluate the fine-tuned model
  15. Define the metrics you want to use to display your evaluation results
  16. Running our own fine-tuned model in a robust enterprise environment using watsonx.ai on-premise in a Cloud Pak for Data instance in a Virtual Private Cloud on AWS or on IBM Cloud
  17. Summary

1. Example for multi-turn

Here is an example of one multi-turn conversation flow to get some weather data. We want to ensure that only the weather topic is relevant for the multi-turn conversation. The user can request to change the resulting weather data by refining their last question with a new one, but we don’t expect the user to ask off-topic questions in our defined multi-turn LLM configuration scenario.

Assistant represents the answer to a question from the User that our model should later address:

  • Multi-turn flow
Step | Role | Content | Notes
1 | User | Can you please give me all your weather forecast data? | Getting the initial data.
1 | Assistant | Here is the data. | The fine-tuned model provides a valid SQL query that is used by an application to query the needed data from a system and display it to the user.
2 | User | Can you please reduce the displayed data to the US-south region, excluding the city Dallas? | Reduce the result of the data to display.
2 | Assistant | Here is the data. | The fine-tuned model provides a valid SQL query that is used by an application to query the needed data from a system and display it to the user.
3 | User | Who has won the soccer world cup in 1954? | Here the user asks an off-topic question; this type of question must be covered in a multi-turn flow, which means we must be able to handle it.
3 | Assistant | I can only help you with weather data. Can you please rephrase your question? |

2. Resources

I used various information resources as input to write my blog post. I want to highlight these three excellent blog posts in that context.

3. Topics related to fine-tuning an LLM

The following list contains the topics we need to take care of when we are moving from fine-tuning to running the fine-tuned model:

  1. Knowing the “golden” ground truth containing valid, invalid, and out-of-topic conversation flows for the use case
  2. Knowing the data output format we wish our fine-tuning to produce
  3. Knowing the data format for the train/test data we want to use for the fine-tuning to achieve our needed output format
  4. Preparing the train/test data using synthetic data generation if we don’t have enough data
  5. Preparing and running the setup of a Virtual Server Instance on IBM Cloud with GPUs
  6. Selecting the libraries we want to use for the training and running the training
  7. Running the fine-tuned model on a Text Generation Inference (TGI) server
  8. Implementing or using existing evaluation/testing frameworks to test/evaluate the fine-tuned model
  8. Implementing or using existing evaluation/testing frameworks to test/evaluate the fine-tuned model

4. Knowing our use case

We must have a reason why we want to use a large language model (LLM) to handle conversations. We know that data extraction is necessary in many business scenarios. For example, a persona that needs to extract data from a database is a good case, and the solution in Phil Schmid’s Text-to-SQL blog post fits our needs.

5. Knowing why we need to fine-tune a model

One of the main reasons for using a fine-tuned model is customization for specific tasks, combined with optimizing runtime cost by fine-tuning a smaller model for a particular task. Another reason can be to minimize the prompt token size or, in an optimal situation, to need no prompt at all when a user or system interacts with the model.

6. Knowing the type of model we need to address our use case

We may have many different potential use cases for our business, for example summarization, categorization, and more.

The question often is: Which is the right model for our use case?

Let us assume we need to transform text to SQL, as mentioned in Phil Schmid’s blog post. We benchmark different models for text to SQL and ask ourselves: Can we easily fine-tune them? Resources with information on Text-to-SQL can be found, for example, at Defog or the Hugging Face LLM leaderboard. Assume we select the Mistral model for fine-tuning; others have done this before, for example in the blog post Fine-Tuning the LLM Mistral-7b for Text-to-SQL with SQL-Create-Context Dataset. That is one way to find a starting point for how to address our use case.

7. Knowing the “golden” ground truth for valid, invalid, out-of-topic conversation flows for the use case

We want to be able to measure the quality and accuracy of our fine-tuning for our use case “providing weather data”. We need to be able to verify that the responses of our model are correct.

For that, we need test data, the so-called “golden” ground truth.

The golden ground truth should consist of valid, invalid, and out-of-topic conversation flows. We must provide this data ourselves because we know our use case best; in many situations, others (consultants or other external resources) can’t generate it for us because they are not familiar with our use case, and they would add additional costs to our project.

That is a massive task, because we need both correct data and a sufficient amount of it; a minimum of roughly 1,000 examples for training/testing is a number we often find when searching for the minimum dataset size to fine-tune an LLM.
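
A minimal sketch of how such labeled ground-truth entries could look (the label field and its values are assumptions for illustration; the conversational training format itself is defined in section 9):

{"label": "valid", "messages": [{"role": "user", "content": "Can you please give me all your weather forecast data?"}, {"role": "assistant", "content": "SELECT * FROM WEATHER_DATA"}]}
{"label": "out-of-topic", "messages": [{"role": "user", "content": "Who has won the soccer world cup in 1954?"}, {"role": "assistant", "content": "I can only help you with weather data. Can you please rephrase your question?"}]}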

8. Knowing the data output format we wish our fine-tuning to produce

Finally, our fine-tuned LLM will be integrated by the development team into an application that implements the interaction with a user, external systems, and internal systems for our use case. This application will parse the raw output of our model, which comes back as the response of a REST or gRPC call to the platform our model runs on. We should define a format that works best in our situation.

An example response can be a JSON object generated by agents:

{ "tool_name": "final answer", "sql": "SELECT * FROM WEATHER_DATA"}

For details, visit Mixtral Agents with Tools for Multi-turn Conversations by Niklas Heidloff
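
As a minimal sketch (the function name parse_tool_response is an assumption, and the markdown code fence around the JSON matches the training data shown in section 9), the application could extract and parse this JSON from the raw model output like this:

import json
import re

def parse_tool_response(raw_output: str) -> dict:
    # The model may wrap the JSON in a ```json ... ``` block (see the training data in section 9),
    # so we extract the first {...} span and parse it.
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in: {raw_output!r}")
    return json.loads(match.group(0))

response = parse_tool_response('```json{"tool_name": "final answer", "sql": "SELECT * FROM WEATHER_DATA"}```')
print(response["tool_name"], "->", response["sql"])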

9. Knowing the data format for the training data and testing data we want to use for the fine-tuning to achieve our needed output format

In our example, we will use the following data format for the training data; it is called the conversational format (JSONL).

{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

The basic structure for the training format can be found in the blog post How to Fine-Tune LLMs in 2024 with Hugging Face by Phil Schmid.

In the JSON below, we see an array of messages that contains key-value pairs of “role” and “content”.

A role can be “system”, representing essential prompt content for the model. When the role has the value “user”, the “content” is a question a user can ask. The role “assistant” represents the answer content of the model.

So, this example training data set represents a single-turn flow because we have only one question and one response.

{"messages": [{"role": "system", "content": "You are a weather data expert."}, {"role": "user", "content": "Can you please give me all your weather forecast data?"}, {"role": "assistant", "content": "```json{\"tool_name\": \"final answer\",\"sql\": \"SELECT * FROM WEATHER_DATA\"}```"}]}

A multi-turn flow looks like this in the JSON below:

{"messages": [{"role": "system", "content": "You are a weather data expert."}, {"role": "user", "content": "Can you please give me all your weather forecast data?"}, {"role": "assistant", "content": "```json{\"tool_name\": \"final answer\",\"sql\": \"SELECT * FROM WEATHER_DATA\"}```"}, {"role": "user", "content": "Who has won the soccer world cup in 1954?"}, {"role": "assistant", "content": "```json{\"tool_name\": \"refiner\",\"input\": \"I can only help you with weather data. Can you please rephrase your question?\"}```"}]}

You can use the same format for the test data, but you only keep the most critical data of the User and Assistant:

Flow number | User | Assistant | User | Assistant
1 | Can you please give me all your weather forecast data? | SELECT * FROM WEATHER_DATA | |
2 | Can you please give me all your weather forecast data? | SELECT * FROM WEATHER_DATA | Who has won the soccer world cup in 1954? | I can only help you with weather data. Can you please rephrase your question?
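
A minimal sketch of a validation step for this JSONL format (the file name is taken from the training script in section 12.2), checking that every line parses and that the messages are well-formed before we hand the data to the trainer:

import json

ALLOWED_ROLES = {"system", "user", "assistant"}

# Check every line of the training/test file: it must parse as JSON and contain well-formed messages.
with open("synthetic_data_generated.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        for message in record["messages"]:
            assert message["role"] in ALLOWED_ROLES, f"line {line_number}: unexpected role {message['role']}"
            assert isinstance(message["content"], str), f"line {line_number}: content must be a string"
print("All lines are valid conversational-format records.")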

10. Preparing the train/test data using synthetic data generation if we don’t have enough data

There can be situations where we don’t have 1,000 examples for fine-tuning, but we can use LLMs to generate synthetic data. It is fantastic how far this generation can go. However, we must keep in mind that we should be able to validate the correctness of the generated synthetic data.

The blog post Generating Synthetic Data with Large Language Models by Niklas Heidloff can be helpful in this context.
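
A minimal sketch of such a generation loop, assuming a TGI server with a capable base model is already reachable on localhost:8080 (the prompt wording, the number of examples, and the endpoint are assumptions; the linked blog post describes more robust approaches):

import json
import requests

TGI_URL = "http://localhost:8080/generate"   # assumption: a TGI server with a strong base model is running

PROMPT = (
    "Generate one multi-turn weather conversation as a single JSON line in the format "
    '{"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}. '
    "The assistant answers with a SQL query on the table WEATHER_DATA, and off-topic questions "
    "must be answered with a polite refusal."
)

with open("synthetic_data_generated.jsonl", "a") as out:      # file name matches the training script
    for _ in range(10):                                       # number of generated examples is an assumption
        response = requests.post(
            TGI_URL,
            json={"inputs": PROMPT, "parameters": {"max_new_tokens": 512, "temperature": 0.7}},
            timeout=120,
        )
        generated = response.json()["generated_text"].strip()
        try:
            json.loads(generated)                             # only keep lines that parse as JSON
        except json.JSONDecodeError:
            continue
        out.write(generated + "\n")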

11. Preparing and running the setup of a Virtual Server Instance on IBM Cloud with GPUs

If our local machine does not have enough power for training and running a fine-tuned LLM, we can use, for example, a Virtual Server Instance on IBM Cloud with GPUs, and we can later run our model with the Text Generation Inference (TGI) server from Hugging Face. A short GPU sanity check is sketched after the list below.

Here are two blog posts that could be helpful for these tasks:

  • How do you initially set up a Virtual Server Instance with a GPU in IBM Cloud?
  • Getting started with Text Generation Inference (TGI) using a container to serve your LLM model
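
Before the training, a quick sanity check (a small sketch, not part of the linked posts) confirms that the GPUs on the instance are visible to PyTorch:

import torch

# Verify that CUDA is available and list the visible GPUs on the Virtual Server Instance.
if torch.cuda.is_available():
    for index in range(torch.cuda.device_count()):
        print(f"GPU {index}: {torch.cuda.get_device_name(index)}")
else:
    print("No GPU visible to PyTorch - check the driver/CUDA installation on the instance.")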

12. Selecting the libraries we want to use for the training and running the training

We select the Supervised Fine-tuning Trainer (SFTTrainer) from Hugging Face for the model training and the datasets library to manage the data for the training split. Most of the fine-tuning implementation is reused from the blog posts How to Fine-Tune LLMs in 2024 with Hugging Face written by Phil Schmid and Fine-tuning LLMs via Hugging Face on IBM Cloud written by Niklas Heidloff.

The Supervised Fine-tuning Trainer is widely used in the context of fine-tuning models.

12.1. Prepare the fine-tuning by installing the needed libraries

Here are short descriptions of the installed Hugging Face libraries:

  • Supervised fine-tuning (or SFT for short) is a crucial step in RLHF (a methodology for integrating human data labels into an RL-based optimization process). The TRL (Transformer Reinforcement Learning) library from Hugging Face provides an easy-to-use API to create SFT models and train them with a few lines of code on a given dataset.
  • Transformers “provides APIs to quickly download and use pre-trained models on a given text, fine-tune them on custom datasets, and share the model with the community on the Hugging Face model hub. At the same time, each Python module defining an architecture is fully standalone and can be modified to enable quick research experiments.”
  • datasets “provides one-liners to download and pre-process any of the major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the HuggingFace Datasets Hub.”
  • PEFT (Parameter-Efficient Fine-Tuning) “is a library for efficiently adapting large pre-trained models to various downstream applications without fine-tuning all of a model’s parameters because it is prohibitively costly. PEFT methods only fine-tune a small number of (extra) model parameters – significantly decreasing computational and storage costs – while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware.”
python3 -m pip install "torch==2.1.2"
python3 -m pip install --upgrade "transformers==4.36.2" "datasets==2.16.1" "accelerate==0.26.1" "evaluate==0.4.1" "bitsandbytes==0.42.0"
python3 -m pip install git+https://github.com/huggingface/trl@a3c5b7178ac4f65569975efadc97db2f3749c65e --upgrade
python3 -m pip install git+https://github.com/huggingface/peft@4a1559582281fc3c9283892caea8ccef1d6f5a4f --upgrade
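
An optional check (a small sketch) to confirm that the installed library versions match the pinned requirements above:

import datasets
import peft
import torch
import transformers
import trl

# Print the installed versions to compare them with the pinned requirements above.
for module in (torch, transformers, datasets, peft, trl):
    print(f"{module.__name__}: {module.__version__}")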

12.2. Implement the fine-tuning

Here is example code based on the content of the two blog posts How to Fine-Tune LLMs in 2024 with Hugging Face written by Phil Schmid and Fine-tuning LLMs via Hugging Face on IBM Cloud written by Niklas Heidloff.

The code does the following steps:

  1. Load the training data and split it for training
    • Remember that we may need to modify the training data input using the train_dataset.map function.
  2. Configure the bitsandbytes quantization
  3. Load the pre-trained model
  4. Set up the chat format for the training using a function from TRL (Transformer Reinforcement Learning)
  5. Prepare the supervised fine-tuning (or SFT for short) trainer
  6. Train the model
  7. Clean-up
import argparse
import torch
import re
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from trl import setup_chat_format
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

def format_data(input):
    list = []
    list = input['messages']
    i = 0
    for item in list:
        if (str(item['role']) == "assistant"):
            update = re.sub('X', 'XXX', str(item['content']))  # depends on your data
            item['content'] = update
            list[i] = item
        i = i + 1
    input['messages'] = list
    result = input
    print(f"Create conversion after: {result}")
    return result

def main(args):
    # Base model id
    model_id = "mistralai/Mistral-7B-Instruct-v0.2"

    # Finetuned model id
    output_directory = "/output/"
    peft_model_id = output_directory + "model"

    # Training data
    train_data_file = "/synthetic_data/synthetic_data_generated.jsonl"

    # Load training data and split
    train_dataset = load_dataset("json", data_files=train_data_file, field='messages', split="train")
    train_dataset = train_dataset.map(format_data, batched=False)

    torch.utils.checkpoint.use_reentrant = True  # Added by Thomas

    # Configure the bitsandbytes quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16  # Change from Niklas / Different from Phil Schmid's blog post
    )

    # Load the pre-trained model
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype=torch.float16,  # Change from Niklas / Different from Phil Schmid's blog post
        quantization_config=bnb_config
    )
    model.config.use_cache = False  # Added by Thomas

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.padding_side = 'right'

    # Use for the training the chat format
    model, tokenizer = setup_chat_format(model, tokenizer)

    peft_config = LoraConfig(
        lora_alpha=128,
        lora_dropout=0.05,
        r=128,                            # Change from Niklas / Different from Phil Schmid's blog post
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM"
    )

    args = TrainingArguments(
        output_dir=output_directory + "checkpoints",  # The output directory where the model predictions and checkpoints will be written.
        logging_dir=output_directory + "logs",        # Tensorboard log directory. Will default to runs/**CURRENT_DATETIME_HOSTNAME**.
        logging_strategy="steps",
        logging_steps=250,
        evaluation_strategy="steps",                  # Added by Thomas
        eval_steps=1000,                              # Added by Thomas
        save_steps=1000,                              # Number of update steps before two checkpoint saves.
        num_train_epochs=12,                          # Total number of training epochs to perform.
        per_device_train_batch_size=3,                # The batch size per GPU/TPU core/CPU for training.
        gradient_accumulation_steps=2,                # Number of update steps to accumulate the gradients for, before performing a backward/update pass.
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},  # Added by Thomas
        optim="adamw_torch_fused",
        save_strategy="epoch",
        learning_rate=2e-4,                           # The initial learning rate for Adam.
        fp16=True,                                    # Whether to use 16-bit (mixed) precision training instead of 32-bit training. Change from Niklas / Different from Phil Schmid's blog post
        max_grad_norm=0.3,                            # Maximum gradient norm (for gradient clipping).
        warmup_ratio=0.03,                            # Number of steps used for a linear warmup from 0 to learning_rate.
        lr_scheduler_type="constant",
        push_to_hub=False,                            # Change from Niklas / Different from Phil Schmid's blog post
        auto_find_batch_size=True                     # Change from Niklas / Different from Phil Schmid's blog post
    )

    # Supervised fine-tuning (or SFT for short)
    max_seq_length = 3072
    trainer = SFTTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        peft_config=peft_config,
        max_seq_length=max_seq_length,  # maximum packed length
        tokenizer=tokenizer,
        packing=True,
        dataset_kwargs={
            "add_special_tokens": False,
            "append_concat_token": False,
        }
    )

    # Train the model
    trainer.train()

    # Save the model and tokenizer
    trainer.model.save_pretrained(peft_model_id)
    tokenizer.save_pretrained(peft_model_id)

    # Clean-up
    del model
    del trainer
    torch.cuda.empty_cache()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    args = parser.parse_args()
    main(args)

One wording definition for the given source code:

  • Epoch: One epoch is one complete pass over the training dataset; learning optimization is based on the iterative process of gradient descent. Around 11 epochs are often cited as ideal for training on most datasets. See “Epoch, an Essential Notion” on DataScientest.

12.3. Run the fine-tuning

python3 finetune.py
  • Example output for only three epochs

Generating:

Generating train split: 14 examples [00:00, 484.39 examples/s]

Checkpoint:

Loading checkpoint shards:  67%|████████▋   | 2/3 [03:21<01:40, 101.00s/it]

Final result:

...
{'train_runtime': 168.1095, 'train_samples_per_second': 0.214, 'train_steps_per_second': 0.036, 'train_loss': 0.9815422693888346, 'epoch': 3.0}
100%|███████████████████████████████████████████████████████████| 6/6 [02:48<00:00, 28.02s/it]
...
  • After the model is fine-tuned, we find a new folder that contains the fine-tuned model.

Remember, the model name we defined in the source code above was model:

peft_model_id=output_directory+"model"
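
Before starting the TGI server in the next section, we can run a quick local smoke test of the saved adapter. This is a minimal sketch (the prompt is an assumption; the path matches the training script):

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

peft_model_id = "/output/model"   # same path as in the training script

# Load the fine-tuned adapter together with its base model and the saved tokenizer.
model = AutoPeftModelForCausalLM.from_pretrained(peft_model_id, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

messages = [
    {"role": "system", "content": "You are a weather data expert."},
    {"role": "user", "content": "Can you please give me all your weather forecast data?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))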

13. Run the fine-tuned model on a Text Generation Inference (TGI) server

We already completed the Text Generation Inference (TGI) setup in “11. Preparing and running the setup of a Virtual Server Instance on IBM Cloud with GPUs”.

The following code shows an example bash automation to start the Text Generation Inference (TGI) server.

#!/bin/bash
export HOME_PATH=$(pwd)
export TGI_VOLUME=${HOME_PATH}/output # path to the fine-tuned model
export MODEL=/data/model
export TAG=latest

docker container rm tgi_server
docker run -it --name tgi_server \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -e MAX_INPUT_LENGTH=2000 \
  -v ${TGI_VOLUME}:/data \
  ghcr.io/huggingface/text-generation-inference:${TAG} \
  --model-id $MODEL
  • Example output on starting the TGI server:
2024-XX-XXTXX:XX:05.902471Z INFO text_generation_launcher: Args { model_id: "/data/model", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 2000, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, enable_cuda_graphs: false, hostname: "43594b3833aa", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-XX-XXTXX:XX:05.902611Z INFO download: text_generation_launcher: Starting download process.
2024-XX-XXTXX:XX:08.234407Z INFO text_generation_launcher: Trying to load a Peft model. It might take a while without feedback
2024-XX-XXTXX:XX:45.290393Z INFO text_generation_launcher: Peft model detected.
2024-XX-XXTXX:XX:45.290442Z INFO text_generation_launcher: Merging the lora weights.
2024-XX-XXTXX:XX:55.359262Z INFO text_generation_launcher: Saving the newly created merged model to /data/model
2024-XX-XXTXX:XX:23.753381Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-XX-XXTXX:XX:23.753637Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-XX-XXTXX:XX:30.399408Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-XX-XXTXX:XX:30.458966Z INFO shard-manager: text_generation_launcher: Shard ready in 6.704379789s rank=0
2024-XX-XXTXX:XX:30.557558Z INFO text_generation_launcher: Starting Webserver
2024-XX-XXTXX:XX:30.770246Z INFO text_generation_router: router/src/main.rs:237: Using local tokenizer config
2024-XX-XXTXX:XX:30.775596Z WARN text_generation_router: router/src/main.rs:272: no pipeline tag found for model /data/model-example
2024-XX-XXTXX:XX:30.793831Z INFO text_generation_router: router/src/main.rs:291: Warming up model
2024-XX-XXTXX:XX:32.244560Z INFO text_generation_router: router/src/main.rs:328: Setting max batch total tokens to 227968
2024-XX-XXTXX:XX:32.244587Z INFO text_generation_router: router/src/main.rs:329: Connected
...

Now, we are ready to test our fine-tuned model.
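
For a first test against the running server, we can send a request to the TGI /generate endpoint, for example with this small sketch (the prompt wording is an assumption; in practice, the input should follow the chat template used during fine-tuning):

import requests

# Send a first question to the TGI server started above (port 8080 on the host maps to port 80 in the container).
payload = {
    "inputs": "Can you please give me all your weather forecast data?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.1},
}
response = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
print(response.json()["generated_text"])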

14. Implementing or using existing evaluation/testing frameworks to test/evaluate the fine-tuned model

When we test the newly fine-tuned model in a multi-turn way, we must save the conversation data between the model and the user.

Therefore, we can use LangChain for long-term memory with persistent storage, or implement our own testing framework using short-term memory to handle our multi-turn conversations, where we remember previous inputs, prompts, and context to generate the following response.

Here is an extract of example source code that builds the next prompt based on the previous input:

prompt = prompt_history + generate_prompt_from_template(
    prompt_template_2, prompt_question_template, question
)
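
A minimal sketch of such a short-term memory loop for testing (the plain User/Assistant prompt format and the endpoint are assumptions; a real implementation would reuse the chat template and the prompt-template helpers from the extract above):

import requests

TGI_URL = "http://localhost:8080/generate"   # the TGI server started in section 13

def ask(prompt_history: str, question: str) -> tuple[str, str]:
    # Append the new question to the conversation history, query the model,
    # and return the answer plus the updated history (short-term memory).
    prompt = prompt_history + f"\nUser: {question}\nAssistant:"
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 128}}
    answer = requests.post(TGI_URL, json=payload, timeout=60).json()["generated_text"]
    return answer, prompt + answer

history = "You are a weather data expert."
answer, history = ask(history, "Can you please give me all your weather forecast data?")
print(answer)
answer, history = ask(history, "Who has won the soccer world cup in 1954?")
print(answer)   # expected: a polite refusal, according to the golden ground truth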

There are some additional evaluation frameworks in the blog post “Open-Source LLM Evaluation Frameworks in 2024”, and for tests related to Text-to-SQL, the Defog eval framework may be helpful.

15. Define the metrics you want to use to display your evaluation results

Here are some potential metrics that can be useful: accuracy, latency, and grammar, and there are many more. We can implement these metrics ourselves (see the sketch after the list below) or use products for that. A good approach could be watsonx.governance; it covers the following topics listed in its description:

  • Govern generative AI (gen AI) and machine learning (ML) models from any vendor, including IBM® watsonx.ai™, Amazon Sagemaker and Bedrock, Google Vertex and Microsoft Azure.
  • Evaluate and monitor for model health, accuracy, drift, bias, and gen AI quality.
  • Access robust governance, risk, and compliance capabilities featuring workflows with approvals, customizable dashboards, risk scorecards, and reports.
  • Use factsheet capabilities to collect and document model metadata automatically across the AI model lifecycle.
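
If we implement the metrics ourselves, a minimal sketch for accuracy and latency could look like this (get_model_answer is an assumed callable that wraps the TGI request shown earlier; the matching rule is deliberately simple):

import time

def evaluate(test_cases, get_model_answer):
    # test_cases: list of (question, expected_sql) pairs from the golden ground truth.
    # get_model_answer: callable that sends a question to the fine-tuned model and returns its answer.
    correct = 0
    latencies = []
    for question, expected_sql in test_cases:
        start = time.perf_counter()
        answer = get_model_answer(question)
        latencies.append(time.perf_counter() - start)
        if expected_sql.strip().lower() in answer.strip().lower():   # simple containment check as accuracy
            correct += 1
    return {
        "accuracy": correct / len(test_cases),
        "avg_latency_seconds": sum(latencies) / len(latencies),
    }

results = evaluate(
    [("Can you please give me all your weather forecast data?", "SELECT * FROM WEATHER_DATA")],
    get_model_answer=lambda question: "SELECT * FROM WEATHER_DATA",   # replace with the real TGI call
)
print(results)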

16. Running our own fine-tuned model in a robust enterprise environment using watsonx.ai on-premises in a Cloud Pak for Data instance in a Virtual Private Cloud on AWS or on IBM Cloud

watsonx provides a wide range of essential capabilities and is available on several platforms for enterprise use. The following list is an extract of the main topics you can find on the official web page.

  • Open: Based on open technologies that provide a variety of models to cover enterprise use cases and compliance requirements.
  • Targeted: Targeted to specific enterprise domains like HR, customer service or IT operations to unlock new value.
  • Trusted: Designed with principles of transparency, responsibility and governance so you can manage legal, regulatory, ethical and accuracy concerns.
  • Empowering: Go beyond being an AI user and become an AI value creator, owning the value your models create.

We can deploy our custom fine-tuned model on-premises by following the instructions in the IBM Cloud documentation.

17. Summary

As you noticed in this blog post, there are a lot of topics you need to take care of from the initial idea until a fine-tuned model gets into production.

We followed the topics related to fine-tuning a model.

I know I didn’t do a deep dive into all the topics, but in this blog post I wanted to ensure that, at a minimum, I have a sentence for each of the topics or provide a potential entry point you can look into.

LLMs will change our lives and society in the future, and it is fantastic to be a part of the journey, changing the business rather than being changed by it.

I hope this was useful to you, and let’s see what’s next!

Greetings,

Thomas

#fine-tune, #llm, #multi-turn, #watsonx, #mistral, #ai
