
Fine tuning LLMs - Episode 1

Published: 2023-04-18
Updated: 2023-04-18

Let’s create an AI character together, using open source components.

Follow along and you should be able to replicate these results.

At a high level, for this first experiment, we’ll start with a large language model, Llama (7B), and fine-tune it using our own conversation dataset. We’ll use a nifty little library called xturing that makes it all very easy, once you’ve gotten it to build.

Before we begin, I must send out a special thanks to the open source community who publish their libraries and datasets. Without you, this technology would be locked away under lock and key, controlled by a select few; the pace of development would be slower and the benefits concentrated.

Prerequisites

Grab and clone xturing
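
If you want to poke at the source (the pip install below works too), clone it from GitHub. I’m assuming here that the repository still lives at its original home under the stochasticai org:

git clone https://github.com/stochasticai/xturing.git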

To save you some time: it only works on Linux and macOS. On Windows you must use WSL, which is what I used in this demo.

Install dependencies

On Windows, this assumes you are in a WSL terminal.

Make yourself a Python virtual environment. If this fails, you need some version of Python 3 installed; mine was > 3.10. For those new to this: every time you open a new terminal, you’ll want to re-run the activate script.

python3 -m venv env
. env/bin/activate

Now that you have a nice isolated Python environment, you can start installing dependencies. (I’m not actually sure torchvision and torchaudio are required.)

Note: with the virtual environment activated, this installs into the environment rather than system-wide.

pip install torch torchvision torchaudio 

Install CUDA using their nifty platform selector. Note that for WSL, you want the WSL-Ubuntu distribution.

Install CUDA
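
Once that’s done, it’s worth a quick sanity check that the GPU is visible from your terminal. On WSL2, nvidia-smi is provided via the Windows driver, so if this prints your GPU you’re in good shape:

nvidia-smi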

pip install xturing

Some torch extensions are compiled on the fly, which needs the Python dev headers. This was required on WSL, and is likely also needed on Linux.

I didn’t need this on my Mac, but… I may have had it already installed.

Note: unlike the pip installs above, this one does install system-wide.

sudo apt-get install python3-dev 

You should now be able to run the xturing example code. Try:

xturing chat -m llama_lora

For the rest of this, I’ll assume you’ve tried a few of the xTuring examples in the documentation; otherwise I’d just be rehashing the same stuff.

Dataset preparation

I had the benefit of having a dataset generator on hand. It’s not the usual source, mind you: I previously worked on an AI-powered language learning game called Voxabular, which contains 100k+ lines of custom dialog for its characters.

I refactored the code to produce a clean dataset of player queries with answers from that library of lines. Turns out I didn’t need that many: I have 800 or so for Marge, and that did fine. There is a lot of repetition and variation in my dataset, so I think you could probably get away with much less; focusing on the lines that define your character’s unique personality, plus key background information, should do it.

Here’s a sample of Marge’s dialog queries and responses, from the game:

[
  {
    "queryLine": "I need directions.",
    "responseLine": "And where to might you be going?"
  },
  {
    "queryLine": "Do you know where the ... is?",
    "responseLine": "Where do you want to go?"
  },
  {
    "queryLine": "I need help finding Main Street.",
    "responseLine": "This is Main Street, hon.\nJust exit here and you're there."
  },
  {
    "queryLine": "Where is the hotel?",
    "responseLine": "The hotel? That's 18 Wednesday Street, if I'm not mistaken."
  },
  {
    "queryLine": "I forgot your name.",
    "responseLine": "My name is Marge.\nWhat do they call you?"
  },
  {
    "queryLine": "Where are you from?",
    "responseLine": "I was born in the south, just past the border,\nshortly after my parents emigrated from Mexico.\nMy mama was pregnant when they did the trip!"
  },
  {
    "queryLine": "Can I ask about your family?",
    "responseLine": "It's just me and my pops now.\nMama has gone to a better place.\nWhile I never had any siblings,\nI grew up surrounded by cousins!"
  }
]

I followed the examples in the xturing documentation to get the dataset into the right format. In this case the trick was using the TextDataset class, since I just had text -> target pairs. (Later I’ll show how to do it if you just have raw text. Stay tuned.)
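
Here’s a minimal sketch of that conversion. The queryLine/responseLine fields match the sample above; the marge.json filename is just a placeholder for wherever your pairs live, and the dict-of-lists format is what the xturing docs describe for TextDataset:

import json

from xturing.datasets import TextDataset

# Load the dialog pairs shown above ("marge.json" is a placeholder path)
with open("marge.json") as f:
    pairs = json.load(f)

# TextDataset takes parallel lists: inputs under "text", outputs under "target"
dataset = TextDataset({
    "text": [p["queryLine"] for p in pairs],
    "target": [p["responseLine"] for p in pairs],
})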

Training

I tried a few models, but the first one is Llama LoRA. Using LoRA is key here (see the LoRA paper). I’m simplifying, but instead of updating the whole network, LoRA freezes the base weights and trains a small set of low-rank adapter matrices injected into it. That makes training faster, and in theory it means you only have to save/load the small adapter instead of storing the whole model again. Huge gains!

Training this for 3 epochs took about 8 hours on my NVIDIA 3090. (I know! Last gen!)
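
In code, the whole run boils down to something like this, following the xturing README (the save path is my own choice, and the details may differ by library version):

from xturing.models import BaseModel

# "llama_lora" is the same model key the CLI used earlier
model = BaseModel.create("llama_lora")

# Fine-tune on the TextDataset we built in the previous section
model.finetune(dataset=dataset)

# Sanity-check a completion, then save the fine-tuned weights
print(model.generate(texts=["Where is the hotel?"]))
model.save("./llama_lora_marge")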

Results and interpretation

USER > hello
MODEL > . Greetings.
USER > hello.
MODEL >  Greetings.
USER > How are you?
MODEL >  I'm Marge. Pleased to meet you.
USER > How was your day?
MODEL >  I'm a mechanic, hon.
I fix cars for a living! How was your day?
USER > I'm looking for the Library.
MODEL >  A Library...
I have no idea.
USER > Do you know where the Inn is?
MODEL >  Bella's place is at 18 Wednesday Street.
USER > Nice!
MODEL >  See you around.
USER > Do you have any hobbies?
MODEL >  I'm a beekeeper in my spare time.
Huh? Caught you off-guard, hon?
Bees are wonderful animals that should be protected!
USER > What other animals do you like?
MODEL >  Cats are my weakness.
I'm partial to tabbies myself.
USER > Are you married?
MODEL >  I'm divorced.
Married women don't make the cut.
USER > Do you ever think about quitting your job and do beekeeping full time?
MODEL >  I'm a beekeeper in my spare time, hon.
Honey, you don't have to lie about it!
USER > What do you dream about?
MODEL >  It's nice to know you, hon.

The model is fundamentally just a token predictor, and it’s key to get the stop tokens right. In a later post I’ll show how OpenAI helps you get those right, and what they use; this will all make more sense then.

In the meantime, notice how it was important that I finish the prompts with sentence punctuation, or else the completion/response would do it for me. There is no magic prompt/response pair: during training, the two get concatenated, and the whole sequence is learned.
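
To make that concrete, here’s a sketch of the idea. This is not xturing’s exact template; I’m using "</s>", LLaMA’s usual end-of-sequence token, as the stop token:

# Roughly what one training example looks like once the prompt and
# response are joined into the single token stream the model learns.
prompt = "Where is the hotel?"
response = "The hotel? That's 18 Wednesday Street, if I'm not mistaken."
training_sequence = prompt + " " + response + "</s>"

# Leave the trailing punctuation off your prompt at inference time and
# the model will happily predict it for you -- hence the stray "." in
# the first exchange above.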

A few of these responses are rote memorizations from the dataset. Some of them get creative! (Great!)

Of note is this one:

USER > Are you married?
MODEL >  I'm divorced.
Married women don't make the cut.

That’s all new. Nice improv.

This one doesn’t seem too coherent:

USER > What do you dream about?
MODEL >  It's nice to know you, hon.

It’s like it didn’t understand what I was talking about.

And then there’s this one:

USER > What other animals do you like?
MODEL >  Cats are my weakness.
I'm partial to tabbies myself.

I don’t know what tabbies are. In the dataset, Marge says she has a mouse, so I wouldn’t call this one accurate. Maybe the word ‘other’ threw it off?

Conclusion and further work

Amazing what you can do with off-the-shelf open source tools! This was all running on my local machine. It still needs some work… I think results get dramatically better as you increase the number of parameters of the base model. The theory also says this gets better as you add some more data.

There’s probably something you can do with generating a dataset using a larger model, and using that to train a smaller one.

Next up, we’re going to try a few other base models to see how they compare.

Watch this space!