!pip install langchain==0.0.228 yt_dlp==2023.7.6 tiktoken==0.5.1 docarray==0.38.0 chromadb==0.4.19 openai==0.28 --quiet
Objectives
Videos can be full of useful information, but getting hold of that info can be slow, since you need to watch the whole thing or try skipping through it. It can be much faster to use a bot to ask questions about the contents of the transcript.
In this project, you’ll download a tutorial video from YouTube, transcribe the audio, and create a simple Q&A bot to ask questions about the content.
- Understand the building blocks of a multimodal AI project
- Work with some of the fundamental concepts of LangChain
- Use the Whisper API to transcribe audio to text
- Combine LangChain and the Whisper API to build a bot that answers questions about any YouTube video
Before you begin
You’ll need a developer account with OpenAI and an API key. The API secret key will be stored in your ‘Environment Variables’ on the side menu. See the getting-started.ipynb notebook for details on setting this up.
Task 0: Setup
The project requires several packages that need to be installed into Workspace.
- langchain is a framework for developing generative AI applications.
- yt_dlp lets you download YouTube videos.
- tiktoken converts text into tokens.
- docarray makes it easier to work with multimodal data (in this case, mixing audio and text).
Instructions
Run the following code to install the packages.
Installing Relevant Libraries

Write and store your OpenAI API key in the .env file as OPENAI_API_KEY = "PASTE_YOUR_OPENAI_API_KEY".

Load the OpenAI API secret key from the .env file.
from dotenv import load_dotenv, find_dotenv
## Loading the Secrets from the .env file
print(load_dotenv())
True
Task 1: Import The Required Libraries
For this project we need the os and yt_dlp packages to download the YouTube video of your choosing, convert it to an .mp3, and save the file. We will also be using the openai package to make it easy to call the OpenAI models we will use.
Import the following packages.
- Import os
- Import openai
- Import yt_dlp with the alias youtube_dl
- From the yt_dlp package, import DownloadError
- Assign the value of os.getenv("OPENAI_API_KEY") to openai.api_key
Importing the required packages: os, openai, yt_dlp as youtube_dl, and DownloadError from yt_dlp
# Import the os package
import os
# Import Glob package
import glob
# Import the openai package
import openai
# Import the yt_dlp package as youtube_dl
import yt_dlp as youtube_dl
# Import DownloadError from yt_dlp
from yt_dlp import DownloadError
# Import DocArray
import docarray
We will also assign the environment variable "OPENAI_API_KEY" to openai.api_key. This keeps our key secure and removes the need to write it in the code here.

openai.api_key = os.getenv("OPENAI_API_KEY")
Task 2: Download the YouTube Video
After completing the setup, the first step is to download the video from YouTube and convert it to an audio file (.mp3).
We’ll download a DataCamp tutorial about machine learning in Python.
We will do this by setting a variable to store the youtube_url and the output_dir where we want the file to be stored.

The yt_dlp package allows us to download and convert the video in a few steps, but it does require some configuration. This code is provided to you.

Lastly, we will use the glob module to find any .mp3 files in the output_dir and store them in a list called audio_file, which will be used later to send each file to the Whisper model for transcription.
Create the following:

- Two variables: youtube_url to store the video URL and output_dir for the directory where the audio files will be saved.
- For this tutorial, set youtube_url to "https://www.youtube.com/watch?v=aqzxYofJ_ck" and output_dir to files/audio/. In the future, you can change these values.
- Use the ydl_config that is provided to you.
# An example YouTube tutorial video
youtube_url = "https://www.youtube.com/watch?v=aqzxYofJ_ck"

# Directory to store the downloaded video
output_dir = "files/audio/"

# Config for youtube-dl
ydl_config = {
    "format": "bestaudio/best",
    "postprocessors": [
        {
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
            "preferredquality": "192",
        }
    ],
    "outtmpl": os.path.join(output_dir, "%(title)s.%(ext)s"),
    "verbose": True,
}
# Check if the output directory exists, if not create it
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Print a message indicating which video is being downloaded
print(f"Downloading Video from the url : {youtube_url}")

# Attempt to download the video using the specified configuration
# If a DownloadError occurs, attempt to download the video again
try:
    with youtube_dl.YoutubeDL(ydl_config) as ydl:
        ydl.download([youtube_url])
except DownloadError:
    with youtube_dl.YoutubeDL(ydl_config) as ydl:
        ydl.download([youtube_url])
[debug] Encodings: locale cp1252, fs utf-8, pref cp1252, out UTF-8 (No VT), error UTF-8 (No VT), screen UTF-8 (No VT)
[debug] yt-dlp version stable@2023.07.06 [b532a3481] (pip) API
[debug] params: {'format': 'bestaudio/best', 'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3', 'preferredquality': '192'}], 'outtmpl': 'files/audio/%(title)s.%(ext)s', 'verbose': True, 'compat_opts': set()}
[debug] Python 3.10.5 (CPython AMD64 64bit) - Windows-10-10.0.19044-SP0 (OpenSSL 1.1.1n 15 Mar 2022)
[debug] exe versions: ffmpeg 4.2.2, ffprobe 4.2.2
[debug] Optional libraries: Cryptodome-3.20.0, brotli-1.1.0, certifi-2022.12.07, mutagen-1.47.0, sqlite3-2.6.0, websockets-12.0
[debug] Proxy map: {}
[debug] Loaded 1855 extractors
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, vcodec:vp9.2, channels, acodec, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), channels, acodec, lang, proto, size, br, asr, vext, aext, hasaud, id
[debug] Invoking http downloader on "https://rr4---sn-gwpa-qxaee.googlevideo.com/videoplayback?expire=1708196638&ei=vq7QZde2H4ru4-EPpdWaCA&ip=2409%3A40d0%3A100f%3Aa083%3A490c%3Ab20e%3A7354%3A2994&id=o-AOWVFk6tMKFyCdZnCS5VvP0VT03Sukh6VN-al1WR5Hxf&itag=251&source=youtube&requiressl=yes&xpc=EgVo2aDSNQ%3D%3D&mh=zw&mm=31%2C29&mn=sn-gwpa-qxaee%2Csn-gwpa-qxae7&ms=au%2Crdu&mv=m&mvi=4&pl=36&pcm2=yes&initcwndbps=622500&vprv=1&svpuc=1&mime=audio%2Fwebm&gir=yes&clen=10932652&dur=752.701&lmt=1654008313150389&mt=1708174646&fvip=4&keepalive=yes&fexp=24007246&c=ANDROID&txp=5318224&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cxpc%2Cpcm2%2Cvprv%2Csvpuc%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=AJfQdSswRQIhAKjTizAipxwXaTyCpXR5dhzxLl9eqsBGjOnNHGbf-ausAiAjcteMHs-gfhtd0D1gWI7rDbyBjuufbqO-JqBgIPUJVA%3D%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=APTiJQcwRQIhAMSBwIxjqtjt_2MwwBItrQLvQMekr1i4XH49TC2r5sMBAiBF4tDjQT_1sLALeO5lYGTVOcQ6QZKu1GfD2PlrVGG7vw%3D%3D"
[debug] File locking is not supported. Proceeding without locking
[debug] ffmpeg command line: ffprobe -show_streams "file:files\audio\Python Machine Learning Tutorial | Splitting Your Data | Databytes.webm"
[debug] ffmpeg command line: ffmpeg -y -loglevel "repeat+info" -i "file:files\audio\Python Machine Learning Tutorial | Splitting Your Data | Databytes.webm" -vn -acodec libmp3lame "-b:a" 192.0k -movflags "+faststart" "file:files\audio\Python Machine Learning Tutorial | Splitting Your Data | Databytes.mp3"
Downloading Video from the url : https://www.youtube.com/watch?v=aqzxYofJ_ck
[youtube] Extracting URL: https://www.youtube.com/watch?v=aqzxYofJ_ck
[youtube] aqzxYofJ_ck: Downloading webpage
[youtube] aqzxYofJ_ck: Downloading ios player API JSON
[youtube] aqzxYofJ_ck: Downloading android player API JSON
[youtube] aqzxYofJ_ck: Downloading m3u8 information
[info] aqzxYofJ_ck: Downloading 1 format(s): 251
[download] Destination: files\audio\Python Machine Learning Tutorial | Splitting Your Data | Databytes.webm
[download] 0.0% of 10.43MiB at 47.62KiB/s ETA 03:44[download] 0.0% of 10.43MiB at 120.00KiB/s ETA 01:28[download] 0.1% of 10.43MiB at 250.00KiB/s ETA 00:42[download] 0.1% of 10.43MiB at 468.77KiB/s ETA 00:22[download] 0.3% of 10.43MiB at 861.04KiB/s ETA 00:12[download] 0.6% of 10.43MiB at 851.41KiB/s ETA 00:12[download] 1.2% of 10.43MiB at 1.20MiB/s ETA 00:08[download] 2.4% of 10.43MiB at 1.28MiB/s ETA 00:07[download] 4.8% of 10.43MiB at 1.09MiB/s ETA 00:09[download] 9.6% of 10.43MiB at 1.09MiB/s ETA 00:08[download] 19.2% of 10.43MiB at 1.19MiB/s ETA 00:07[download] 31.8% of 10.43MiB at 1.31MiB/s ETA 00:05[download] 46.6% of 10.43MiB at 1.36MiB/s ETA 00:04[download] 60.9% of 10.43MiB at 1.41MiB/s ETA 00:02[download] 76.1% of 10.43MiB at 1.40MiB/s ETA 00:01[download] 89.3% of 10.43MiB at 1.46MiB/s ETA 00:00[download] 94.5% of 10.43MiB at 1.47MiB/s ETA 00:00[download] 94.5% of 10.43MiB at 76.60KiB/s ETA 00:07[download] 94.5% of 10.43MiB at 199.39KiB/s ETA 00:02[download] 94.6% of 10.43MiB at 387.87KiB/s ETA 00:01[download] 94.6% of 10.43MiB at 748.19KiB/s ETA 00:00[download] 94.8% of 10.43MiB at 1.16MiB/s ETA 00:00[download] 95.1% of 10.43MiB at 1.01MiB/s ETA 00:00[download] 95.7% of 10.43MiB at 1.24MiB/s ETA 00:00[download] 96.9% of 10.43MiB at 1.39MiB/s ETA 00:00[download] 99.3% of 10.43MiB at 1.64MiB/s ETA 00:00[download] 100.0% of 10.43MiB at 1.67MiB/s ETA 00:00[download] 100% of 10.43MiB in 00:00:07 at 1.39MiB/s
[ExtractAudio] Destination: files\audio\Python Machine Learning Tutorial | Splitting Your Data | Databytes.mp3
Deleting original file files\audio\Python Machine Learning Tutorial | Splitting Your Data | Databytes.webm (pass -k to keep)
To find the audio files, we will use the glob module to look in the output_dir for any .mp3 files and store the matches in a list called audio_file. This list will be used later to send each file to the Whisper model for transcription.

Create the following:

- A variable called audio_file that uses the glob module to find all files with the .mp3 extension
- Select the first file in the list and assign it to audio_filename
- To verify the filename, print audio_filename
Find the audio file in the output directory

# Find all the audio files in the output directory
audio_file = glob.glob(os.path.join(output_dir, "*.mp3"))

# Select the first audio file in the list
audio_filename = audio_file[0]

# Print the name of the selected audio file
print(audio_filename)
files/audio\Python Machine Learning Tutorial | Splitting Your Data | Databytes.mp3
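If the download or FFmpeg conversion fails, audio_file will be empty and indexing into it raises an IndexError. A small defensive check like the following sketch (not part of the provided solution) makes that failure easier to diagnose.

# Optional sanity check (sketch, not a required step): fail with a clear
# message if no .mp3 files were found in the output directory.
if not audio_file:
    raise FileNotFoundError(
        f"No .mp3 files found in {output_dir}. "
        "Check that the download and FFmpeg conversion succeeded."
    )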
Task 3: Transcribe the Video using Whisper
In this step we will take the downloaded and converted YouTube video and send it to the Whisper model to be transcribed. To do this we will create variables for the audio_file, the output_file, and the model.

Using these variables we will:

- Create a list to store the transcripts
- Read the audio file
- Send the file to the Whisper model using the openai package
To complete this step, create the following:

- A variable named audio_file that is assigned the audio_filename we created in the last step
- A variable named output_file that is assigned the value "files/transcripts/transcript.txt"
- A variable named model that is assigned the value "whisper-1"
- An empty list called transcripts
- A variable named audio that uses the open function in "rb" mode on the audio_file
- A variable to store the response from the openai.Audio.transcribe method, which takes in the model and audio variables
- Append the response["text"] to the transcripts list
import openai

# Define function parameters
audio_file = audio_filename
output_file = "files/transcripts/transcript.txt"
model = "whisper-1"

# Set the API key
openai.api_key = os.getenv("OPENAI_API_KEY")

# Transcribe the audio file to text using OpenAI API
print("Converting Audio to Text.....")

with open(audio_file, "rb") as audio:
    response = openai.Audio.transcribe(model, audio)

# Extract the transcript from the response
transcript = response['text']
Converting Audio to Text.....
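The provided solution transcribes a single file and stores the text in transcript. If you download more than one video, a loop like the following sketch (assuming the audio_file list from Task 2 and the transcripts list described in the instructions above) would transcribe each file in turn.

# Sketch for several audio files: transcribe each .mp3 found earlier and
# collect the text in the transcripts list.
transcripts = []
for filename in audio_file:
    with open(filename, "rb") as audio:
        response = openai.Audio.transcribe(model, audio)
    # Append the transcribed text for this file
    transcripts.append(response["text"])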
To save the transcript to a text file, we will use the code provided below:
# If an output file is specified, save the transcript to a .txt file
if output_file is not None:
    # Create the directory for the output file if it doesn't exist
    os.makedirs(os.path.dirname(output_file), exist_ok=True)
    # Write the transcript to the output file
    with open(output_file, "w") as file:
        file.write(transcript)

# Print the transcript to the console to verify it worked
print(transcript)
Hi, in this tutorial, we're going to look at a data pre-processing technique for machine learning called splitting your data. That is splitting your data set into a training set and a testing set. Now, before we get to the code, you might wonder, why do I need to do this? And really, there are going to be two problems if you don't. So if you train your machine learning model on your whole data set, then you've not tested the model on anything else. And that means you don't know how well your model is going to perform on other data sets. Secondly, it's actually even worse than this, because you risk overfitting the model. And that means that you've made your model work really well for one data set, but that gives a cost of model performance on other data sets. So not only do you not know how well the model is going to perform on other data sets, it's probably going to be worse than it could be. So you might also wonder when in your machine learning workflow, as you're writing these different types of code, when does this come? So what's the point when you need to split your data set? And it's normally the last thing you do before feature engineering. So if you do this after feature engineering, then you risk having a problem called data leakage. And that means information from the testing set is going to be available in the training set, which is a form of cheating because it's going to make your model appear to perform better than it actually does. So it's giving you a sort of false sense of security. So if you find yourself doing feature engineering and you've not yet split your data into training and testing sets, then you need to back up a step. We're going to take a look at some loan application data. So I'm using Datacamp Workspace here, and this is one of the data sets that is available as standard with Datacamp Workspace. So there is a workspace template available if you want to try doing your own analysis on this data set. So because this data is in CSV format, I'm going to import the pandas package as pd. That's the sort of standard alias for it. And then we actually just one function from scikit-learn. So this is in the scikit-learn model selection sub-module. And the function for splitting into training and testing sets is called train-test-split. So let's run that. All right. So this data is about loan applications. So I'm just going to call it loan-applications. And we can use pd.read.csv because it is in a CSV file. And the file is called loan-data.csv. Let me just check and see if I got that correct. loan-data.csv. Yes, it did. Okay. So let me just copy and paste this variable name so we can print out the results. Okay. So here you can see the table here. Actually, to make this easier, we've got 9,500 rows here. What I'm going to do is I'm just going to import the first 1,000 rows. And this is going to make some of the results a bit easier to understand. All right. So now we've only got 1,000 rows of data. You can see we've got this column called credit policy. This is going to be our response variable. And then we've got a load of other variables we can use as features. This purpose column, because it's a categorical column, that's going to become important how we deal with that in a moment. All right. So first of all, we'll just concentrate on the response. So the response variable is called credit.policy. And so each row is an application. 
So when the application meets the underwriting policy, so it meets the kind of loan criteria, it takes the value one and it's a zero if the application was not up to scratch. So I'm going to call this variable response. Some people like to call the response variable just lowercase y. I think response is a bit more meaningful, particularly in this case. So we're going to start off with the loan applications data frame. And I'm going to take the credit policy column and then I'm going to copy and paste this variable name again so you can see the results. So in this case, it is a Pandas series and it's got ones and zeros. All right. So we're going to use all the other columns for features. So again, let's call this variable features. Some people like to use capital X for this. So again, I'm going to start with loan applications with a T somewhere in there. And we use every column except credit policy. So this drop method is a little shortcut for just like saying I want everything in the data frame except a specific set of columns. Now, one extra little trick. So as I mentioned, we have this categorical variable called purpose. So we need to do one hot encoding on this in order to turn it into a series of numeric columns with zeros and ones. So we can use pd.get dummies for that. Let's code another line so it's easy to see how it breaks down. And then I'm going to print this out. So, so far, this is pretty standard code just for splitting your data set into a response variable and some features. So here we've got 19 columns now. So the one important thing to note is that purpose column is now several different columns with ones and zeros. All right. So now the crux of this. So we're going to split these, this, the response and the features up into training sets and testing sets. So we're going to call the train test split function and pass those two variables in. So we're going to call train test split. I'm going to pass it the response first and then the features. So let's run this and you can see the output. It's a little bit squiffy. So what we get is a list and it actually has four different things in it. So we get the responses and the features for the training set and the responses and features for the testing set. And actually the trick is just remembering which order they're returned in. So rather than returning a list, it's slightly easier if we use variable unpacking and we return four different objects together. So we're going to return four things from this list. Function call. So I'm going to call this response train and I think this one is a response test and then we go features train and features test. So we've got four different variables here and let me run that. So I can sort of print these out one at a time, but it's not that exciting because you've seen the whole data set before. It's just bits of it. So what's actually slightly... paste it twice. So what's actually slightly more useful is if we take a look at the shape of each of these and you can see how much data has ended up in each one. So I'm going to print this out. And we're going to do the same with the test set and features test. All right, so we've got four different variables here. Now you can see we started off with a thousand rows and so the responses, these are series, they don't have any columns, but you can see that 75% of them, so 750 out of a thousand, have ended up in the training set and 250, I'll just highlight that, have ended up in the testing set. And it's the same with the features as well. 
So we've got 750 in the training set and 250 in the test set. You've got 19 columns here because 19 features. So by default, 75% of our data has ended up in the training set, 25% has ended up in the testing set. And normally this is perfectly fine. Sometimes if you have a small data set, you might want a little bit more in the training set and a little bit less in the testing set. And if you've got a very large data set, then you might say, well, okay, I want 70% of data in the training set and 30% in the testing set. So in this case, because we've only got a thousand rows, let's shrink the testing set a little bit. So we're going to do the same again, but we're going to use the test size argument to get a smaller testing set. So I'm just going to copy and paste this code, run the typing again. So the change here, we're going to add an additional argument called test size, and we set that to 0.2. So we're going to have 80% in the training set, 20% in the test set. Let's run that. And again, I am going to copy and paste this code that shows the shapes of the output. And here you can see now we've got 800 of the thousand rows in the training set, 200 in the testing set. Now, one more thing you might be interested in doing is reproducing the values that are provided in both the training testing sets. What I mean by that is that by default, the training testing sets are randomly generated. So each of the rows from the data set is randomly allocated to one or other of these sets. If you're writing a report, you might want to have your results exactly the same every time. Another case where this is useful is if you're trying to find a problem with your model, then you might want to be able to demonstrate the problem precisely to someone else. So sometimes you want your code to be exactly reproducible, despite the fact that you've got random things in it. And for this work, you need to set a random seed, and you can use the random state argument to do this. So what I'm going to do is I'm going to run this code twice, but we're going to set the random state argument. You can make any number you like. I'm just going to pick 999. And so because we set the random state, this code's going to run the same thing twice. Let me give different variable names for what's being returned here. So I'm going to run this. So we've done the split twice. And so let's have a look at one of these. So if you have a look at features train, you can see the values. You've got row 46, row 748, and so on. And then let's add another one of these. So we're going to look at features train two. And you see, even though it's random, we have exactly the same results. So it's row 46, row 748, row 524, and so on. So exactly the same result in both cases. And that's more or less all there is to splitting your data into training and testing sets. I hope it's been helpful. you
Task 4: Create a TextLoader using LangChain
In order to use text or other types of data with LangChain, we must first convert that data into Documents. This is done using loaders. In this tutorial, we will use the TextLoader, which takes the text from our transcript and loads it into a document.

To complete this step, do the following:

- Import TextLoader from langchain.document_loaders
- Create a variable called loader that uses the TextLoader class, passing in the path of the transcript "./files/transcripts/transcript.txt"
- Create a variable called docs that is assigned the result of calling the loader.load() method
# Import the TextLoader class from the langchain.document_loaders module
from langchain.document_loaders import TextLoader

# Create a new instance of the TextLoader class, specifying the transcript file
loader = TextLoader("./files/transcripts/transcript.txt")

# Load the documents using the TextLoader instance
docs = loader.load()
# Show the first element of docs to verify it has been loaded
docs[0]
Document(page_content="Hi, in this tutorial, we're going to look at a data pre-processing technique for machine learning called splitting your data. That is splitting your data set into a training set and a testing set. Now, before we get to the code, you might wonder, why do I need to do this? And really, there are going to be two problems if you don't. So if you train your machine learning model on your whole data set, then you've not tested the model on anything else. And that means you don't know how well your model is going to perform on other data sets. Secondly, it's actually even worse than this, because you risk overfitting the model. And that means that you've made your model work really well for one data set, but that gives a cost of model performance on other data sets. So not only do you not know how well the model is going to perform on other data sets, it's probably going to be worse than it could be. So you might also wonder when in your machine learning workflow, as you're writing these different types of code, when does this come? So what's the point when you need to split your data set? And it's normally the last thing you do before feature engineering. So if you do this after feature engineering, then you risk having a problem called data leakage. And that means information from the testing set is going to be available in the training set, which is a form of cheating because it's going to make your model appear to perform better than it actually does. So it's giving you a sort of false sense of security. So if you find yourself doing feature engineering and you've not yet split your data into training and testing sets, then you need to back up a step. We're going to take a look at some loan application data. So I'm using Datacamp Workspace here, and this is one of the data sets that is available as standard with Datacamp Workspace. So there is a workspace template available if you want to try doing your own analysis on this data set. So because this data is in CSV format, I'm going to import the pandas package as pd. That's the sort of standard alias for it. And then we actually just one function from scikit-learn. So this is in the scikit-learn model selection sub-module. And the function for splitting into training and testing sets is called train-test-split. So let's run that. All right. So this data is about loan applications. So I'm just going to call it loan-applications. And we can use pd.read.csv because it is in a CSV file. And the file is called loan-data.csv. Let me just check and see if I got that correct. loan-data.csv. Yes, it did. Okay. So let me just copy and paste this variable name so we can print out the results. Okay. So here you can see the table here. Actually, to make this easier, we've got 9,500 rows here. What I'm going to do is I'm just going to import the first 1,000 rows. And this is going to make some of the results a bit easier to understand. All right. So now we've only got 1,000 rows of data. You can see we've got this column called credit policy. This is going to be our response variable. And then we've got a load of other variables we can use as features. This purpose column, because it's a categorical column, that's going to become important how we deal with that in a moment. All right. So first of all, we'll just concentrate on the response. So the response variable is called credit.policy. And so each row is an application. 
So when the application meets the underwriting policy, so it meets the kind of loan criteria, it takes the value one and it's a zero if the application was not up to scratch. So I'm going to call this variable response. Some people like to call the response variable just lowercase y. I think response is a bit more meaningful, particularly in this case. So we're going to start off with the loan applications data frame. And I'm going to take the credit policy column and then I'm going to copy and paste this variable name again so you can see the results. So in this case, it is a Pandas series and it's got ones and zeros. All right. So we're going to use all the other columns for features. So again, let's call this variable features. Some people like to use capital X for this. So again, I'm going to start with loan applications with a T somewhere in there. And we use every column except credit policy. So this drop method is a little shortcut for just like saying I want everything in the data frame except a specific set of columns. Now, one extra little trick. So as I mentioned, we have this categorical variable called purpose. So we need to do one hot encoding on this in order to turn it into a series of numeric columns with zeros and ones. So we can use pd.get dummies for that. Let's code another line so it's easy to see how it breaks down. And then I'm going to print this out. So, so far, this is pretty standard code just for splitting your data set into a response variable and some features. So here we've got 19 columns now. So the one important thing to note is that purpose column is now several different columns with ones and zeros. All right. So now the crux of this. So we're going to split these, this, the response and the features up into training sets and testing sets. So we're going to call the train test split function and pass those two variables in. So we're going to call train test split. I'm going to pass it the response first and then the features. So let's run this and you can see the output. It's a little bit squiffy. So what we get is a list and it actually has four different things in it. So we get the responses and the features for the training set and the responses and features for the testing set. And actually the trick is just remembering which order they're returned in. So rather than returning a list, it's slightly easier if we use variable unpacking and we return four different objects together. So we're going to return four things from this list. Function call. So I'm going to call this response train and I think this one is a response test and then we go features train and features test. So we've got four different variables here and let me run that. So I can sort of print these out one at a time, but it's not that exciting because you've seen the whole data set before. It's just bits of it. So what's actually slightly... paste it twice. So what's actually slightly more useful is if we take a look at the shape of each of these and you can see how much data has ended up in each one. So I'm going to print this out. And we're going to do the same with the test set and features test. All right, so we've got four different variables here. Now you can see we started off with a thousand rows and so the responses, these are series, they don't have any columns, but you can see that 75% of them, so 750 out of a thousand, have ended up in the training set and 250, I'll just highlight that, have ended up in the testing set. And it's the same with the features as well. 
So we've got 750 in the training set and 250 in the test set. You've got 19 columns here because 19 features. So by default, 75% of our data has ended up in the training set, 25% has ended up in the testing set. And normally this is perfectly fine. Sometimes if you have a small data set, you might want a little bit more in the training set and a little bit less in the testing set. And if you've got a very large data set, then you might say, well, okay, I want 70% of data in the training set and 30% in the testing set. So in this case, because we've only got a thousand rows, let's shrink the testing set a little bit. So we're going to do the same again, but we're going to use the test size argument to get a smaller testing set. So I'm just going to copy and paste this code, run the typing again. So the change here, we're going to add an additional argument called test size, and we set that to 0.2. So we're going to have 80% in the training set, 20% in the test set. Let's run that. And again, I am going to copy and paste this code that shows the shapes of the output. And here you can see now we've got 800 of the thousand rows in the training set, 200 in the testing set. Now, one more thing you might be interested in doing is reproducing the values that are provided in both the training testing sets. What I mean by that is that by default, the training testing sets are randomly generated. So each of the rows from the data set is randomly allocated to one or other of these sets. If you're writing a report, you might want to have your results exactly the same every time. Another case where this is useful is if you're trying to find a problem with your model, then you might want to be able to demonstrate the problem precisely to someone else. So sometimes you want your code to be exactly reproducible, despite the fact that you've got random things in it. And for this work, you need to set a random seed, and you can use the random state argument to do this. So what I'm going to do is I'm going to run this code twice, but we're going to set the random state argument. You can make any number you like. I'm just going to pick 999. And so because we set the random state, this code's going to run the same thing twice. Let me give different variable names for what's being returned here. So I'm going to run this. So we've done the split twice. And so let's have a look at one of these. So if you have a look at features train, you can see the values. You've got row 46, row 748, and so on. And then let's add another one of these. So we're going to look at features train two. And you see, even though it's random, we have exactly the same results. So it's row 46, row 748, row 524, and so on. So exactly the same result in both cases. And that's more or less all there is to splitting your data into training and testing sets. I hope it's been helpful. you", metadata={'source': './files/transcripts/transcript.txt'})
Task 5: Creating an In-Memory Vector Store
Now that we have created a Document from the transcription, we will store that Document in a vector store. Vector stores allow LLMs to find similar pieces of data based on their distance in embedding space.

For large amounts of data, it is best to use a dedicated vector database. Since we are only using one transcript for this tutorial, we can create an in-memory vector store using the docarray package.

We will also tokenize our queries using the tiktoken package. This means that our query will be separated into smaller parts, either phrases, words, or characters. Each of these parts is assigned a token, which helps the model “understand” the text and its relationships with other tokens.
Instructions

- Import the tiktoken package.
# Import the tiktoken package
import tiktoken
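To see what tokenization means in practice, here is a minimal sketch (not a required project step) that encodes a sample query with tiktoken. The encoding name "cl100k_base" is the one used by gpt-3.5-turbo-style models.

# Minimal tokenization sketch (illustration only, not a required step)
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("What is this tutorial about?")
print(tokens)                   # a short list of integer token IDs
print(encoding.decode(tokens))  # decodes back to the original text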
Task 6: Create the Document Search
We will now use LangChain to complete some important operations to create the question-and-answer experience. Let's import the following:
- Import RetrievalQA from langchain.chains - this chain first retrieves documents from an assigned retriever and then runs a QA chain to answer questions over those documents
- Import ChatOpenAI from langchain.chat_models - this imports the ChatOpenAI model that we will use to query the data
- Import DocArrayInMemorySearch from langchain.vectorstores - this gives the ability to search over the vector store we have created
- Import OpenAIEmbeddings from langchain.embeddings - this will create embeddings for the data stored in the vector store
- Import display and Markdown from IPython.display - this will create formatted responses to the queries
# Import the RetrievalQA class from the langchain.chains module
from langchain.chains import RetrievalQA
# Import the ChatOpenAI class from the langchain.chat_models module
from langchain.chat_models import ChatOpenAI
# Import the DocArrayInMemorySearch class from the langchain.vectorstores module
from langchain.vectorstores import DocArrayInMemorySearch
# Import the OpenAIEmbeddings class from the langchain.embeddings module
from langchain.embeddings import OpenAIEmbeddings
# Import display and Markdown from IPython.display to format the responses
from IPython.display import display, Markdown
Now we will create a vector store using DocArrayInMemorySearch, which will search through the embeddings created by the OpenAIEmbeddings function.

To complete this step:

- Create a variable called db
- Assign the db variable the result of the DocArrayInMemorySearch.from_documents method
- In the DocArrayInMemorySearch.from_documents method, pass in docs and a call to OpenAIEmbeddings()
# Create a new DocArrayInMemorySearch instance from the specified documents and embeddings
db = DocArrayInMemorySearch.from_documents(
    docs, OpenAIEmbeddings()
)
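As a quick check (a sketch, not a required step), you can run a similarity search directly against the vector store to confirm the transcript was embedded. The sample query text here is just an illustration.

# Optional check (sketch): query the in-memory store directly.
# similarity_search returns the Documents most similar to the query text.
results = db.similarity_search("How do I split my data?")
print(results[0].page_content[:200])  # first 200 characters of the best match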
We will now create a retriever from the db we created in the last step. This enables the retrieval of the stored embeddings. Since we are also using the ChatOpenAI model, we will assign that as our LLM.

Create the following:

- A variable called retriever that is assigned db.as_retriever()
- A variable called llm that creates the ChatOpenAI model with a temperature of 0.0. This will control the variability of the responses we receive from the LLM.
# Convert the DocArrayInMemorySearch instance to a retriever
retriever = db.as_retriever()

# Create a new ChatOpenAI instance with a temperature of 0.0
llm = ChatOpenAI(temperature=0.0, model_name="gpt-3.5-turbo")
Our last step before starting to ask questions is to create the RetrievalQA chain. This chain takes in:

- The llm we want to use
- The chain_type, which controls how the retrieved documents are passed to the model
- The retriever that we have created
- An option called verbose that allows us to see the separate steps of the chain

Create a variable called qa_stuff. This variable will be assigned the result of the RetrievalQA.from_chain_type method.

Use the following settings inside this method:

- llm=llm
- chain_type="stuff"
- retriever=retriever
- verbose=True
# Create a new RetrievalQA instance with the specified parameters
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,                # The ChatOpenAI instance to use for generating responses
    chain_type="stuff",     # The type of chain to use for the QA system
    retriever=retriever,    # The retriever to use for retrieving relevant documents
    verbose=True,           # Whether to print verbose output during retrieval and generation
)
Task 7: Create the Queries
Now we are ready to create queries about the YouTube video and read the responses from the LLM. This is done by first creating a query and then running the RetrievalQA chain we set up in the last step, passing it the query.
To create the questions to ask the model, complete the following steps:

- Create a variable called query and assign it the string value "What is this tutorial about?"
- Create a response variable that will store the result of qa_stuff.run(query)
- Show the response
# Set the query to be used for the QA system
query = "What is this tutorial about?"

# Run the query through the RetrievalQA instance and store the response
response = qa_stuff.run(query)

# Print the response to the console
print(response)
> Entering new chain...
> Finished chain.
This tutorial is about a data pre-processing technique for machine learning called splitting your data. It focuses on splitting a data set into a training set and a testing set to avoid overfitting and to assess the model's performance on unseen data. The tutorial also covers when to split the data in the machine learning workflow and how to handle categorical variables using one-hot encoding. Additionally, it explains how to adjust the size of the training and testing sets and how to ensure reproducibility by setting a random seed.
# Set the query to be used for the QA system
query = "What is the difference between a training set and test set?"

# Run the query through the RetrievalQA instance and store the response
response = qa_stuff.run(query)

# Print the response to the console
print(response)
> Entering new chain...
> Finished chain.
The training set is used to train the machine learning model, meaning the model learns patterns and relationships from this data. The test set, on the other hand, is used to evaluate the performance of the trained model on unseen data. This helps assess how well the model generalizes to new data and gives an indication of its predictive accuracy.
# Set the query to be used for the QA system
query = "Who should watch this lesson?"

# Run the query through the RetrievalQA instance and store the response
response = qa_stuff.run(query)

# Print the response to the console
print(response)
> Entering new chain...
> Finished chain.
This lesson on splitting data into training and testing sets is beneficial for individuals who are learning about data pre-processing techniques for machine learning. It is particularly useful for those who are new to machine learning and want to understand the importance of splitting data to avoid overfitting and data leakage. Additionally, individuals who are interested in understanding how to implement train-test-split in Python using libraries like pandas and scikit-learn would find this tutorial helpful.
# Set the query to be used for the QA system
query = "Who is the greatest football team on earth?"

# Run the query through the RetrievalQA instance and store the response
response = qa_stuff.run(query)

# Print the response to the console
print(response)
> Entering new chain...
> Finished chain.
I don't know the answer to that question as it is subjective and varies depending on personal preferences and opinions.
# Set the query to be used for the QA system
query = "How long is the circumference of the earth?"

# Run the query through the RetrievalQA instance and store the response
response = qa_stuff.run(query)

# Print the response to the console
print(response)
> Entering new chain...
> Finished chain.
I don't know the exact length of the circumference of the Earth.
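The display and Markdown imports mentioned earlier can be used to render the answers as formatted output in the notebook. A small sketch:

# Render the most recent response as formatted Markdown in the notebook
# (uses the display and Markdown imports from the earlier import step).
display(Markdown(response))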