
Querying the Database in LangChain Style

· 7 min read
Adrian Png

Robots communicating with tin cans

Starting from langchain_community version 0.0.31, Oracle introduced a new LangChain document loader that allows you to load the results of an SQL query against an Oracle Autonomous Database (ADB) as documents that your retrieval-augmented generation (RAG) application can use for text generation tasks. In a previous article, I described how the select ai feature available in ADBs combines database metadata and large language models (LLMs) to generate an SQL statement that answers a question asked in natural language. In this article, we will take a quick look at how we can load data from the ADB, generate embeddings, and then use an LLM to respond to human questions.

As usual, we'll start by installing the required Python libraries using your favourite tool, such as Conda. At the time of writing, the latest versions of langchain and langchain-community are 0.1.16 and 0.0.32 respectively.

conda install -c conda-forge \
    langchain=0.1.14 \
    langchain-community=0.0.31 \
    faiss=1.7.4
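
If you prefer pip, a roughly equivalent install might look like the sketch below. Note that faiss-cpu is the PyPI build of FAISS, and the oracledb and oci packages are assumed here to cover the database driver and OCI SDK that the later snippets rely on.

pip install \
    "langchain==0.1.14" \
    "langchain-community==0.0.31" \
    faiss-cpu \
    oracledb \
    oci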

Then download the ADB's database wallet and unzip the files to the desired path, for example, /path/to/adb_config. Create a file to store all of the required database connection parameters. I keep these in a file named app.ini, together with other application parameters. These values will be loaded using Python's configparser library.

[ADB]
user=mydbuser
password=mydbuserpassword
tns_name=myadb_low
config_dir=/path/to/adb_config
wallet_location=/path/to/adb_config
wallet_password=myadbwalletpassword

[GENAI]
auth_type=API_KEY
auth_profile=myociauthprofile
compartment_id=ocid1.compartment.oc1...
embed_model_id=cohere.embed-english-v3.0
inference_model_id=cohere.command

In this article's example, the driver connects to the ADB in "thin" mode, so the config_dir, wallet_location, wallet_password, and tns_name parameters are all required. You may test that the database connection can be established successfully using the snippet below:

import configparser
import oracledb

config = configparser.ConfigParser()
config.read("app.ini")

connection = oracledb.connect(
    config_dir=config["ADB"]["config_dir"],
    user=config["ADB"]["user"],
    password=config["ADB"]["password"],
    dsn=config["ADB"]["tns_name"],
    wallet_location=config["ADB"]["wallet_location"],
    wallet_password=config["ADB"]["wallet_password"]
)

print("Database version {version}".format(version=connection.version))
connection.close()

With the database connection parameters validated, proceed to load some data from the ADB. Here, I used a used car price prediction dataset that I had previously obtained from Kaggle for another machine learning experiment. In the code below, I provide a simple SQL query to retrieve details about the first 96 cars in the dataset, generate the embeddings, and then create the vector store and retriever. The code is commented to explain each step. Why 96? That is the maximum number of string inputs that can be submitted per call to the OCI Generative AI (GenAI) service's embeddings API, a limitation I had also highlighted in the previous post. We can of course load more data; it just requires more thought and code (a sketch of one approach follows the retriever discussion below).

from langchain_community.document_loaders import OracleAutonomousDatabaseLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OCIGenAIEmbeddings
import configparser

# Read the configuration.
config = configparser.ConfigParser()
config.read("app.ini")

# Prepare the SQL query.
SQL_QUERY = """
select id, yr_mfr, fuel_type, transmission
from used_cars_train
order by id
fetch first 96 rows only
"""

# Instantiate an Oracle ADB document loader.
doc_loader = OracleAutonomousDatabaseLoader(
    query=SQL_QUERY,
    user=config["ADB"]["user"],
    password=config["ADB"]["password"],
    config_dir=config["ADB"]["config_dir"],
    wallet_location=config["ADB"]["wallet_location"],
    wallet_password=config["ADB"]["wallet_password"],
    tns_name=config["ADB"]["tns_name"],
)

# Load the documents.
docs = doc_loader.load()

# Instantiate the OCI GenAI embeddings model.
embeddings = OCIGenAIEmbeddings(
    auth_type=config["GENAI"]["auth_type"],
    auth_profile=config["GENAI"]["auth_profile"],
    model_id=config["GENAI"]["embed_model_id"],
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id=config["GENAI"]["compartment_id"]
)

# Create the vector store.
vectorstore = FAISS.from_documents(
    documents=docs,
    embedding=embeddings
)

# Create the retriever.
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 30, "lambda_mult": 0.8}
)

For the retriever, rather than use the default similarity search, I used the maximal marginal relevance (MMR) search type and tweaked it to return the top 3 results from a pool of 30 candidate documents, with lambda_mult set to 0.8, that is, favouring relevance over diversity.
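
As an aside, the 96-input cap only applies to a single embeddings call. If your query returns more rows, one way around it is to build the vector store from the first batch and then add the remaining documents in slices of at most 96. This is a minimal sketch, not something from the example above:

# Minimal sketch: stay within the 96-inputs-per-call limit by building the
# store from the first batch, then adding the rest in slices of 96.
BATCH_SIZE = 96

vectorstore = FAISS.from_documents(
    documents=docs[:BATCH_SIZE],
    embedding=embeddings
)

for start in range(BATCH_SIZE, len(docs), BATCH_SIZE):
    vectorstore.add_documents(docs[start:start + BATCH_SIZE])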

The next block of code creates the chain responsible for generating the answers to users' questions.

from langchain.schema.runnable import RunnablePassthrough
from langchain_community.llms import OCIGenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

template = """Answer the question and explain based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

llm = OCIGenAI(
    auth_type=config["GENAI"]["auth_type"],
    auth_profile=config["GENAI"]["auth_profile"],
    model_id=config["GENAI"]["inference_model_id"],
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id=config["GENAI"]["compartment_id"],
    model_kwargs={"temperature": 0.0, "max_tokens": 200}
)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Then we ask our question.

print(chain.invoke("How many cars were manufactured in 2013?"))

And based on the data we get the following results:

Based on the provided context, there were 3 cars manufactured in 2013. 

This is deduced from the fact that you have provided 3 different car objects, each with a manufacturing date of 2013. The information for each car is represented in the following documents:
- ID: 48, manufactured in 2013 with a diesel fuel type and manual transmission
- ID: 18, manufactured in 2013 with a petrol fuel type and manual transmission
- ID: 4, manufactured in 2013 with a diesel fuel type and manual transmission

Would you like me to provide additional information about any of these car objects?

If I queried the database and counted the number of rows where YR_MFR = 2013, the answer would be 696, so how did the LLM figure out that there are only 3? To understand how it arrived at this answer, I looked at what was presented to the LLM as the context. To do that, I iterated through all the documents in the store.

for doc in docs:
    print(doc.page_content)

Each "document" was the JSON value of each row in the table, and below are the first ten documents that the retriever would search against.

{'ID': 1, 'YR_MFR': 2015, 'FUEL_TYPE': 'petrol', 'TRANSMISSION': 'manual'}
{'ID': 2, 'YR_MFR': 2016, 'FUEL_TYPE': 'petrol', 'TRANSMISSION': 'manual'}
{'ID': 3, 'YR_MFR': 2017, 'FUEL_TYPE': 'petrol', 'TRANSMISSION': 'manual'}
{'ID': 4, 'YR_MFR': 2013, 'FUEL_TYPE': 'diesel', 'TRANSMISSION': 'manual'}
{'ID': 5, 'YR_MFR': 2015, 'FUEL_TYPE': 'petrol', 'TRANSMISSION': 'manual'}
{'ID': 6, 'YR_MFR': 2018, 'FUEL_TYPE': 'petrol', 'TRANSMISSION': None}
{'ID': 7, 'YR_MFR': 2012, 'FUEL_TYPE': 'diesel', 'TRANSMISSION': 'manual'}
{'ID': 8, 'YR_MFR': 2012, 'FUEL_TYPE': 'petrol', 'TRANSMISSION': 'manual'}
{'ID': 9, 'YR_MFR': 2014, 'FUEL_TYPE': 'diesel', 'TRANSMISSION': 'manual'}
{'ID': 10, 'YR_MFR': 2018, 'FUEL_TYPE': 'petrol', 'TRANSMISSION': 'manual'}

As mentioned, I had used the MMR search type with the additional parameters {"k": 3, "fetch_k": 30, "lambda_mult": 0.8}. The retriever fetched the 30 rows (out of 96) that best matched my question about the year 2013, and then placed only the top 3 in the context. Hence, the LLM only ever saw three documents that matched the criteria, and deduced that only 3 cars were manufactured that year.
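
To see this for yourself, you can ask the retriever directly for the documents it would hand to the LLM, outside of the chain. This is just a quick diagnostic, not part of the original code:

# Print the documents the MMR retriever selects for this question.
for doc in retriever.get_relevant_documents("How many cars were manufactured in 2013?"):
    print(doc.page_content)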

I modified the SQL query to return summary information about the dataset, to see if I could get a more meaningful response.

SQL_QUERY = """
select yr_mfr, fuel_type, count(1) as number_of_cars_manufactured
from used_cars_train
group by yr_mfr, fuel_type
"""

This time, it did better, correctly identifying 324 diesel cars, 342 petrol cars, and 29 cars that run on a mix of petrol and CNG. However, it still missed one data point (the single petrol and LPG car), and it is obviously bad at arithmetic: 324 + 342 + 29 adds up to 695, not 422.

The total number of cars manufactured in 2013 is 422. 

This is derived from the sum of the values of the NUMBER_OF_CARS_MANUFACTURED from each document, which are 324 from diesel, 342 from petrol, and 29 from petrol and cng.

Would you like me to breakdown the different fuel types manufactured in 2013?

Here are the expected values if we asked a similar question in SQL.

column fuel_type format a30;
select yr_mfr, fuel_type, count(1) as number_of_cars
from used_cars_train
where yr_mfr = 2013
group by yr_mfr, fuel_type
order by yr_mfr;

    YR_MFR FUEL_TYPE                      NUMBER_OF_CARS
---------- ------------------------------ --------------
      2013 diesel                                    324
      2013 petrol & lpg                                1
      2013 petrol                                    342
      2013 petrol & cng                               29

This post took longer than usual for me to write, as the outcome wasn't as positive as most people would expect. I think that's because I am using the technology for the wrong use case and dataset. If the LangChain application included an agent that first constructed an appropriate SQL statement using an LLM and then executed the query, perhaps the outcome would be more desirable. Or, if each row of data included a column of unstructured text describing the look, features, and condition of the car, then the LLM could respond with a list of suitable vehicles that I might want to purchase.
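
For what it is worth, LangChain does ship building blocks for that text-to-SQL approach. The sketch below is only illustrative and assumes the ADB can be reached through SQLAlchemy's oracle+oracledb dialect; the connection string is a placeholder, and wallet handling is not shown, so it has not been tested against the setup above.

# Illustrative sketch only: have the LLM write the SQL, then execute it.
# The connection URI is a placeholder; wallet-based ADB connections need
# additional SQLAlchemy/oracledb configuration not shown here.
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain

db = SQLDatabase.from_uri("oracle+oracledb://mydbuser:mydbuserpassword@myadb_low")

write_query = create_sql_query_chain(llm, db)
generated_sql = write_query.invoke({"question": "How many cars were manufactured in 2013?"})
print(db.run(generated_sql))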

A good time to remember not to "GenAI everything"!

Nevertheless, this was a great learning opportunity, and a chance to understand how to work with the new ADB document loader. I hope you have benefited too. Thanks for reading!

Credits