If you work with data in Python, you have most likely used Pandas for data manipulation.
For those of you who don’t know, Pandas is an open-source Python package developed specifically for analyzing and manipulating data. It is one of the most commonly used packages and is often learned when starting to study data science in Python.
So what is Pandas AI? You’re probably reading this article because you want to know what it is.
As we know, we are in an era where generative AI is everywhere. If we could use generative AI to perform analysis on our data, things would become much easier.
This is what Pandas AI brings to the table: simple prompts let you quickly analyze and manipulate datasets without moving the data into a separate tool.
In this article, you will learn how to leverage Pandas AI for your data analysis tasks. We will cover:
- Setting up Pandas AI
- Data Exploration with Pandas AI
- Data Visualization with Pandas AI
- Advanced Use of Pandas AI
If you’re ready to learn, let’s get started!
Setting up Pandas AI
Pandas AI is a Python package that brings Large Language Model (LLM) capabilities to the Pandas API. It lets you use the standard Pandas API with generative AI extensions that turn Pandas into a conversational tool.
The main reason to use Pandas AI is the simplicity it provides: the package can analyze data from simple prompts, without the need for complex code.
Enough with the introduction, let’s try it out.
First, you need to install the package, which is published on PyPI as pandasai (for example, with pip install pandasai).
Next, we need to configure the LLM to use for Pandas AI. There are several options, including OpenAI GPT and HuggingFace. However, we will use OpenAI GPT in this tutorial.
Setting up an OpenAI model in Pandas AI is easy, but you will need an OpenAI API key. If you don’t have one, you can get one on their website.
Once you are ready, let’s set up the LLM for Pandas AI using the code below.
from pandasai.llm import OpenAI

# Replace the placeholder with your own OpenAI API key
llm = OpenAI(api_token="OpenAI API key")
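Hardcoding the key in a script makes it easy to leak by accident. As a minimal sketch, the helper below reads the key from an environment variable instead. The helper name get_api_key and the variable name OPENAI_API_KEY are assumptions made for this example, not part of Pandas AI.

```python
import os


def get_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Both this helper and the default variable name are hypothetical;
    use whatever name your own environment defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage (assumes the variable is set in your shell):
# llm = OpenAI(api_token=get_api_key())
```

The key then stays out of the notebook or script, and the same code works unchanged on another machine that sets the variable.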
Now you are ready to use Pandas AI to do some data analysis.
Data Exploration with Pandas AI
Let’s try out some data exploration with Pandas AI, starting with a sample dataset. In this example, we’ll use the Titanic data from the Seaborn package.
import seaborn as sns
from pandasai import SmartDataframe

data = sns.load_dataset("titanic")
df = SmartDataframe(data, config={"llm": llm})
To start using Pandas AI, pass the DataFrame and the LLM configuration into a SmartDataframe object, as above. You can then hold a conversation with the DataFrame.
Let’s try some easy questions.
response = df.chat("""Return the survived class as a percentage""")
response
The percentage of passengers who survived was 38.38%.
Pandas AI was able to come up with solutions from the prompts and answer our questions.
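Because the answer comes from an LLM, it can be worth cross-checking a figure like this with plain Pandas. The sketch below uses a small made-up DataFrame as a stand-in for the Titanic data, so its result will not match the 38.38% above; only the technique carries over.

```python
import pandas as pd

# Toy stand-in for the Titanic data: 1 = survived, 0 = did not survive.
toy = pd.DataFrame({"survived": [1, 0, 0, 1, 0, 0, 0, 1]})

# The mean of a 0/1 column is the share of ones, so this is the survival rate.
survival_pct = toy["survived"].mean() * 100
print(f"{survival_pct:.2f}%")
```

Running the same expression on the real dataset's survived column gives a number you can compare against the chat response.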
You can ask Pandas AI questions and get the answers back as a DataFrame object. For example, here are some prompts to analyze your data:
# Data summary
summary = df.chat("""Can I get a statistical summary of my dataset?""")
# Class proportions
surv_pclass_perc = df.chat("""Return the breakdown of survival rates by pclass""")
# Missing data
missing_data_perc = df.chat("""Return the percentage of missing data for each column""")
# Outlier data
outlier_fare_data = df.chat("""Please provide the data rows that contain outliers based on the fare column""")
Image courtesy of the author
From the image above, we can see that even if the prompt is very complex, Pandas AI can provide the information using a DataFrame object.
However, Pandas AI cannot handle overly complex calculations, because the package is limited by the LLM we pass to the SmartDataframe object. In the future, as LLM capabilities evolve, we believe that Pandas AI will be able to handle more detailed analysis.
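When a calculation matters, the same questions can be answered deterministically in plain Pandas. The sketch below computes the missing-data percentage per column and flags fare outliers with the common 1.5 × IQR rule. The toy values and the choice of the IQR rule are assumptions for this example; the LLM may pick a different outlier definition.

```python
import pandas as pd

# Made-up values standing in for two Titanic columns.
toy = pd.DataFrame({
    "fare": [7.25, 8.05, 7.90, 13.00, 8.50, 512.33, 9.00, 7.75],
    "age": [22.0, None, 26.0, 35.0, None, 36.0, 54.0, 2.0],
})

# Percentage of missing values in each column.
missing_pct = toy.isna().mean() * 100

# Rows whose fare lies more than 1.5 * IQR outside the quartiles.
q1, q3 = toy["fare"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["fare"] < q1 - 1.5 * iqr) | (toy["fare"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]
```

Here missing_pct reports 25% for age and 0% for fare, and outliers contains only the row with a fare of 512.33.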
Data Visualization with Pandas AI
Besides data exploration, Pandas AI can also perform data visualization: you provide a prompt, and Pandas AI returns the visualization output.
Let’s try a simple example.
response = df.chat("Please provide a visualization of the fare data distribution")
response
Image courtesy of the author
In the above example, we ask Pandas AI to visualize the distribution of the fare column. The output is a bar chart of the fare distribution in the dataset.
As with data exploration, you can request many kinds of data visualization. However, Pandas AI cannot handle more complex visualization processes.
Here are some other examples of data visualization using Pandas AI:
kde_plot = df.chat("""Please plot the kde distribution of age column and separate it with survival column""")
box_plot = df.chat("""Please return a box plot visualization of age column separated by gender""")
heat_map = df.chat("""Please give me a heat map plot to visualize correlation of numerical columns""")
count_plot = df.chat("""Please visualize categorical column gender and survival""")
Image courtesy of the author
The plots look good and are uncluttered, and we can keep asking Pandas AI for more details if we want.
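A generated chart is easier to trust when you can inspect the numbers behind it. As a rough sketch, the correlation heat map requested above is typically drawn from a matrix like the one computed below. The toy columns are invented for illustration, and fare is deliberately set to half of age so the expected correlation is obvious.

```python
import pandas as pd

# Toy data: fare is exactly age / 2, so the two columns correlate perfectly.
toy = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "fare": [11.0, 19.0, 13.0, 17.5],
    "sex": ["male", "female", "female", "male"],
})

# Keep the numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr)
```

The non-numeric sex column is dropped before the calculation, which is why a heat map of "numerical columns" shows fewer columns than the dataset has.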
Advanced Use of Pandas AI
You can use several built-in APIs in Pandas AI to enhance your Pandas AI experience.
Clearing the cache
By default, all prompts and results from Pandas AI objects are saved to a local directory to reduce processing time and the time required for Pandas AI to invoke the model.
However, this cache can sometimes make the results from Pandas AI irrelevant, because it takes past results into account. In that case, it is good practice to clear the cache. You can clear it using the following code:
import pandasai as pai

pai.clear_cache()
You can also turn off caching from the start.
df = SmartDataframe(data, config={"enable_cache": False, "llm": llm})
With this setting, no prompts or results are saved in the first place.
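To see why a cache can return outdated answers, consider the conceptual sketch below. It is a hypothetical illustration of prompt caching in general, not Pandas AI's actual implementation: the PromptCache class and the fake_model function are invented for this example.

```python
from typing import Callable, Dict, List


class PromptCache:
    """Hypothetical prompt cache; not how Pandas AI implements caching."""

    def __init__(self, ask_model: Callable[[str], str]) -> None:
        self._ask_model = ask_model
        self._store: Dict[str, str] = {}

    def chat(self, prompt: str) -> str:
        # Reuse the stored answer when this exact prompt was seen before.
        if prompt not in self._store:
            self._store[prompt] = self._ask_model(prompt)
        return self._store[prompt]

    def clear(self) -> None:
        self._store.clear()


calls: List[str] = []


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call; records how often it is invoked.
    calls.append(prompt)
    return f"answer #{len(calls)}"


cache = PromptCache(fake_model)
first = cache.chat("How many rows?")
second = cache.chat("How many rows?")  # served from the cache, no model call
cache.clear()
third = cache.chat("How many rows?")  # cache is empty, the model is called again
```

The second call returns the stored answer even if the underlying data has changed in the meantime, which is the situation clearing or disabling the cache avoids.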
Custom Head
You can pass a sample head DataFrame to Pandas AI. This is useful if you don’t want to share some private data with the LLM, or if you want to provide examples to Pandas AI.
To do that, you can use the following code:
import pandas as pd
from pandasai import SmartDataframe

# head df
head_df = data.sample(5)

df = SmartDataframe(data, config={"custom_head": head_df, "llm": llm})
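Note that a head built with data.sample(5) still contains five real rows. If the goal is to keep private values away from the LLM, one option is to write placeholder rows by hand with the same columns and dtypes. The sketch below assumes a made-up private table and shows two quick sanity checks before using the result as the custom head.

```python
import pandas as pd

# Stand-in for a private dataset (invented values).
private = pd.DataFrame({
    "name": ["Alice Tan", "Budi Santoso", "Chen Wei"],
    "salary": [5200, 6100, 4800],
})

# Hand-written placeholder rows with the same columns and dtypes.
head_df = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C"],
    "salary": [1000, 2000, 3000],
})

# Check that the schema matches and that no real name slipped through.
same_schema = (
    list(head_df.columns) == list(private.columns)
    and head_df.dtypes.tolist() == private.dtypes.tolist()
)
leaks_names = bool(head_df["name"].isin(private["name"]).any())
```

If same_schema is true and leaks_names is false, head_df can be passed as custom_head in place of the sampled rows.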
Pandas AI Skills and Agents
Pandas AI allows users to pass in an example function, called a skill, which the agent can decide to run. For example, the code below gives the agent two different DataFrames and passes in a sample plot function for the Pandas AI agent to run.
import pandas as pd
from pandasai import Agent
from pandasai.skills import skill

employees_data = {
    "EmployeeID": [1, 2, 3, 4, 5],
    "name": ["John", "Emma", "Liam", "Olivia", "William"],
    "department": ["HR", "Sales", "IT", "Marketing", "Finance"],
}
salaries_data = {
    "employee_id": [1, 2, 3, 4, 5],
    "salary": [5000, 6000, 4500, 7000, 5500],
}
employees_df = pd.DataFrame(employees_data)
salaries_df = pd.DataFrame(salaries_data)

# The function docstring gives the model more context about the usage of this skill
@skill
def plot_salaries(names: list[str], salaries: list[int]):
    """
    Display a bar graph with names on the x-axis and salaries on the y-axis.
    Args:
        names (list[str]): Employee names
        salaries (list[int]): Salaries
    """
    # plot the bars
    import matplotlib.pyplot as plt

    plt.bar(names, salaries)
    plt.xlabel("Employee name")
    plt.ylabel("Salary")
    plt.title("Employee salary")
    plt.xticks(rotation=45)

    # add the salary value above each bar
    for i, salary in enumerate(salaries):
        plt.text(i, salary + 1000, str(salary), ha="center", va="bottom")
    plt.show()

agent = Agent([employees_df, salaries_df], config={"llm": llm})
agent.add_skills(plot_salaries)
response = agent.chat("Plot employee salaries against their names")
The agent decides whether or not to use the function you registered with Pandas AI.
Combining skills and agents gives you more control over the results of your DataFrame analysis.
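The skill above expects names and salaries as two aligned lists, so the agent first has to combine the two DataFrames. How the agent does this is up to the LLM; the sketch below is only one plausible plain-Pandas equivalent, using shortened copies of the example data and a merge on the two differently named ID columns.

```python
import pandas as pd

employees_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "name": ["John", "Emma", "Liam"],
})
salaries_df = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "salary": [5000, 6000, 4500],
})

# Join the two tables on their ID columns, which have different names.
merged = employees_df.merge(
    salaries_df, left_on="EmployeeID", right_on="employee_id", how="inner"
)

# Extract the aligned lists that a function like plot_salaries would receive.
names = merged["name"].tolist()
salaries = merged["salary"].tolist()
```

Checking the merged table yourself is a simple way to confirm that the chart the agent produces pairs each name with the right salary.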
With Pandas AI, you can see how much easier your data analysis work can become. By leveraging the power of LLMs, you can limit the coding part of your data analysis work and instead focus on the work that matters.
In this article, we’ve covered how to set up Pandas AI, how to use it to perform data exploration and visualization, and some advanced usage. There’s a lot more you can do with this package, so check out the documentation for more information.
Cornellius Yudha Wijaya is a Data Science Assistant Manager and Data Writer. He works full time at Allianz Indonesia and loves sharing Python and Data tips through social and writing media. Cornellius writes about various topics related to AI and Machine Learning.