Julia is a programming language like Python and R. It combines the speed of low-level languages like C with the simplicity of Python, and it is becoming increasingly popular in data science. If you’re looking to expand your portfolio and learn a new language, Julia is a great place to start.
In this tutorial, you’ll learn how to set up Julia for data science, load data, perform data analysis, and visualize it. The tutorial is simple enough that students can get started with data analysis using Julia in under five minutes.
1. Setting up the environment
Download and install Julia from julialang.org. Next, set up Julia for Jupyter Notebook. Start a terminal (PowerShell) and type `julia` to launch the Julia REPL, then enter the following commands:

```julia
using Pkg
Pkg.add("IJulia")
```
Launch Jupyter Notebook and start a new notebook with Julia as the kernel. Create a new code cell and run the following commands to install the required data science packages:

```julia
using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")
Pkg.add("Plots")
Pkg.add("Chain")
```
2. Loading data
In this example, we use the Kaggle Online Sales dataset, which contains data about online sales transactions across different product categories.
Read the CSV file and convert it into a DataFrame, which works much like a Pandas DataFrame.
```julia
using CSV
using DataFrames

# Read a CSV file into a DataFrame
data = CSV.read("Online Sales Data.csv", DataFrame)
```
3. Exploring the data
To display the top 5 rows of a DataFrame, use the `first` function instead of `head`.
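For example, with the `data` DataFrame loaded above:

```julia
# Show the first 5 rows of the DataFrame
first(data, 5)
```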
To generate an overview of your data, use the `describe` function.
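For example, to summarize every column of the dataset:

```julia
# Column-wise summary: mean, min, max, missing counts, and element types
describe(data)
```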
Similar to a Pandas DataFrame, you can display specific values by specifying the row number and column name.
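For example, to look up a single value by row number and column name (row 10 here is just an arbitrary illustration):

```julia
# Value of the "Product Category" column in row 10
data[10, :"Product Category"]
```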
4. Data manipulation
To filter data based on a specific value, use the `filter` function, which takes a predicate over each row (a column name, a comparison, and a value) together with the DataFrame.
```julia
# Keep only rows where the unit price is greater than 230
filtered_data = filter(row -> row[:"Unit Price"] > 230, data)
last(filtered_data, 5)
```
You can also create new columns, much as you would in Pandas:
```julia
# Add a new column with revenue after a 10% tax
data[!, :"Total Revenue After Tax"] = data[!, :"Total Revenue"] .* 0.9
last(data, 5)
```
Now we calculate the mean of "Total Revenue After Tax" for each "Product Category".
```julia
using Statistics

# Group by product category and compute the mean after-tax revenue
grouped_data = groupby(data, :"Product Category")
aggregated_data = combine(grouped_data, :"Total Revenue After Tax" => mean)
last(aggregated_data, 5)
```
5. Visualization
Plotting in Julia feels similar to Seaborn. In this example, we draw a bar chart of the aggregated data we just created: specify the X and Y columns, then add a title and axis labels.
```julia
using Plots

# Bar chart of average after-tax revenue per product category
bar(aggregated_data[!, :"Product Category"],
    aggregated_data[!, :"Total Revenue After Tax_mean"],
    title="Product Analysis",
    xlabel="Product Category",
    ylabel="Average After-Tax Total Revenue")
```
Most of the average revenue comes from electronic products, and the chart makes that easy to see.
To generate a histogram, provide the X column along with a title and axis labels. It visualizes how many units are sold per transaction.
```julia
# Histogram of units sold per transaction
histogram(data[!, :"Units Sold"],
    title="Sales Volume Analysis",
    xlabel="Sales Volume",
    ylabel="Frequency")
```
It seems most people bought one or two items.
To save a visualization, use the `savefig` function.
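For example, to write the most recent plot to disk (the filename here is just an illustration):

```julia
# Save the current plot as a PNG file
savefig("sales_volume_analysis.png")
```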
6. Creating a Data Processing Pipeline
Creating a proper data pipeline is necessary to automate data processing workflows, ensure data consistency, and enable scalable and efficient data analysis.
Using the `Chain` package, we chain together the functions used earlier to calculate the average revenue per product category.
```julia
using Chain

# A simple data processing pipeline: filter, group, then aggregate
processed_data = @chain data begin
    filter(row -> row[:"Unit Price"] > 230, _)
    groupby(_, :"Product Category")
    combine(_, :"Total Revenue" => mean)
end

first(processed_data, 5)
```
To save the processed DataFrame as a CSV file, use the `CSV.write` function.
```julia
CSV.write("output.csv", processed_data)
```
Conclusion
In my opinion, Julia is simpler and faster than Python. Much of the syntax and functionality I am used to from Pandas, Seaborn, and Scikit-Learn has close equivalents in Julia. So why not learn a new language to set yourself apart? It can also help with research-related roles, as Julia is increasingly popular in scientific computing.
In this tutorial, you learned how to set up your Julia environment, load a dataset, perform powerful data analysis and visualization, and build a data pipeline for reproducibility and reliability. If you want to learn more about Julia for Data Science, let me know and I can write more short tutorials for you guys.
Abid Ali Awan (@1abidaliawan) is a Certified Data Scientist professional who loves building machine learning models. Currently, he focuses on content creation and technical blogging on Machine Learning and Data Science techniques. Abid holds a Masters in Technology Management and a Bachelors in Communication Engineering. His vision is to build AI products using Graph Neural Networks for students suffering from mental illness.