A complete guide to big data analysis with Apache Hadoop (HDFS) and PySpark in Python, applied to game reviews from the Steam gaming platform.
With over 100 zettabytes (1 zettabyte = 10¹² GB) of data produced around the world every year, the ability to handle big data is one of the most sought-after skills today. Data analysis, in this context, is the ability to process these never-ending, exponentially growing datasets and derive insights from them. Apache Hadoop and Apache Spark are two of the basic tools that help us untangle the possibilities hidden in large datasets. Apache Hadoop streamlines data storage and distributed computing with its Hadoop Distributed File System (HDFS) and MapReduce-based parallel processing. Apache Spark is a big data analytics engine capable of EDA, SQL analytics, streaming, machine learning, and graph processing, and it is compatible with the major programming languages through its APIs. Combined, they form an exceptional environment for dealing with big data with the available computational resources, just a personal computer in most cases!
Let us unfold the power of Big Data and Apache Hadoop with a simple analysis project implemented using Apache Spark in Python.
To begin with, let’s dive into the installation of the Hadoop Distributed File System and Apache Spark on macOS. I am using a MacBook Air with an M1 chip running macOS Sonoma.
Jump to the section —
Installing Hadoop Distributed File System
Installing Apache Spark
Steam Review Analysis using PySpark
What next?
1. Installing Hadoop Distributed File System
Thanks to Code With Arjun for the amazing article that helped me with the installation of Hadoop on my Mac. I seamlessly installed and ran Hadoop following his steps which I will show you here as well.
a. Installing HomeBrew
I use Homebrew to install applications on my Mac for convenience. It can be installed directly on the system with the command below:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Once it is installed, you can run the simple code below to verify the installation.
brew --version
Figure 1: Image by Author
However, you will likely encounter a "command not found" error. This is because Homebrew is installed in a different location (Figure 2) and is not executable from the current directory. For it to work, we add a path environment variable for brew, i.e., we add Homebrew to the .bash_profile.
Figure 2: Image by Author
You can avoid this step by using the full path to Homebrew in your commands; however, it can become a hassle at later stages, so it is not recommended!
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> /Users/rindhujajohnson/.bash_profile
eval "$(/opt/homebrew/bin/brew shellenv)"
Now, when you run brew --version, it should show the Homebrew version correctly.
b. Installing Hadoop
Disclaimer! Hadoop is a Java-based application and requires a Java Development Kit (JDK); versions 8 and 11 are supported. Install the JDK before continuing.
Thanks to Code With Arjun again for this video on JDK installation on MacBook M1.
Guide to Installing JDK
Now, we install Hadoop on our system using the brew command.
brew install hadoop
This command should install Hadoop seamlessly. Similar to the steps we followed while installing Homebrew, we need to edit the path environment variable for Java in the Hadoop configuration. The configuration files for the installed version of Hadoop live inside the Homebrew folder; you can use the which hadoop command to find the location of the Hadoop installation. Once you locate it, you can find the settings at the path below. The following command takes you to the folder that holds the configuration files (check the Hadoop version you installed to avoid errors).
cd /opt/homebrew/Cellar/hadoop/3.3.6/libexec/etc/hadoop
You can view the files in this folder using the ls command. We will edit the hadoop-env.sh to enable the proper running of Hadoop on the system.
Figure 3: Image by Author
Now, we have to find the Java path needed to edit the hadoop-env.sh file, using the following command.
/usr/libexec/java_home
Figure 4: Image by Author
We can open the hadoop-env.sh file in any text editor. I used the VI editor; you can use any editor for the purpose. Copy the path returned by the command above, e.g., /Library/Java/JavaVirtualMachines/adoptopenjdk-11.jdk/Contents/Home, and paste it after export JAVA_HOME= in the file.
Figure 5: hadoop-env.sh file opened in VI Text Editor
Next, we edit the four XML files in the Hadoop folder.
core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
With this, we have successfully completed the installation and configuration of HDFS on the local machine. To make the data on Hadoop accessible via remote login, go to Sharing in the General settings and enable Remote Login. You can edit user access by clicking on the info icon.
Figure 6: Enable Remote Access. Image by Author
Let’s run Hadoop!
Execute the following commands:
# Format the NameNode (needed only before the first run)
% hadoop namenode -format
# Start all the Hadoop daemons (HDFS and YARN)
% start-all.sh
# List the running Java processes to verify that the installation was successful
% jps
Figure 7: Initiating Hadoop and viewing the nodes and resources running. Image by Author
We are all set! Now let’s create a directory in HDFS and add the data we will be working on. But first, let’s quickly take a look at our data source and its details.
Data
The Steam Reviews Dataset 2021 (License: GPL 2) is a collection of reviews from about 21 million gamers covering over 300 different games in the year 2021. The data was extracted using Steam’s API, Steamworks, using the Get List function.
GET store.steampowered.com/appreviews/?json=1
The dataset consists of 23 columns and 21.7 million rows with a size of 8.17 GB (that is big!). The data consists of reviews in different languages and a boolean column that tells if the player recommends the game to other players. We will be focusing on how to handle this big data locally using HDFS and analyze it using Apache Spark in Python using the PySpark library.
c. Uploading Data into HDFS
Firstly, we create a directory in HDFS using the mkdir command. Adding a file directly to a non-existent folder would throw an error.
hadoop fs -mkdir /user
hadoop fs -mkdir /user/steam_analysis
Now, we will add the data file to the folder steam_analysis using the put command.
hadoop fs -put /Users/rindhujajohnson/local_file_path/steam_reviews.csv /user/steam_analysis
Apache Hadoop also provides a web user interface, available at http://localhost:9870/.
Figure 8: HDFS User Interface at localhost:9870. Image by Author
We can see the uploaded files as shown below.
Figure 10: Navigating files in HDFS. Image by Author
Once the data interaction is over, we can use the stop-all.sh command to stop all the Apache Hadoop daemons.
Let us move to the next step — Installing Apache Spark
2. Installing Apache Spark
Apache Hadoop takes care of data storage (HDFS) and parallel processing (MapReduce) of the data for faster execution. Apache Spark is a multi-language compatible analytical engine designed to deal with big data analysis. We will run the Apache Spark on Python in Jupyter IDE.
After installing and running HDFS, the installation of Apache Spark for Python is a piece of cake. PySpark is the Python API for Apache Spark and can be installed with pip from a Jupyter Notebook. PySpark exposes the Spark Core API along with its four components: Spark SQL, the Spark ML library, Spark Streaming, and GraphX. Moreover, we can access the Hadoop files through PySpark by initializing the installation with the required Hadoop version.
# Selects the Hadoop version PySpark is built against; version 3 is the default.
PYSPARK_HADOOP_VERSION=3 pip install pyspark
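Once the installation finishes, a quick check from Python confirms that PySpark is available (a minimal sanity check; the exact version string depends on your installation):
import pyspark
# Print the installed PySpark version
print(pyspark.__version__)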
Let’s get started with the Big Data Analytics!
3. Steam Review Analysis using PySpark
Steam is an online gaming platform that hosts over 30,000 games and serves over 100 million players across the world. Besides gaming, the platform allows players to write reviews for the games they play, a great resource for the platform to improve the customer experience and for gaming companies to keep players engaged. We use this review data, made publicly available by the platform on Kaggle.
3. a. Data Extraction from HDFS
We will use the PySpark library to access, clean, and analyze the data. To start, we connect the PySpark session to Hadoop using the localhost address.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
# Initializing the Spark Session
spark = SparkSession.builder.appName("SteamReviewAnalysis").master("yarn").getOrCreate()
# Providing the URL for accessing the HDFS
data = "hdfs://localhost:9000/user/steam_analysis/steam_reviews.csv"
# Extracting the CSV data in the form of a Schema
data_csv = spark.read.csv(data, inferSchema = True, header = True)
# Visualize the structure of the Schema
data_csv.printSchema()
# Counting the number of rows in the dataset
data_csv.count() # 40,848,659
3. b. Data Cleaning and Pre-Processing
We can start by taking a look at the dataset. Similar to the DataFrame.head() function in Pandas, PySpark has the DataFrame.show() function, which gives a glimpse of the dataset.
Before that, we will remove the review column from the dataset, as we do not plan on performing any NLP on it. Also, the reviews are in different languages, which makes any sentiment analysis based on the review text difficult.
# Dropping the review column and saving the data into a new variable
data = data_csv.drop("review")
# Displaying the data
data.show()
Figure 11: The Structure of the Schema
We have a huge dataset with 23 attributes, and many records contain NULL values across different attributes, so imputation is not practical here. Therefore, I have removed the records with NULL values. However, this is not generally the recommended approach: you can evaluate the importance of the available attributes, remove the irrelevant ones, and then try imputing the remaining NULL values.
# Drops all the records with NULL values
data = data.na.drop(how = "any")
# Count the number of records in the remaining dataset
data.count() # 16,876,852
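If you would rather check how many NULL values each column holds before deciding to drop rows, a per-column count is a quick way to do it (a sketch; it would be run on the data frame before the na.drop step above):
from pyspark.sql.functions import col, sum as spark_sum

# Count the NULL values in every column of the data frame
null_counts = data.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in data.columns])
null_counts.show()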
We still have almost 17 million records in the dataset!
Now, we focus on the variable names of the dataset, as shown in Figure 11. Some attribute names contain characters like a dot (.), which are not valid in Python identifiers, so we rename them. We also need to cast the date and time attributes to appropriate types, which we do in the code that follows.
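The type-casting code below assumes the dots in the column names have already been replaced with underscores. A minimal sketch of that renaming step (the dotted names are assumed to come straight from the CSV header):
# Replace every dot in the column names with an underscore
data = data.toDF(*[c.replace(".", "_") for c in data.columns])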
from pyspark.sql.types import *
from pyspark.sql.functions import from_unixtime
# Changing the data type of each column into an appropriate type
data = data.withColumn("app_id", data["app_id"].cast(IntegerType())).\
    withColumn("author_steamid", data["author_steamid"].cast(LongType())).\
    withColumn("recommended", data["recommended"].cast(BooleanType())).\
    withColumn("steam_purchase", data["steam_purchase"].cast(BooleanType())).\
    withColumn("author_num_games_owned", data["author_num_games_owned"].cast(IntegerType())).\
    withColumn("author_num_reviews", data["author_num_reviews"].cast(IntegerType())).\
    withColumn("author_playtime_forever", data["author_playtime_forever"].cast(FloatType())).\
    withColumn("author_playtime_at_review", data["author_playtime_at_review"].cast(FloatType()))
# Converting the time columns into the timestamp data type
data = data.withColumn("timestamp_created", from_unixtime(data["timestamp_created"]).cast(TimestampType())).\
    withColumn("author_last_played", from_unixtime(data["author_last_played"]).cast(TimestampType())).\
    withColumn("timestamp_updated", from_unixtime(data["timestamp_updated"]).cast(TimestampType()))
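To confirm that the casts took effect, we can print the schema again (a quick check on the same data variable):
# Verify that the columns now have the intended data types
data.printSchema()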
Figure 12: A glimpse of the Steam review Analysis dataset. Image by Author
The dataset is clean and ready for analysis!
3. c. Exploratory Data Analysis
The dataset is rich in information, with over 20 variables, and we can analyze the data from different perspectives. Therefore, we will split the data into several smaller PySpark data frames and cache them to run the analysis faster (see the caching sketch after the code below).
# Grouping the columns for each analysis
col_demo = ["app_id", "app_name", "review_id", "language", "author_steamid", "timestamp_created", "author_playtime_forever", "recommended"]
col_author = ["steam_purchase", "author_steamid", "author_num_games_owned", "author_num_reviews", "author_playtime_forever", "author_playtime_at_review", "author_last_played", "recommended"]
col_time = ["app_id", "app_name", "timestamp_created", "timestamp_updated", "author_playtime_at_review", "recommended"]
col_rev = ["app_id", "app_name", "language", "recommended"]
col_rec = ["app_id", "app_name", "recommended"]
# Creating new pyspark data frames using the grouped columns
data_demo = data.select(*col_demo)
data_author = data.select(*col_author)
data_time = data.select(*col_time)
data_rev = data.select(*col_rev)
data_rec = data.select(*col_rec)
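As mentioned above, caching these derived data frames keeps them in memory across the repeated aggregations that follow (a minimal sketch; Spark will still spill to disk if memory runs short):
# Cache the derived data frames so repeated queries do not re-read the source data
for df in [data_demo, data_author, data_time, data_rev, data_rec]:
    df.cache()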
i. Games Analysis
In this section, we will try to understand the review and recommendation patterns for different games. We will consider the number of reviews analogous to the popularity of the game and the number of True recommendations analogous to the gamer’s preference for the game.
Finding the Most Popular Games
# the data frame is grouped by game and the number of occurrences is counted
app_names = data_rec.groupBy("app_name").count()
# the data frame is ordered by the count to pick the 20 most-reviewed games
app_names_count = app_names.orderBy(app_names["count"].desc()).limit(20)
# a pandas data frame is created for plotting
app_counts = app_names_count.toPandas()
# A pie chart is created (matplotlib and seaborn are imported here, as they were not imported earlier)
import matplotlib.pyplot as plt
import seaborn as sns
fig = plt.figure(figsize = (10,5))
colors = sns.color_palette("muted")
explode = (0.1, 0.075, 0.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
plt.pie(x = app_counts["count"], labels = app_counts["app_name"], colors = colors, explode = explode, shadow = True)
plt.title("The Most Popular Games")
plt.show()
Finding the Most Recommended Games
# Pick the 20 games with the most True recommendations and convert them into a pandas data frame
true_counts = data_rec.filter(data_rec["recommended"] == "true").groupBy("app_name").count()
recommended = true_counts.orderBy(true_counts["count"].desc()).limit(20)
recommended_apps = recommended.toPandas()
# Get the total review counts for the games that appear in the highly recommended list
true_apps = list(recommended_apps["app_name"])
true_app_counts = data_rec.filter(data_rec["app_name"].isin(true_apps)).groupBy("app_name").count()
true_app_counts = true_app_counts.orderBy(true_app_counts["count"].desc())
true_app_counts = true_app_counts.toPandas()
# Evaluate the percentage of True recommendations for each top game and sort by it.
# Merging on app_name keeps the recommended counts and the total counts aligned per game.
merged = recommended_apps.merge(true_app_counts, on = "app_name", suffixes = ("_recommended", "_total"))
merged["recommend_perc"] = merged["count_recommended"] / merged["count_total"] * 100
recommended_apps = merged.sort_values(by = "recommend_perc", ascending = False)
# Build a pie chart to visualize
fig = plt.figure(figsize = (10,5))
colors = sns.color_palette("muted")
explode = (0.1, 0.075, 0.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
plt.pie(x = recommended_apps["recommend_perc"], labels = recommended_apps["app_name"], colors = colors, explode = explode, shadow = True)
plt.title("The Most Recommended Games")
plt.show()
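For reference, the same recommendation percentages can be computed directly in PySpark without converting to Pandas first, which scales better if you want more than the top 20 games (a sketch, assuming the data_rec frame and the boolean recommended column defined above):
from pyspark.sql.functions import col, count, sum as spark_sum

# Total reviews and True recommendations per game, then the percentage of True recommendations
rec_perc = (data_rec.groupBy("app_name")
            .agg(count("*").alias("total_reviews"),
                 spark_sum(col("recommended").cast("int")).alias("recommended_reviews"))
            .withColumn("recommend_perc", col("recommended_reviews") / col("total_reviews") * 100)
            .orderBy(col("total_reviews").desc())
            .limit(20))
rec_perc.show()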