Your ability to trust the results from your data largely depends on the quality of the data. It’s garbage-in, garbage-out! In this tutorial, I will demonstrate to you how to do data cleaning in SPSS, from removing irrelevant cases, to converting data types, detecting and removing duplicates, fixing structural issues like typos, generating value sets from text variables, fixing outliers and dealing with missing values.
You probably have heard the term garbage in garbage out. Well, that’s the truth about working with data.
If you use garbage data, you get garbage results. And no one wants to make decisions based on trashy data.
This is why data cleaning is an extremely important step in data analytics.
Data quality issues are very common in data that has been collected through surveys, or imported from other formats for example databases or Microsoft Excel worksheets.
In this video, I will point you to data quality issues you need to look out for, and how you can fix them using SPSS.
First, what is data cleaning?
Well, data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.
With that definition, you should have an idea already about what is involved in data cleaning.
That’s what I am going to be showing you in this video
Starting with data quality issue number 1: Irrelevant data
Data can be irrelevant if it is not of interest to the analysis you are trying to do.
So this could be either irrelevant variables or irrelevant cases.
I have here this data from a fictitious survey I conducted using a data collection platform called KoBo Toolbox.
Already you can see that we have variables that were automatically generated by the platform for example the start and end time of the interview here.
If you scroll to the end
You will also notice that w have more of these for example the id, uuid, submission time, validation status and so on
Dealing with irrelevant variables is simple. Just delete them
I will scroll to the beginning
Right-click on the variable name
And select clear
Lets do the same with the end time variable as well
We can also delete multiple variables at the same time
Let’s select from @iD to @tags
There are several was you can do that.
But the easiest and most consistent way is to click the first variable which in this case is @ID
The press the control key on the keyboard and hold it
While the control key is still pressed, go ahead and click all the other columns
And select clear
I left the index variable deliberately because we may need it to uniquely identify each case in the data set
While we are talking about irrelevant data, let’s look at how we may remove irrelevant cases
Let’s suppose that we are only interested in analyzing data for those who completed tertiary education
There are multiple ways to do this, but the easiest is to simply sort the data by the said variable, and delete the rest of the cases that are not of interest to your analysis.
I must say though that it’s wise to backup the data before you delete anything from your dataset.
So in this case, we will right-click the What is your highest level of education column,
Choose sort ascending
Now quickly click and drag from case number 1 to case number 10
Now in this case, the variable what is your highest education level is useless wince we only have 1 value. So let’s go ahead and delete it
Data quality issue number 2 is incorrect data formats
It is very common for data tat was imported from other data software to be improperly formatted in SPSS.
The most common one is having numeric values being designated as string or text variables.
This causes problems when you are trying to use such variable to create charts or in analyses that require numeric variables to be run.
Switching between data formats in SPSS is pretty straight-forward
First, switch to the variable view
Here I will widen the label column a bit so that it’s easy to identify the variables
The variable how old are you is obviously numeric but here its being treated as string
So we click on the variable type here
Then click this little button here
Make sure the width and the decimals are correct here
The width is just the maximum number of digits to store. For age, 2 is fine and we don’t have any decimals
Let’s do the same thing with average monthly earning
Click the variable type there
Click the button, and
For body weight we need 1 decimal place…
We will do the same thing for height
That was perfect. Let’s now jump to data quality issue number 3: duplicates
The best treatment for duplicates is o just remove them altogether
But first, we have to check whether we have duplicate data or not
We can do this by checking values that are supposed to be unique across the records in the data, or unique identifiers such as emails, phone numbers or IDs generated or used when data was being collected.
We are in luck because we actually have the index variable which has unique values for each of the cases
In SPSS, there is a special functionality for identifying duplicates
On the menu at the top, click data,
Identify duplicate cases
We need to drag and drop the variable we are using to uniquely identify the cases into the box at the top. If you have a combination of variables that must be unique, you can drag them all into the box
Here I will use the index variable
Here it’s showing us that it will create a new variable called PrimaryLast. By default, it will treat the last case from a list of duplicate cases as the primary case. This means that if you have 3 cases with exactly the same index value, the first 2 cases will be regarded as duplicates and the last one being the one to keep.
You can change that by choosing First case in each group is primary
And here it says a value of zero in the Primary last variable will mean that’s a duplicate and we can delete it
That’s fine let’s click OK
The easiest way to now remove the duplicates s to sort the data by the new variable Primary last
I will sort ascending
Then we can delete all the cases with 0 on Primary Last, which if you remember means these are duplicates
We don’t need to PrimaryLast variable, so it will also have to go
Let’s go to data quality issue number 4, structural issues
Structural issues in data include typos, spelling errors, inconsistent capitalization and extra spaces.
These issues are very hard to detect by just looking at the data itself.
The best way to find out if your data has any structural issues is to summarize it
For example, I suspect that the variable are you male or female has some issues
So step number 1 will be summarizing it using frequencies. This will show us a list of all the unique values in the variable
The easiest way to do frequencies on string variables is right-click n the variable name here
Select descriptive statistics
As you can see here, we have 2 entries for the value female
It’s clear the reason we have that issue is because for some 3 cases, we types female with a lowercase f instead of an upper case f
There are multiple ways you can fix those issues. For example, you can sort the variable ascending. That will bring all values that have the same issues together and you can manually fix them there.
But perhaps the fastest way to fix structural issues to use the Find and Replace functionality
First, make sure the variable is selected here
Now, Click Edit on the menu
So here it’s showing us we will be making edits in the are you male or female variable. Great
In the find box we will type female, with a lowercase letter f
In the Replace with box, we will type Female now with capital F
Here we will check the option to match the case
In certain cases, you may need to turn on some options here
For example you may need to ensure that the string you are trying to find… matches all the contents of the cell… to avoid replacing values that may start with the word female …but then ends in another way
This is fine, s click replace all
Now all the 3 instances we found have been replaced
And close this window
In software like Excel, you can analyze categorical variables that are stored as string variables without a problem. But in SPSS, there are certain statistical procedures that do not work with string variables.
In this case, you need to convert the string variables into categorical variables with numeric codes.
That’s data quality issue number 5
For example on the variable are you male or female, we will assign a numeric value say 1 to Male and 2 to Female
The functionality that does this in SPSS is called Automatic recode
Automatic recode converts string variables into numeric variables by assigning numeric values to each of the unique values in the string variable.
We will drop both are you male or female and marital status variables on the right hand side
When we ran this procedure, new variables will be created.
So we need to specify variable names for those new variables, one by one
Select marital status
Let’s set the new name here are mstatus
Then click add new name
Then click the second variable
Lets call it sex
Click Add new name
Below here we are prompted to choose whether recoding will start with the lowest value. In our case, it’s going to recode alphabetically. That’s fine we will leave it like that
Let’s click OK
Minimize the output
Now as you can see, we have 2 new variables at the end
Let’s click the value labels button at the top so that instead of seeing ones and twos , we should seen the actual values instead
So now we can clear the 2 string variables
Data quality issue number 6, dealing with outliers
Outliers are values that are extremely lower or higher than the majority of the data.
Outliers may be invalid data for example finding 250 as a value of age.
However, they can also be valid data for example having a household earning 100,000 dollars per month when the majority of the households in the data set only get up to 10,000 dollars.
The problem with having outliers in the data set is that they tend to skew the results towards their direction. This will paint a false picture that the majority of the data is either on the much higher side or much lower side depending on the size of the outlier.
The first thing is of course to detect whether we have ay outliers in the data set
You can do this in several ways, the easiest of the being to just sort the column where you think you may have outliers
For example, let’s sort the heights, this time descending so that the largest values are at the top
There are some obvious outliers that point to invalid data here.
There are many ways you can fix these outliers.
The first way would be to find out if at all these were simply mistyped information. If you take a look at the 172 and 170 here, you will probably realize that these are in fact 1.72 and 1.7. So let’s just correct that by typing the correct values
Then you have these other values that are really just too extreme.
You can either delete the whole case or just delete the values that are clearly outliers here
But also, you can simply filter out the values when you are conducting the specific analysis that involves this variable.
I will cover more about filtering in a later video on the channel so make sure you have subscribed to the channel.
In some cases, I’s not very easy to see that you have outliers. In this case, the best way to detect them is to create a box plot
But before you can create a box plot, you must assign the correct measurement type for the variables.
I have talked in length about measurement types in the video exploring relationships between variables. The link is right at the corner but also in the description below.
Lets switch to the variable view
Variables how old are you, body weight and height are all scale variables.
So I will assign them as such
Now let’s create a bar chart for the variable body weight
Go to Graphs
On the chart gallery at the bottom, select the Boxplot category
Grab the first boxplot type – the simple box plot – and drop it on the canvas
Then grab the body weight variable onto the y axis
This chart is showing us that cases number 83 and 88 are outliers
Let’s go to the data set
Scrolling down to case 83
You will notice that indeed cases 83 and 88 have crossed the 100 mark.
These may well be valid values but they are only 2 isolated cases that are extremely higher than the majority of the cases in the data set.
So how do you treat them? We can either filter them out when running some analysis, or delete the cases completely from the data set
Finally, data quality issue number 7 is missing values
Missing values can cause biased results from your analyses.
Unfortunately, there is no one best method for fixing missing data.
You have the choice of deleting all observations that have missing data, especially on variables that are most important to your analysis. Unfortunately this means that you may lose a lot of data in the process.
The other option is to fill in the missing values using either statistically calculated values such as the average within the field or patterns based on other observations in your data set.
The downside to this approach is that you risk losing the integrity of the data since you are using data that you are just assuming and not exact observations.
The other option is to leave the missing values in the data set and just find a way to report your output independent of them.
Your choice of the way to deal with missing values largely depends on your individual use case.
Let’s say that you choose to impute missing values using the average
I will use body weight as an example
First, we need to calculate the average of body weight
Just right-click the variable and select descriptive statistics…
Great. So the average is 72.27
Then, you can simple sort the data in body weight ascending to show the missing values at the top
We see that the first case at the top is missing
Just type our mean value 72.27 on that spot
So that’s it on how to clean data using SPSS.
If you found the video valuable then you will love the other great videos I will be publishing very soon and you don’t want to miss out. So go ahead and click the subscribe button.