GAIN MORE SKILLS: 60% off the DATAFORDEV PRO course bundle with 16+ courses. Ends 31st July→

7 data quality issues and how to clean them in SPSS

March 25, 2021

Your ability to trust the results from your data largely depends on the quality of the data. It’s garbage-in, garbage-out! In this tutorial, I will demonstrate to you how to do data cleaning in SPSS, from removing irrelevant cases, to converting data types, detecting and removing duplicates, fixing structural issues like typos, generating value sets from text variables, fixing outliers and dealing with missing values.

Video transcript

You probably have heard the term garbage in garbage out. Well, that’s the truth about working with data.

If you use garbage data, you get garbage results. And no one wants to make decisions based on trashy data.

This is why data cleaning is an extremely important step in data analytics.

Data quality issues are very common in data that has been collected through surveys, or imported from other formats for example databases or Microsoft Excel worksheets.

In this video, I will point you to data quality issues you need to look out for, and how you can fix them using SPSS.

First, what is data cleaning?

Well, data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

With that definition, you should have an idea already about what is involved in data cleaning.

That’s what I am going to be showing you in this video

Starting with data quality issue number 1: Irrelevant data

Data can be irrelevant if it is not of interest to the analysis you are trying to do.

So this could be either irrelevant variables or irrelevant cases.

I have here this data from a fictitious survey I conducted using a data collection platform called KoBo Toolbox.

Already you can see that we have variables that were automatically generated by the platform for example the start and end time of the interview here.

If you scroll to the end

You will also notice that w have more of these for example the id, uuid, submission time, validation status and so on

Dealing with irrelevant variables is simple. Just delete them

I will scroll to the beginning

Right-click on the variable name

And select clear

That’s it!

Lets do the same with the end time variable as well

So right-click

Clear

We can also delete multiple variables at the same time

Let’s select from @iD to @tags

There are several was you can do that.

But the easiest and most consistent way is to click the first variable which in this case is @ID

The press the control key on the keyboard and hold it

While the control key is still pressed, go ahead and click all the other columns

Then right-click

And select clear

I left the index variable deliberately because we may need it to uniquely identify each case in the data set

While we are talking about irrelevant data, let’s look at how we may remove irrelevant cases

Let’s suppose that we are only interested in analyzing data for those who completed tertiary education

There are multiple ways to do this, but the easiest is to simply sort the data by the said variable, and delete the rest of the cases that are not of interest to your analysis.

I must say though that it’s wise to backup the data before you delete anything from your dataset.

So in this case, we will right-click the What is your highest level of education column,

Choose sort ascending

Great

Now quickly click and drag from case number 1 to case number 10

Now right-click

Select clear

Now in this case, the variable what is your highest education level is useless wince we only have 1 value. So let’s go ahead and delete it

Data quality issue number 2 is incorrect data formats

It is very common for data tat was imported from other data software to be improperly formatted in SPSS.

The most common one is having numeric values being designated as string or text variables.

This causes problems when you are trying to use such variable to create charts or in analyses that require numeric variables to be run.

Switching between data formats in SPSS is pretty straight-forward

First, switch to the variable view

Here I will widen the label column a bit so that it’s easy to identify the variables

The variable how old are you is obviously numeric but here its being treated as string

So we click on the variable type here

Then click this little button here

Select numeric

Make sure the width and the decimals are correct here

The width is just the maximum number of digits to store. For age, 2 is fine and we don’t have any decimals

Click OK

Let’s do the same thing with average monthly earning

Click the variable type there

Click the button, and

select numeric

click OK

For body weight we need 1 decimal place…

We will do the same thing for height

That was perfect. Let’s now jump to data quality issue number 3: duplicates

The best treatment for duplicates is o just remove them altogether

But first, we have to check whether we have duplicate data or not

We can do this by checking values that are supposed to be unique across the records in the data, or unique identifiers such as emails, phone numbers or IDs generated or used when data was being collected.

We are in luck because we actually have the index variable which has unique values for each of the cases

In SPSS, there is a special functionality for identifying duplicates

On the menu at the top, click data,

Identify duplicate cases

We need to drag and drop the variable we are using to uniquely identify the cases into the box at the top. If you have a combination of variables that must be unique, you can drag them all into the box

Here I will use the index variable

Here it’s showing us that it will create a new variable called PrimaryLast. By default, it will treat the last case from a list of duplicate cases as the primary case. This means that if you have 3 cases with exactly the same index value, the first 2 cases will be regarded as duplicates and the last one being the one to keep.

You can change that by choosing First case in each group is primary

And here it says a value of zero in the Primary last variable will mean that’s a duplicate and we can delete it

That’s fine let’s click OK

The easiest way to now remove the duplicates s to sort the data by the new variable Primary last

I will sort ascending

Then we can delete all the cases with 0 on Primary Last, which if you remember means these are duplicates

We don’t need to PrimaryLast variable, so it will also have to go

Let’s go to data quality issue number 4, structural issues

Structural issues in data include typos, spelling errors, inconsistent capitalization and extra spaces.

These issues are very hard to detect by just looking at the data itself.

The best way to find out if your data has any structural issues is to summarize it

For example, I suspect that the variable are you male or female has some issues

So step number 1 will be summarizing it using frequencies. This will show us a list of all the unique values in the variable

The easiest way to do frequencies on string variables is right-click n the variable name here

Select descriptive statistics

As you can see here, we have 2 entries for the value female

It’s clear the reason we have that issue is because for some 3 cases, we types female with a lowercase f instead of an upper case f

There are multiple ways you can fix those issues. For example, you can sort the variable ascending. That will bring all values that have the same issues together and you can manually fix them there.

But perhaps the fastest way to fix structural issues to use the Find and Replace functionality

First, make sure the variable is selected here

Now, Click Edit on the menu

Click replace

So here it’s showing us we will be making edits in the are you male or female variable. Great

In the find box we will type female, with a lowercase letter f

In the Replace with box, we will type Female now with capital F

Here we will check the option to match the case

In certain cases, you may need to turn on some options here

For example you may need to ensure that the string you are trying to find… matches all the contents of the cell… to avoid replacing values that may start with the word female …but then ends in another way

This is fine, s click replace all

Now all the 3 instances we found have been replaced

Click ok

And close this window

In software like Excel, you can analyze categorical variables that are stored as string variables without a problem. But in SPSS, there are certain statistical procedures that do not work with string variables.

In this case, you need to convert the string variables into categorical variables with numeric codes.

That’s data quality issue number 5

For example on the variable are you male or female, we will assign a numeric value say 1 to Male and 2 to Female

The functionality that does this in SPSS is called Automatic recode

Automatic recode converts string variables into numeric variables by assigning numeric values to each of the unique values in the string variable.

Click Transform

Automatic recode

We will drop both are you male or female and marital status variables on the right hand side

When we ran this procedure, new variables will be created.

So we need to specify variable names for those new variables, one by one

Select marital status

Let’s set the new name here are mstatus

Then click add new name

Then click the second variable

Lets call it sex

Click Add new name

Below here we are prompted to choose whether recoding will start with the lowest value. In our case, it’s going to recode alphabetically. That’s fine we will leave it like that

Let’s click OK

Minimize the output

Now as you can see, we have 2 new variables at the end

Let’s click the value labels button at the top so that instead of seeing ones and twos , we should seen the actual values instead

Perfect

So now we can clear the 2 string variables

Data quality issue number 6, dealing with outliers

Outliers are values that are extremely lower or higher than the majority of the data.

Outliers may be invalid data for example finding 250 as a value of age.

However, they can also be valid data for example having a household earning 100,000 dollars per month when the majority of the households in the data set only get up to 10,000 dollars.

The problem with having outliers in the data set is that they tend to skew the results towards their direction. This will paint a false picture that the majority of the data is either on the much higher side or much lower side depending on the size of the outlier.

The first thing is of course to detect whether we have ay outliers in the data set

You can do this in several ways, the easiest of the being to just sort the column where you think you may have outliers

For example, let’s sort the heights, this time descending so that the largest values are at the top

There are some obvious outliers that point to invalid data here.

There are many ways you can fix these outliers.

The first way would be to find out if at all these were simply mistyped information. If you take a look at the 172 and 170 here, you will probably realize that these are in fact 1.72 and 1.7. So let’s just correct that by typing the correct values

Then you have these other values that are really just too extreme.

You can either delete the whole case or just delete the values that are clearly outliers here

But also, you can simply filter out the values when you are conducting the specific analysis that involves this variable.

I will cover more about filtering in a later video on the channel so make sure you have subscribed to the channel.

In some cases, I’s not very easy to see that you have outliers. In this case, the best way to detect them is to create a box plot

But before you can create a box plot, you must assign the correct measurement type for the variables.

I have talked in length about measurement types in the video exploring relationships between variables. The link is right at the corner but also in the description below.

Lets switch to the variable view

Variables how old are you, body weight and height are all scale variables.

So I will assign them as such

Now let’s create a bar chart for the variable body weight

Go to Graphs

Chart builder

On the chart gallery at the bottom, select the Boxplot category

Grab the first boxplot type – the simple box plot – and drop it on the canvas

Then grab the body weight variable onto the y axis

Click OK

This chart is showing us that cases number 83 and 88 are outliers

Let’s go to the data set

Scrolling down to case 83

You will notice that indeed cases 83 and 88 have crossed the 100 mark.

These may well be valid values but they are only 2 isolated cases that are extremely higher than the majority of the cases in the data set.

So how do you treat them? We can either filter them out when running some analysis, or delete the cases completely from the data set

Finally, data quality issue number 7 is missing values

Missing values can cause biased results from your analyses.

Unfortunately, there is no one best method for fixing missing data.

You have the choice of deleting all observations that have missing data, especially on variables that are most important to your analysis. Unfortunately this means that you may lose a lot of data in the process.

The other option is to fill in the missing values using either statistically calculated values such as the average within the field or patterns based on other observations in your data set.

The downside to this approach is that you risk losing the integrity of the data since you are using data that you are just assuming and not exact observations.

The other option is to leave the missing values in the data set and just find a way to report your output independent of them.

Your choice of the way to deal with missing values largely depends on your individual use case.

Let’s say that you choose to impute missing values using the average

I will use body weight as an example

First, we need to calculate the average of body weight

That’s easy

Just right-click the variable and select descriptive statistics…

Great. So the average is 72.27

Then, you can simple sort the data in body weight ascending to show the missing values at the top

We see that the first case at the top is missing

Just type our mean value 72.27 on that spot

That’s sorted

So that’s it on how to clean data using SPSS.

If you found the video valuable then you will love the other great videos I will be publishing very soon and you don’t want to miss out. So go ahead and click the subscribe button.

Learn SPSS, STATA, KoboToolbox, Power BI and more

Join DATAFORDEV for self-paced courses, live online workshops, and a private community forum for social impact professionals.

Join now

7 data quality issues and how to clean them in SPSS

Video transcript

Learn SPSS, STATA, KoboToolbox, Power BI and more

Company

Courses

Resources