It can be tricky to find and exclude participants who have missing data in an analysis of interest. Typically, you want to run an ANOVA with 2-3 factors and you receive the error message “One or more cells are missing data”. This happens when not all variables are experimentally controlled but instead depend on the participants' performance. For example, you want to see how many errors (true/false) participants make over the course of an experiment (blocks 1-10). If a participant made no errors in a given block, that cell is empty, and you will run into missing-data problems in follow-up ANOVAs or paired t-tests.
I found that an easy and very understandable way of approaching this is the following:
So let’s say we want to investigate whether the number of errors a participant made changed over the course of an experiment (which is divided into 2 blocks) with the ANOVA: Error X Block.
    S error block
    1 FALSE     1
    1  TRUE     1
    1 FALSE     1
    1 FALSE     2
    1  TRUE     2
    1 FALSE     2
    2 FALSE     1
    2  TRUE     1
    2 FALSE     1
    2 FALSE     2
    2  TRUE     2
    2 FALSE     2
    3 FALSE     1
    3  TRUE     1
    3 FALSE     1
    3 FALSE     2
    3 FALSE     2
    3 FALSE     2
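To follow along, the example table can be entered as a data frame. This is only a sketch: the name `data` is chosen to match the script in this post, and the way the columns are built with `rep()` is my own shorthand, not part of the original data file.

```r
# Rebuild the example table as a data frame called `data`
# (3 subjects, 2 blocks, 3 trials per block)
data <- data.frame(
  S     = rep(1:3, each = 6),
  error = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,   # subject 1
            FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,   # subject 2
            FALSE, TRUE, FALSE, FALSE, FALSE, FALSE), # subject 3: no errors in block 2
  block = rep(rep(1:2, each = 3), times = 3)
)

# Quick check of the structure
str(data)
```

In a real analysis you would of course read your own file instead (for example with `read.csv()`).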
First, we need to summarize the data, which we do using the ddply command.
    # First we need to aggregate the data
    library(plyr)
    agg <- ddply(data, ~ S + block + error, .fun=summarise, count = length(error))
We now get, for each subject, the count of errors and non-errors in each block.
    # Aggregated data
    S block error count
    1     1 FALSE     2
    1     1  TRUE     1
    1     2 FALSE     2
    1     2  TRUE     1
    2     1 FALSE     2
    2     1  TRUE     1
    2     2 FALSE     2
    2     2  TRUE     1
    3     1 FALSE     2
    3     1  TRUE     1
    3     2 FALSE     3
What you can easily spot is that there are only three rows of data for participant 3. This is because this participant never produced errors in block 2. In larger files, we might miss that. This is why we then count the rows of data (number of conditions) for each participant.
agg_conditions <- ddply(agg, ~ S, .fun=summarise, cond = length(count))
You will get the following output:
    # Number of conditions per subject
    S cond
    1    4
    2    4
    3    3
Next, instead of excluding participant 3 manually, we make a list of all subjects that have a lower value of cond (fewer conditions) than the maximum number of conditions.
exclude <- unique(agg_conditions$S[agg_conditions$cond < max(agg_conditions$cond)])
You can now print out exclude to see a nice list of every subject ID you excluded (because you will mention that in the article of course!). Further, we automatically exclude these subjects from the dataset by the following command.
agg <- subset(agg, !(agg$S %in% exclude))
This is it. Easy, wasn’t it? And totally works for high numbers of conditions and subjects (when the manual approach is way too much workload).
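One thing to keep in mind: comparing against max(cond) assumes that at least one participant has data in every cell. If you are not sure about that, a cross-tabulation is a possible alternative, because it lists empty cells explicitly as zeros. The following is a rough base-R sketch on the example data; the names `cells`, `incomplete`, `exclude_alt` and `complete_data` are made up for illustration.

```r
# Example data from this post, rebuilt so the sketch runs on its own
data <- data.frame(
  S     = rep(1:3, each = 6),
  block = rep(rep(1:2, each = 3), times = 3),
  error = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,
            FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,
            FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Cross-tabulate subject x block x error;
# combinations that never occur for a subject get a count of 0
cells <- xtabs(~ S + block + error, data = data)

# A subject is incomplete if any of their block x error cells is empty
incomplete  <- apply(cells == 0, 1, any)
exclude_alt <- names(incomplete)[incomplete]
print(exclude_alt)  # subject IDs as character strings; here "3"

# Drop the incomplete subjects from the raw data
complete_data <- data[!(as.character(data$S) %in% exclude_alt), ]
```

Note that the cross-tabulation only knows about blocks and error values that occur at least once somewhere in the data, so a condition that nobody ever produced would still go unnoticed.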
Below is the full script:
    # Full script: How to exclude participants that lack data in any one condition
    library(plyr)

    # Count trials per subject, block and error type
    agg <- ddply(data, ~ S + block + error, .fun=summarise, count = length(error))

    # Count the number of conditions (rows) per subject
    agg_conditions <- ddply(agg, ~ S, .fun=summarise, cond = length(count))

    # Subjects with fewer conditions than the maximum are excluded
    exclude <- unique(agg_conditions$S[agg_conditions$cond < max(agg_conditions$cond)])
    agg <- subset(agg, !(agg$S %in% exclude))