In many situations, data arrives in a format that is not ready for exploratory data analysis or for the statistical method you want to use. Base R's stack() function and the melt() function from the reshape2 package let you reshape the data inside R, so you don't have to hack it around in a spreadsheet before import.
1. stack() function
For example, suppose we have a data frame read directly from a *.csv file and we'd like to run an ANOVA on it.
data
A B C D
1 1.97 1.38 1.87 1.77
2 0.85 1.86 1.90 1.68
3 1.79 2.26 2.43 1.46
4 2.30 1.99 1.32 1.53
5 1.71 1.32 2.06 1.36
6 2.66 2.11 1.04 1.65
7 2.49 2.54 1.99 2.12
8 2.37 2.06 1.52 1.73
The value of each cell is the NEC status of one observation in one of the 4 groups (A, B, C and D). You cannot run aov(data) directly. If you do, here is what you get:
Error in terms.default(formula, "Error") :
no terms component nor attribute
This is because aov() needs a formula, which here is NEC~groups. Clearly, there is no NEC column in this data frame so far.
Since the data frame is not usable in this wide format, we need to "stack" it into two columns: one holding the NEC values, the other holding the group labels.
data1<-stack(data)
> data1
values ind
1 1.97 A
2 0.85 A
3 1.79 A
4 2.30 A
5 1.71 A
6 2.66 A
7 2.49 A
8 2.37 A
9 1.81 A
10 2.51 A
11 2.38 A……
This long format is ready for further analysis.
> names(data1) <- c("NEC_status", "Group")
> anova_data1<-aov(NEC_status~Group, data=data1)
> summary(anova_data1)
Df Sum Sq Mean Sq F value Pr(>F)
Group 3 2.638 0.8792 4.944 0.00255 **
Residuals 174 30.942 0.1778
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
90 observations deleted due to missingness
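The steps above can be put together as one runnable sketch. It rebuilds only the eight rows printed earlier (the original file evidently has more rows, some of them missing), so the ANOVA table it prints will not match the one shown above.

```r
# Rebuild the eight rows shown in the post (assumption: the real file is longer).
data <- data.frame(
  A = c(1.97, 0.85, 1.79, 2.30, 1.71, 2.66, 2.49, 2.37),
  B = c(1.38, 1.86, 2.26, 1.99, 1.32, 2.11, 2.54, 2.06),
  C = c(1.87, 1.90, 2.43, 1.32, 2.06, 1.04, 1.99, 1.52),
  D = c(1.77, 1.68, 1.46, 1.53, 1.36, 1.65, 2.12, 1.73)
)

# Stack the four columns into long format: a "values" column and an "ind" column.
data1 <- stack(data)
names(data1) <- c("NEC_status", "Group")

# One-way ANOVA of NEC status by group; rows containing NA are dropped by default.
anova_data1 <- aov(NEC_status ~ Group, data = data1)
summary(anova_data1)
```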
There is also an unstack() function, which reverses the stacking.
According to the R documentation, stack() and unstack() are used to "Stack or Unstack Vectors from a Data Frame or List". The select argument of stack() lets you specify which columns you want to stack.
Examples
require(stats)
formula(PlantGrowth) # check the default formula
pg <- unstack(PlantGrowth) # unstack according to this formula
pg
stack(pg) # now put it back together
stack(pg, select = -ctrl) # omitting one vector
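The same calls can be tried on the NEC-style data from earlier. This is a small sketch that assumes only the eight rows printed in this post; the object names ab, no_d and back are made up for illustration.

```r
# Eight rows as printed earlier in the post (assumption: the real file is longer).
data <- data.frame(
  A = c(1.97, 0.85, 1.79, 2.30, 1.71, 2.66, 2.49, 2.37),
  B = c(1.38, 1.86, 2.26, 1.99, 1.32, 2.11, 2.54, 2.06),
  C = c(1.87, 1.90, 2.43, 1.32, 2.06, 1.04, 1.99, 1.52),
  D = c(1.77, 1.68, 1.46, 1.53, 1.36, 1.65, 2.12, 1.73)
)

# Stack only columns A and B.
ab <- stack(data, select = c(A, B))

# Stack every column except D.
no_d <- stack(data, select = -D)

# unstack() reverses stack(); with no formula given it uses values ~ ind.
back <- unstack(stack(data))

nrow(ab)    # 2 columns x 8 rows
nrow(no_d)  # 3 columns x 8 rows
head(back)
```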
stack() converts the data into exactly two columns, values and ind. But what if we need more than that?
For example, what if the data frame looks like this:
data
task GroupA GroupB
1 1.97 1.38
2 0.85 1.86
3 1.79 2.26
4 2.30 1.99
5 1.71 1.32
6 2.66 2.11
7 2.49 2.54
8 2.37 2.06
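Suppose you need a bar plot of NEC status against task, with the two groups side by side. stack() is not enough here, because the task column has to be kept next to each stacked value.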
2. melt() function
The melt() function from the reshape2 package takes data in wide format and stacks a set of columns into a single column. To use it, we specify a data frame, the id variables (which are kept as they are and repeated for each stacked column) and the measured variables (the columns to be stacked). If the measured variables are not given, melt() uses every column that is not an id variable.
> melt(data, id.vars = "task")
task variable value
1 1 GroupA 1.97
2 2 GroupA 0.85
3 3 GroupA 1.79
4 4 GroupA 2.30
5 5 GroupA 1.71
6 6 GroupA 2.66
7 7 GroupA 2.49
8 8 GroupA 2.37
9 1 GroupB 1.38
10 2 GroupB 1.86
11 3 GroupB 2.26
12 4 GroupB 1.99
13 5 GroupB 1.32
14 6 GroupB 2.11
15 7 GroupB 2.54
16 8 GroupB 2.06
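If the default column names are not descriptive enough, melt() also lets you name the measured columns and the output columns explicitly. The sketch below assumes the reshape2 package is installed; the names long, Group and NEC_status are chosen here for illustration. The rest of this post keeps the default names variable and value.

```r
library(reshape2)  # melt() comes from reshape2, not from base R

# The two-group data frame shown above.
data <- data.frame(
  task   = 1:8,
  GroupA = c(1.97, 0.85, 1.79, 2.30, 1.71, 2.66, 2.49, 2.37),
  GroupB = c(1.38, 1.86, 2.26, 1.99, 1.32, 2.11, 2.54, 2.06)
)

# Keep "task" as the id column and stack GroupA and GroupB,
# naming the two new columns instead of using "variable" and "value".
long <- melt(
  data,
  id.vars       = "task",
  measure.vars  = c("GroupA", "GroupB"),
  variable.name = "Group",
  value.name    = "NEC_status"
)

str(long)
```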
You can then use the "variable" column to group the data and pass the melted data frame straight to ggplot() to draw the bar plot for the two groups.
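A minimal sketch of that plot follows, assuming the reshape2 and ggplot2 packages are installed. The axis labels and the choice of geom_col() with position = "dodge" are my own way of getting side-by-side bars, not the only one.

```r
library(reshape2)
library(ggplot2)

# The two-group data frame shown above.
data <- data.frame(
  task   = 1:8,
  GroupA = c(1.97, 0.85, 1.79, 2.30, 1.71, 2.66, 2.49, 2.37),
  GroupB = c(1.38, 1.86, 2.26, 1.99, 1.32, 2.11, 2.54, 2.06)
)

# Long format with the default column names "variable" and "value".
long <- melt(data, id.vars = "task")

# One bar per task and group; "dodge" places the two groups side by side.
p <- ggplot(long, aes(x = factor(task), y = value, fill = variable)) +
  geom_col(position = "dodge") +
  labs(x = "Task", y = "NEC status", fill = "Group")

print(p)
```

Treating task as a factor keeps the x axis discrete, so each task gets its own pair of bars.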
Easy peasy. 🙂