Useful functions for reshaping data in R: stack() and melt()

  In many situations, data arrives in a format that is not ready for exploratory data analysis or for the statistical method you want to use. R offers functions for reshaping it, so you don’t have to hack the data around in a spreadsheet before importing it: stack() ships with base R (in the utils package), and melt() comes from the reshape2 package.

  1. stack() function

For example, suppose we have a data frame read directly from a *.csv file and we’d like to run an ANOVA on it.

data
A B C D
1 1.97 1.38 1.87 1.77
2 0.85 1.86 1.90 1.68
3 1.79 2.26 2.43 1.46
4 2.30 1.99 1.32 1.53
5 1.71 1.32 2.06 1.36
6 2.66 2.11 1.04 1.65
7 2.49 2.54 1.99 2.12
8 2.37 2.06 1.52 1.73

The value of each cell represents the NEC status in one of the four groups (A, B, C and D). You cannot run aov(data) directly; if you do, here is what you get:

Error in terms.default(formula, "Error") :
  no terms component nor attribute

This is because aov() needs a formula, here NEC~groups, and so far the data frame has neither an NEC column nor a groups column.

Since this wide format is not ready to use, we need to “stack” it into two columns: one holding the NEC value, the other holding the group label.

> data1<-stack(data)
> data1
values ind
1 1.97 A
2 0.85 A
3 1.79 A
4 2.30 A
5 1.71 A
6 2.66 A
7 2.49 A
8 2.37 A
9 1.81 A
10 2.51 A
11 2.38 A

……

Then we can use this format to dive in for further analysis.

> names(data1)<-c("NEC_status","Group")
> anova_data1<-aov(NEC_status~Group, data=data1)
> summary(anova_data1)
Df Sum Sq Mean Sq F value Pr(>F)
Group 3 2.638 0.8792 4.944 0.00255 **
Residuals 174 30.942 0.1778

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
90 observations deleted due to missingness
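
To try the whole stack() and aov() sequence yourself, here is a minimal sketch that types in only the eight rows printed at the top of this section. The full data set is longer and contains missing values (the summary above reports 90 deleted observations), so the numbers this produces will not match the table above. The object name fit is just a placeholder.

```r
# Only the eight rows shown earlier, entered by hand.
# This is NOT the full data set, so the ANOVA table will differ.
data <- data.frame(
  A = c(1.97, 0.85, 1.79, 2.30, 1.71, 2.66, 2.49, 2.37),
  B = c(1.38, 1.86, 2.26, 1.99, 1.32, 2.11, 2.54, 2.06),
  C = c(1.87, 1.90, 2.43, 1.32, 2.06, 1.04, 1.99, 1.52),
  D = c(1.77, 1.68, 1.46, 1.53, 1.36, 1.65, 2.12, 1.73)
)

data1 <- stack(data)                      # two columns: values, ind
names(data1) <- c("NEC_status", "Group")  # names used in the formula

fit <- aov(NEC_status ~ Group, data = data1)  # one-way ANOVA
summary(fit)
```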

 

There is also a companion function, unstack(), which reverses the stacking.

According to the R documentation, stack() and unstack() are used to “Stack or Unstack Vectors from a Data Frame or List”. The “select =” argument of stack() lets you specify which columns to stack; unstack() takes a formula instead.

Examples

require(stats)

formula(PlantGrowth)         # check the default formula

pg <- unstack(PlantGrowth)   # unstack according to this formula

pg

stack(pg)                    # now put it back together

stack(pg, select = -ctrl)    # omitting one vector
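
The same calls work on a small data frame of your own. The sketch below uses the first three rows of columns A, B and C from the table at the top of this section (the name wide is only illustrative). It shows select with a positive and a negative selection, then a round trip through unstack().

```r
# First three rows of columns A, B and C from the earlier table
wide <- data.frame(
  A = c(1.97, 0.85, 1.79),
  B = c(1.38, 1.86, 2.26),
  C = c(1.87, 1.90, 2.43)
)

long_ab   <- stack(wide, select = c(A, B))  # keep only A and B
long_no_c <- stack(wide, select = -C)       # same result: drop C

back <- unstack(stack(wide))  # uses the default formula values ~ ind
back
```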

stack() converts the data into two columns, values and ind. But what if we need more than that?

For example, suppose the data frame looks like this:

data
task GroupA GroupB
1 1.97 1.38
2 0.85 1.86
3 1.79 2.26
4 2.30 1.99
5 1.71 1.32
6 2.66 2.11
7 2.49 2.54
8 2.37 2.06

Suppose you need to draw a bar plot of NEC status against task, with the two groups side by side.

2. melt() function

The melt() function from the reshape2 package takes data in wide format and stacks a set of columns into a single column. To use it we specify a data frame, the id variables (which are kept as they are) and the measured variables (the columns to be stacked). If the measured variables are left out, every column that is not an id variable is stacked.

> melt(data, id.vars = "task")
task variable value
1 1 GroupA 1.97
2 2 GroupA 0.85
3 3 GroupA 1.79
4 4 GroupA 2.30
5 5 GroupA 1.71
6 6 GroupA 2.66
7 7 GroupA 2.49
8 8 GroupA 2.37
9 1 GroupB 1.38
10 2 GroupB 1.86
11 3 GroupB 2.26
12 4 GroupB 1.99
13 5 GroupB 1.32
14 6 GroupB 2.11
15 7 GroupB 2.54
16 8 GroupB 2.06
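
If you prefer to name things explicitly, melt() also accepts measure.vars, variable.name and value.name. Here is a small sketch, assuming the reshape2 package is installed and using only the first three rows of the table above. The output column names Group and NEC_status are my own choice for illustration.

```r
library(reshape2)  # provides melt()

# First three rows of the wide table shown above
data <- data.frame(
  task   = 1:3,
  GroupA = c(1.97, 0.85, 1.79),
  GroupB = c(1.38, 1.86, 2.26)
)

long <- melt(
  data,
  id.vars       = "task",                  # kept as an identifier
  measure.vars  = c("GroupA", "GroupB"),   # columns to be stacked
  variable.name = "Group",                 # replaces the default "variable"
  value.name    = "NEC_status"             # replaces the default "value"
)
long
```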

You can then use the “variable” column to group your data and pass it straight to ggplot() to draw the bars for the two groups.
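
The plotting step might look something like this. It is a rough sketch, assuming the ggplot2 and reshape2 packages are installed. Using geom_col() with position = "dodge" is one common way to put bars side by side, and the axis labels are my own.

```r
library(reshape2)  # melt()
library(ggplot2)   # ggplot(), geom_col()

# The wide table from above: one row per task, one column per group
data <- data.frame(
  task   = 1:8,
  GroupA = c(1.97, 0.85, 1.79, 2.30, 1.71, 2.66, 2.49, 2.37),
  GroupB = c(1.38, 1.86, 2.26, 1.99, 1.32, 2.11, 2.54, 2.06)
)

long <- melt(data, id.vars = "task")  # columns: task, variable, value

# One bar per task and group; "dodge" places the groups next to each other
p <- ggplot(long, aes(x = factor(task), y = value, fill = variable)) +
  geom_col(position = "dodge") +
  labs(x = "Task", y = "NEC status", fill = "Group")
print(p)
```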

Easy peasy. 🙂

 
