The melt
and dcast
functions for
data.tables are for reshaping wide-to-long and long-to-wide,
respectively
The extended functionalities are in line with data.table’s philosophy of performing operations efficiently and in a straightforward manner.
s1 <- “family_id age_mother dob_child1 dob_child2 dob_child3 1 30
1998-11-26 2000-01-29 NA 2 27 1996-06-22 NA NA 3 26 2002-07-11
2004-04-05 2007-09-02 4 32 2004-10-10 2009-08-27 2012-07-21 5 29
2000-12-05 2005-02-28 NA” DT <- fread(s1) DT # family_id age_mother
dob_child1 dob_child2 dob_child3 # 1: 1 30 1998-11-26 2000-01-29
str(DT) # Classes ‘data.table’ and ‘data.frame’: 5 obs. of 5
variables: # $ family_id : int 1 2 3 4 5 # $ age_mother: int 30 27 26 32
29 # $ dob_child1: IDate, format: “1998-11-26” “1996-06-22” “2002-07-11”
… # $ dob_child2: IDate, format: “2000-01-29” NA “2004-04-05” … # $
dob_child3: IDate, format: NA NA “2007-09-02” … # - attr(*,
“.internal.selfref”)=
DT.m1 = melt(DT, id.vars = c(“family_id”, “age_mother”), measure.vars
= c(“dob_child1”, “dob_child2”, “dob_child3”)) DT.m1 # family_id
age_mother variable value # 1: 1 30 dob_child1 1998-11-26 # 2: 2 27
dob_child1 1996-06-22 # 3: 3 26 dob_child1 2002-07-11 # 4: 4 32
dob_child1 2004-10-10 # 5: 5 29 dob_child1 2000-12-05 # 6: 1 30
dob_child2 2000-01-29 # 7: 2 27 dob_child2
We can also specify column indices instead of names.
By default, variable column is of type factor. Set variable.factor argument to FALSE if you’d like to return a character vector instead.
By default, the molten columns are automatically named variable and value.
melt preserves column attributes in result.
When neither id.vars nor measure.vars are specified, as mentioned under ?melt, all non-numeric, integer, logical columns will be assigned to id.vars.
In addition, a warning message is issued highlighting the columns that are automatically considered to be id.vars.
dcast(DT.m1, family_id + age_mother ~ child, value.var = “dob”) #
family_id age_mother dob_child1 dob_child2 dob_child3 # 1: 1 30
1998-11-26 2000-01-29
value.var denotes the column to be filled in with while casting to wide format.
dcast also tries to preserve attributes in result wherever possible.
dcast(DT.m1, family_id ~ ., fun.agg = function(x) sum(!is.na(x)), value.var = “dob”) # family_id . # 1: 1 2 # 2: 2 1 # 3: 3 3 # 4: 4 3 # 5: 5 2 Check ?dcast for other useful arguments and additional examples.
However, there are situations we might run into where the desired operation is not expressed in a straightforward manner. For example, consider the data.table shown below:
s2 <- “family_id age_mother dob_child1 dob_child2 dob_child3
gender_child1 gender_child2 gender_child3 1 30 1998-11-26 2000-01-29 NA
1 2 NA 2 27 1996-06-22 NA NA 2 NA NA 3 26 2002-07-11 2004-04-05
2007-09-02 2 2 1 4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1 5 29
2000-12-05 2005-02-28 NA 2 1 NA” DT <- fread(s2) DT # family_id
age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2
gender_child3 # 1: 1 30 1998-11-26 2000-01-29
DT.m1 = melt(DT, id = c(“family_id”, “age_mother”)) DT.m1[,
c(“variable”, “child”) := tstrsplit(variable, “_“, fixed = TRUE)] DT.c1
= dcast(DT.m1, family_id + age_mother + child ~ variable, value.var =
“value”) DT.c1 # family_id age_mother child dob gender # 1: 1 30 child1
1998-11-26 1970-01-02 # 2: 1 30 child2 2000-01-29 1970-01-03 # 3: 1 30
child3
str(DT.c1) ## gender column is character type now! # Classes
‘data.table’ and ‘data.frame’: 15 obs. of 5 variables: # $ family_id :
int 1 1 1 2 2 2 3 3 3 4 … # $ age_mother: int 30 30 30 27 27 27 26 26 26
32 … # $ child : chr “child1” “child2” “child3” “child1” … # $ dob :
IDate, format: “1998-11-26” “2000-01-29” NA … # $ gender : IDate,
format: “1970-01-02” “1970-01-03” NA … # - attr(,
“.internal.selfref”)=
As an analogy, imagine you’ve a closet with four shelves of clothes and you’d like to put together the clothes from shelves 1 and 2 together (in 1), and 3 and 4 together (in 3). What we are doing is more or less to combine all the clothes together, and then split them back on to shelves 1 and 3!
The columns to melt may be of different types, as in this case (character and integer types). By melting them all together, the columns will be coerced in result, as explained by the warning message above and shown from output of str(DT.c1), where gender has been converted to character type.
We are generating an additional column by splitting the variable column into two columns, whose purpose is quite cryptic. We do it because we need it for casting in the next step.
Finally, we cast the data set. But the issue is it’s a much more computationally involved operation than melt. Specifically, it requires computing the order of the variables in formula, and that’s costly.
In fact, stats::reshape is capable of performing this operation in a very straightforward manner. It is an extremely useful and often underrated function. You should definitely give it a try!
colA = paste(“dob_child”, 1:3, sep = ““) colB = paste(”gender_child”,
1:3, sep = ““) DT.m2 = melt(DT, measure = list(colA, colB), value.name =
c(”dob”, “gender”)) DT.m2 # family_id age_mother variable dob gender #
1: 1 30 1 1998-11-26 1 # 2: 2 27 1 1996-06-22 2 # 3: 3 26 1 2002-07-11 2
# 4: 4 32 1 2004-10-10 1 # 5: 5 29 1 2000-12-05 2 # 6: 1 30 2 2000-01-29
2 # 7: 2 27 2
str(DT.m2) ## col type is preserved # Classes ‘data.table’ and
‘data.frame’: 15 obs. of 5 variables: # $ family_id : int 1 2 3 4 5 1 2
3 4 5 … # $ age_mother: int 30 27 26 32 29 30 27 26 32 29 … # $ variable
: Factor w/ 3 levels “1”,“2”,“3”: 1 1 1 1 1 2 2 2 2 2 … # $ dob : IDate,
format: “1998-11-26” “1996-06-22” “2002-07-11” … # $ gender : int 1 2 2
1 2 2 NA 2 1 1 … # - attr(*, “.internal.selfref”)=
DT.m2 = melt(DT, measure = patterns(“^dob”, “^gender”), value.name =
c(“dob”, “gender”)) DT.m2 # family_id age_mother variable dob gender #
1: 1 30 1 1998-11-26 1 # 2: 2 27 1 1996-06-22 2 # 3: 3 26 1 2002-07-11 2
# 4: 4 32 1 2004-10-10 1 # 5: 5 29 1 2000-12-05 2 # 6: 1 30 2 2000-01-29
2 # 7: 2 27 2
We can remove the variable column if necessary.
The functionality is implemented entirely in C, and is therefore both fast and memory efficient in addition to being straightforward.
If we use the current functionality of dcast, then we’d have to cast twice and bind the results together. But that’s once again verbose, not straightforward and is also inefficient.
DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var =
c(“dob”, “gender”)) DT.c2 # family_id age_mother dob_1 dob_2 dob_3
gender_1 gender_2 gender_3 # 1: 1 30 1998-11-26 2000-01-29
Everything is taken care of internally, and efficiently. In addition to being fast, it is also very memory efficient.
Multiple functions to fun.aggregate: You can also provide multiple functions to fun.aggregate to dcast for data.tables. Check the examples in ?dcast which illustrates this functionality.