
Fast (Grouped, Weighted) Sum for Matrix-Like Objects
fsum.Rd
fsum
is a generic function that computes the (column-wise) sum of all values in x
, (optionally) grouped by g
and/or weighted by w
(e.g. to calculate survey totals). The TRA
argument can further be used to transform x
using its (grouped, weighted) sum.
Usage
fsum(x, ...)
# S3 method for default
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)
# S3 method for matrix
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)
# S3 method for data.frame
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)
# S3 method for grouped_df
fsum(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE,
keep.w = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)
Arguments
- x
a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df').
- g
a factor,
GRP
object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to aGRP
object) used to groupx
.- w
a numeric vector of (non-negative) weights, may contain missing values.
- TRA
an integer or quoted operator indicating the transformation to perform: 0 - "NA" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See
TRA
.- na.rm
logical. Skip missing values in
x
. Defaults toTRUE
and implemented at very little computational cost. Ifna.rm = FALSE
aNA
is returned when encountered.- use.g.names
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's.
- fill
logical. Initialize result with
0
instead ofNA
whenna.rm = TRUE
e.g.fsum(NA, fill = TRUE)
returns0
instead ofNA
.- nthreads
integer. The number of threads to utilize. See Details.
- drop
matrix and data.frame method: Logical.
TRUE
drops dimensions and returns an atomic vector ifg = NULL
andTRA = NULL
.- keep.group_vars
grouped_df method: Logical.
FALSE
removes grouping variables after computation.- keep.w
grouped_df method: Logical. Retain summed weighting variable after computation (if contained in
grouped_df
).- ...
arguments to be passed to or from other methods. If
TRA
is used, passingset = TRUE
will transform data by reference and return the result invisibly.
Details
The weighted sum (e.g. survey total) is computed as sum(x * w)
, but in one pass and about twice as efficient. If na.rm = TRUE
, missing values will be removed from both x
and w
i.e. utilizing only x[complete.cases(x,w)]
and w[complete.cases(x,w)]
.
This all seamlessly generalizes to grouped computations, which are performed in a single pass (without splitting the data) and are therefore extremely fast. See Benchmark and Examples below.
When applied to data frames with groups or drop = FALSE
, fsum
preserves all column attributes. The attributes of the data frame itself are also preserved.
Since v1.6.0 fsum
explicitly supports integers. Integers are summed using the long long type in C which is bounded at +-9,223,372,036,854,775,807 (so ~4.3 billion times greater than the minimum/maximum R integer bounded at +-2,147,483,647). If the value of the sum is outside +-2,147,483,647, a double containing the result is returned, otherwise an integer is returned. With groups, an integer results vector is initialized, and an integer overflow error is provided if the sum in any group is outside +-2,147,483,647. Data needs to be coerced to double beforehand in such cases.
Multithreading, added in v1.8.0, applies at the column-level unless g = NULL
and nthreads > NCOL(x)
. Parallelism over groups is not available because sums are computed simultaneously within each group. nthreads = 1L
uses a serial version of the code, not parallel code running on one thread. This serial code is always used with less than 100,000 obs (length(x) < 100000
for vectors and matrices), because parallel execution itself has some overhead.
Value
The (w
weighted) sum of x
, grouped by g
, or (if TRA
is used) x
transformed by its (grouped, weighted) sum.
Examples
## default vector method
mpg <- mtcars$mpg
fsum(mpg) # Simple sum
fsum(mpg, w = mtcars$hp) # Weighted sum (total): Weighted by hp
fsum(mpg, TRA = "%") # Simple transformation: obtain percentages of mpg
fsum(mpg, mtcars$cyl) # Grouped sum
fsum(mpg, mtcars$cyl, mtcars$hp) # Weighted grouped sum (total)
fsum(mpg, mtcars[c(2,8:9)]) # More groups..
g <- GRP(mtcars, ~ cyl + vs + am) # Precomputing groups gives more speed !
fsum(mpg, g)
fmean(mpg, g) == fsum(mpg, g) / fnobs(mpg, g)
fsum(mpg, g, TRA = "%") # Percentages by group
## data.frame method
fsum(mtcars)
fsum(mtcars, TRA = "%")
fsum(mtcars, g)
fsum(mtcars, g, TRA = "%")
## matrix method
m <- qM(mtcars)
fsum(m)
fsum(m, TRA = "%")
fsum(m, g)
fsum(m, g, TRA = "%")
## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
mtcars %>% group_by(cyl,vs,am) %>% fsum(hp) # Weighted grouped sum (total)
mtcars %>% fgroup_by(cyl,vs,am) %>% fsum(hp) # Equivalent and faster !!
mtcars %>% fgroup_by(cyl,vs,am) %>% fsum(TRA = "%")
mtcars %>% fgroup_by(cyl,vs,am) %>% fselect(mpg) %>% fsum()
Benchmark
## This compares fsum with data.table (2 threads) and base::rowsum
# Starting with small data
mtcDT <- qDT(mtcars)
f <- qF(mtcars$cyl)
library(microbenchmark)
microbenchmark(mtcDT[, lapply(.SD, sum), by = f],
rowsum(mtcDT, f, reorder = FALSE),
fsum(mtcDT, f, na.rm = FALSE), unit = "relative")
expr min lq mean median uq max neval cld
mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726 100 c
rowsum(mtcDT, f, reorder = FALSE) 2.833333 2.798203 2.489064 2.937889 2.425724 2.181173 100 b
fsum(mtcDT, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a
# Now larger data
tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE)) # 100 columns with 100.000 obs
f <- qF(sample.int(1e4, 1e5, TRUE)) # A factor with 10.000 groups
microbenchmark(tdata[, lapply(.SD, sum), by = f],
rowsum(tdata, f, reorder = FALSE),
fsum(tdata, f, na.rm = FALSE), unit = "relative")
expr min lq mean median uq max neval cld
tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475 100 c
rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937 100 b
fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100 a