R performance tests

programming
Author

Patrick Baylis

Published

February 3, 2017

Warning

Update (July 2020): These tests are quite dated at this point, YMMV. Software changes quickly!

R date conversion speed test (as.IDate vs fast_strptime)

library(data.table)
library(lubridate)

n <- 10000000
x <- rep("2014-01-01", n)

# The format has to match the dashes in x; "%Y%m%d" would parse every value to NA.
system.time(r1 <- as.IDate(x, format="%Y-%m-%d"))
system.time(r2 <- as.IDate(parse_date_time(x, orders="%Y-%m-%d", exact=TRUE)))
system.time(r3 <- as.IDate(fast_strptime(x, format="%Y-%m-%d")))

Winner: fast_strptime, by roughly a factor of two over the IDate parser (which, for character input, appears to hand off to the base Date parser).
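
The timings only mean something if all three calls parse the input to the same dates. Here is a minimal sanity check on a small sample, assuming the strings use the dashed year-month-day layout. The fmt variable and the shorter vector are illustrative choices, not part of the original benchmark.

```r
library(data.table)
library(lubridate)

# Small sample; the benchmark itself uses 10 million elements.
x <- rep("2014-01-01", 1000)
fmt <- "%Y-%m-%d"  # must match the separators used in x

r1 <- as.IDate(x, format = fmt)
r2 <- as.IDate(parse_date_time(x, orders = fmt, exact = TRUE))
r3 <- as.IDate(fast_strptime(x, format = fmt))

# No parser should return NA, and all three should agree on the day count.
stopifnot(
  !anyNA(r1), !anyNA(r2), !anyNA(r3),
  all(as.integer(r1) == as.integer(r2)),
  all(as.integer(r1) == as.integer(r3))
)
```

If any of these checks fail, the timing comparison is measuring failed parses rather than real conversions.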

Pattern matching (grep vs stri_detect)

What’s the fastest way to match strings? This code compares grep to stri_detect_* (from the stringi package), considering both fixed and regex matching.

library(microbenchmark)
library(stringi)
library(ggplot2)

R <- 100000
g <- replicate(R, paste0(sample(c(letters[1:5]," "), 10, replace=TRUE),
                               collapse=""))

m <- microbenchmark(
  grep(" ", g ),
  stri_detect_regex(g, " "),
  grep(" ", g, perl=TRUE),
  grep(" ", g, fixed=TRUE),
  stri_detect_fixed(g, " ")
)
autoplot(m)

Results are similar for gsub. For a comparison of stringi to stringr, see here.

See also here for improving grep performance.
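
One caveat when reading the comparison: grep returns the indices of matching elements, while stri_detect_* returns a logical vector, so grepl is the closer like-for-like counterpart. The sketch below checks on a small sample that the approaches agree. The seed and the sample size of 1000 are assumptions made for illustration.

```r
library(stringi)

set.seed(1)
g <- replicate(1000, paste0(sample(c(letters[1:5], " "), 10, replace = TRUE),
                            collapse = ""))

idx      <- grep(" ", g, fixed = TRUE)       # integer positions of matches
hit_base <- grepl(" ", g, fixed = TRUE)      # logical, one entry per string
hit_stri <- stri_detect_fixed(g, " ")        # logical, one entry per string

# The logical vectors should be the same, and grep's indices should be
# exactly the positions where the logical vector is TRUE.
stopifnot(
  identical(hit_base, hit_stri),
  identical(idx, which(hit_stri))
)
```

Swapping grepl into the microbenchmark call in place of grep would make the timings compare functions that return the same kind of result.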

Read CSV (fread vs read_csv)

I use fread (from the data.table package) for my day-to-day data munging in R, but occasionally read_csv (from the readr package) is more useful, for example when CSVs are formatted in a tricky way or when I’d prefer to have dates read in automatically. It’s helpful to know what kind of performance tradeoff I’m making. The following code times writing and reading a character column and a numeric column with each package. Timings, in seconds, are given in the comments.

library(data.table)
library(readr)
library(stringi)

# Create test dataframes
n <- 10000000
# Keep x as character rather than letting data.frame() convert it to a factor
df1 <- data.frame(x=stri_rand_strings(n, 5, '[A-Z]'), stringsAsFactors=FALSE)
df2 <- data.frame(x=round(rnorm(n), 3))

dt1 <- data.table(df1)
dt2 <- data.table(df2)

system.time(write_csv(df1, "dt1_df.csv")) # 3.8
system.time(write_csv(df2, "dt2_df.csv")) # 3.1
system.time(fwrite(dt1, "dt1_dt.csv")) # 0.6
system.time(fwrite(dt2, "dt2_dt.csv")) # 1.3

system.time(in.df1 <- read_csv("dt1_df.csv")) # 4.9
system.time(in.df2 <- read_csv("dt2_df.csv")) # 2.2
system.time(in.dt1 <- fread("dt1_dt.csv")) # 2.7
system.time(in.dt2 <- fread("dt2_dt.csv")) # 1.0

So data.table is somewhere between two and six times as fast at writing, depending on the column type (about three and a half times overall), and about twice as fast at reading.
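
The speedups can be worked out directly from the timings in the comments above. The short calculation below sums the character and numeric timings for each package and takes the ratio. The variable names are illustrative.

```r
# Timings (seconds) copied from the comments in the code above.
write_secs <- c(write_csv = 3.8 + 3.1, fwrite = 0.6 + 1.3)
read_secs  <- c(read_csv  = 4.9 + 2.2, fread  = 2.7 + 1.0)

# How many times longer readr takes than data.table for each task.
write_ratio <- unname(write_secs["write_csv"] / write_secs["fwrite"])
read_ratio  <- unname(read_secs["read_csv"] / read_secs["fread"])

round(c(write = write_ratio, read = read_ratio), 1)
# write  read
#   3.6   1.9
```

These are single runs from system.time, so the ratios are rough indications rather than precise measurements.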

Write CSV (fwrite vs write_csv)

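Anecdotally, fwrite is faster. The write timings in the previous section point the same way: 0.6 and 1.3 seconds for fwrite versus 3.8 and 3.1 seconds for write_csv.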
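
One way to put numbers on that claim is a repeated benchmark rather than a single run. The following is a minimal sketch, assuming a numeric column of 100,000 rows and five repetitions. The sizes, seed, and temporary file names are illustrative choices, and results will vary by machine and package version.

```r
library(data.table)
library(readr)
library(microbenchmark)

set.seed(1)
n  <- 1e5
df <- data.frame(x = round(rnorm(n), 3))
dt <- as.data.table(df)

# Temporary files so the benchmark leaves nothing in the working directory.
f_readr <- tempfile(fileext = ".csv")
f_dt    <- tempfile(fileext = ".csv")

m <- microbenchmark(
  write_csv = write_csv(df, f_readr),
  fwrite    = fwrite(dt, f_dt),
  times = 5
)
print(m)

# Both files should hold the same values once read back.
back_readr <- fread(f_readr)
back_dt    <- fread(f_dt)
stopifnot(isTRUE(all.equal(back_readr$x, back_dt$x)))
```

Checking that both files read back to the same values guards against comparing writers that produce different output.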