Title: | Lightning Fast Serialization of Data Frames |
---|---|
Description: | Multithreaded serialization of compressed data frames using the 'fst' format. The 'fst' format allows for full random access of stored data and a wide range of compression settings using the LZ4 and ZSTD compressors. |
Authors: | Mark Klik [aut, cre, cph] |
Maintainer: | Mark Klik <[email protected]> |
License: | AGPL-3 \| file LICENSE |
Version: | 0.9.9 |
Built: | 2024-12-28 05:25:25 UTC |
Source: | https://github.com/fstpackage/fst |
Multithreaded serialization of compressed data frames using the 'fst' format. The 'fst' format allows for random access of stored data which can be compressed with the LZ4 and ZSTD compressors.
The fst package is based on three C++ libraries:
fstlib: library containing code to write, read and compute on files stored in the fst format. Written and owned by Mark Klik.
LZ4: library containing code to compress data with the LZ4 compressor. Written and owned by Yann Collet.
ZSTD: library containing code to compress data with the ZSTD compressor. Written by Yann Collet and owned by Facebook, Inc.
As of version 0.9.8, these libraries are included in the fstcore package, on which fst depends. The copyright notices of the above libraries can be found in the fstcore package.
Maintainer: Mark Klik [email protected] [copyright holder]
Compress a raw vector using the LZ4 or ZSTD compressor.
compress_fst(x, compressor = "ZSTD", compression = 0, hash = FALSE)
Argument | Description |
---|---|
x | raw vector. |
compressor | compressor to use for compressing x, either "LZ4" or "ZSTD" (the default). |
compression | compression factor used. Must be in the range 0 (lowest compression) to 100 (maximum compression). |
hash | Compute hash of compressed data. This hash is stored in the resulting raw vector and can be used during decompression to check the validity of the compressed vector. Hash computation is done with the very fast xxHash algorithm and implemented as a parallel operation, so the performance hit will be very small. |
Decompress a raw vector with compressed data.
decompress_fst(x)
Argument | Description |
---|---|
x | raw vector with data previously compressed with compress_fst. |
a raw vector with the decompressed data.
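The two functions above are meant to be used as a pair. The following is a minimal round-trip sketch, assuming the fst package is installed; the sample data and the variable names (raw_in, raw_zstd, raw_lz4, raw_out) are made up for illustration.

```r
library(fst)

# Build a compressible raw vector by serializing repeated text
raw_in <- serialize(rep("fst compresses raw vectors", 1000), NULL)

# Compress with ZSTD at a mid-range setting and store a hash for validation
raw_zstd <- compress_fst(raw_in, compressor = "ZSTD", compression = 50, hash = TRUE)

# Compress the same data with LZ4
raw_lz4 <- compress_fst(raw_in, compressor = "LZ4", compression = 50)

# decompress_fst takes only the compressed vector
raw_out <- decompress_fst(raw_zstd)

# Compare the round-tripped data and the sizes of the three vectors
identical(raw_in, raw_out)
c(original = length(raw_in), zstd = length(raw_zstd), lz4 = length(raw_lz4))
```

The compressed sizes depend on the input and on the compression setting, so it is worth measuring them on representative data before choosing a compressor.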
Create a fst_table object that can be accessed like a regular data frame. This object is just a reference to the actual data and requires only a small amount of memory. When data is accessed, only a subset is read from file, depending on the minimum and maximum requested row number. This is possible because the fst file format allows full random access (in columns and rows) to the stored dataset.
fst(path, old_format = FALSE)
Argument | Description |
---|---|
path | path to fst file |
old_format | must be FALSE, the old fst file format is deprecated and can only be read and converted with fst package versions 0.8.0 to 0.8.10. |
An object of class fst_table
## Not run:
# generate a sample fst file
path <- paste0(tempfile(), ".fst")
write_fst(iris, path)

# create a fst_table object that can be used as a data frame
ft <- fst(path)

# print head and tail
print(ft)

# select columns and rows
x <- ft[10:14, c("Petal.Width", "Species")]

# use the common list interface
ft[TRUE]
ft[c(TRUE, FALSE)]
ft[["Sepal.Length"]]
ft$Petal.Length

# use data frame generics
nrow(ft)
ncol(ft)
dim(ft)
dimnames(ft)
colnames(ft)
rownames(ft)
names(ft)
## End(Not run)
Parallel calculation of the hash of a raw vector
hash_fst(x, seed = NULL, block_hash = TRUE)
Argument | Description |
---|---|
x | raw vector that you want to hash |
seed | The seed value for the hashing algorithm. If NULL, a default seed will be used. |
block_hash | If TRUE, a multi-threaded implementation of the 64-bit xxHash algorithm will be used. Note that block_hash = TRUE or block_hash = FALSE will result in different hash values. |
hash value
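As an illustration of the arguments above, here is a small sketch that hashes the same raw vector with different settings. It assumes the fst package is installed; the input vector and the seed value 42L are arbitrary choices for the example.

```r
library(fst)

# Roughly 1 MB of sample bytes
x <- as.raw(rep(0:255, 4000))

# Default: multi-threaded block hash with the default seed
h_block <- hash_fst(x)

# Single-threaded variant; documented to give a different value than block_hash = TRUE
h_single <- hash_fst(x, block_hash = FALSE)

# Custom seed (an arbitrary value chosen for this sketch)
h_seeded <- hash_fst(x, seed = 42L)

# Hashing the same input with the same settings is repeatable
identical(h_block, hash_fst(x))

# Compare the two modes
identical(h_block, h_single)
```

Because the two modes are not interchangeable, a stored hash should always be checked with the same block_hash and seed settings that produced it.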
Method for checking basic properties of the dataset stored in path.
metadata_fst(path, old_format = FALSE)
fst.metadata(path, old_format = FALSE)
Argument | Description |
---|---|
path | path to fst file |
old_format | must be FALSE, the old fst file format is deprecated and can only be read and converted with fst package versions 0.8.0 to 0.8.10. |
Returns a list with meta information on the stored dataset in path. Has class fstmetadata.
# Sample dataset
x <- data.frame(
  First = 1:10,
  Second = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
  Last = sample(LETTERS, 10)
)

# Write to fst file
fst_file <- tempfile(fileext = ".fst")
write_fst(x, fst_file)

# Display meta information
metadata_fst(fst_file)
For parallel operations, performance is determined to a great extent by the number of threads used. More threads allow the CPU to perform more computationally intensive tasks simultaneously, speeding up the operation. Using more threads also introduces some overhead that scales with the number of threads used. Therefore, using the maximum number of available threads is not always the fastest solution. With threads_fst the number of threads can be adjusted to the user's specific requirements. By default, fst uses a number of threads equal to the number of logical cores in the system.
threads_fst(nr_of_threads = NULL, reset_after_fork = NULL)
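Argument | Description |
---|---|
nr_of_threads | number of threads to use, or NULL to get the current number of threads used in multithreaded operations. |
reset_after_fork | when |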
The number of threads can also be set with options(fst_threads = N).
NOTE: This option is only read when the package's namespace is first loaded, with commands like library, require, or ::. If you have already used one of these, you must use threads_fst to set the number of threads.
the number of threads (previously) used
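A common pattern is to change the thread count temporarily and restore it afterwards. The sketch below assumes the fst package is installed; the value 2 is an arbitrary choice, and the best setting depends on the machine and the workload.

```r
library(fst)

# Query the current setting without changing it
current <- threads_fst()

# Switch to two threads; the return value is the number of threads used before the call
previous <- threads_fst(2)

# ... time read_fst() / write_fst() calls here ...

# Restore the original setting
threads_fst(current)
```

Timing the same read or write at a few different thread counts is a simple way to check whether the maximum is actually the fastest setting on a given system.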
Read and write data frames from and to a fast-storage ('fst') file. Allows for compression and (file level) random access of stored data, even for compressed datasets. Multiple threads are used to obtain high (de-)serialization speeds but all background threads are re-joined before 'write_fst' and 'read_fst' return (reads and writes are stable). When using a 'data.table' object for 'x', the key (if any) is preserved, allowing storage of sorted data. Methods 'read_fst' and 'write_fst' are equivalent to 'read.fst' and 'write.fst' (but the former syntax is preferred).
write_fst(x, path, compress = 50, uniform_encoding = TRUE)

write.fst(x, path, compress = 50, uniform_encoding = TRUE)

read_fst(
  path,
  columns = NULL,
  from = 1,
  to = NULL,
  as.data.table = FALSE,
  old_format = FALSE
)

read.fst(
  path,
  columns = NULL,
  from = 1,
  to = NULL,
  as.data.table = FALSE,
  old_format = FALSE
)
Argument | Description |
---|---|
x | a data frame to write to disk |
path | path to fst file |
compress | value in the range 0 to 100, indicating the amount of compression to use. Lower values mean larger file sizes. The default compression is set to 50. |
uniform_encoding | If 'TRUE', all character vectors will be assumed to have elements with equal encoding. The encoding (latin1, UTF8 or native) of the first non-NA element will be used as the encoding for the whole column. This will be a correct assumption for most use cases. If 'uniform_encoding' is set to 'FALSE', no such assumption will be made and all elements will be converted to the same encoding. The latter is a relatively expensive operation and will reduce write performance for character columns. |
columns | Column names to read. The default is to read all columns. |
from | Read data starting from this row number. |
to | Read data up until this row number. The default is to read to the last row of the stored dataset. |
as.data.table | If TRUE, the result will be returned as a data.table object. |
old_format | must be FALSE, the old fst file format is deprecated and can only be read and converted with fst package versions 0.8.0 to 0.8.10. |
'read_fst' returns a data frame with the selected columns and rows. 'write_fst' writes 'x' to a 'fst' file and invisibly returns 'x' (so you can use this function in a pipeline).
# Sample dataset
x <- data.frame(A = 1:10000, B = sample(c(TRUE, FALSE, NA), 10000, replace = TRUE))

# Default compression
fst_file <- tempfile(fileext = ".fst")
write_fst(x, fst_file) # filesize: 17 KB
y <- read_fst(fst_file) # read fst file

# Maximum compression
write_fst(x, fst_file, 100) # fileSize: 4 KB
y <- read_fst(fst_file) # read fst file

# Random access
y <- read_fst(fst_file, "B") # read selection of columns
y <- read_fst(fst_file, "A", 100, 200) # read selection of columns and rows
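Because read_fst accepts from and to, a file can also be processed in row chunks instead of being loaded all at once. The following is a rough sketch of that idea, assuming the fst package is installed; the chunk size of 2500 and the variable names are hypothetical choices for the example.

```r
library(fst)

# Write a sample file
fst_file <- tempfile(fileext = ".fst")
write_fst(data.frame(A = 1:10000, B = rnorm(10000)), fst_file)

# Get the row count through a fst_table reference, without loading the data
n_rows <- nrow(fst(fst_file))

chunk_size <- 2500  # hypothetical chunk size, tune for available memory
total <- 0

# Read one column in consecutive row ranges and accumulate a running sum
for (start in seq(1, n_rows, by = chunk_size)) {
  end <- min(start + chunk_size - 1, n_rows)
  chunk <- read_fst(fst_file, columns = "A", from = start, to = end)
  total <- total + sum(chunk$A)
}

# The chunked result matches a full read of the column
total == sum(read_fst(fst_file, columns = "A")$A)
```

Only the requested column and row range are read in each iteration, so peak memory use is bounded by the chunk size rather than by the size of the stored dataset.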