About: This paper proposes a method for implementing dense matrix multiplication in FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA’s graphics processing units (GPUs). Tensor Cores are special processing units that perform 4 × 4 matrix multiplications on FP16 inputs with FP32 precision and return the result in FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation. The method has three prominent advantages: first, it can be built upon the cublasGemmEx routine using Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, up to and including the correctly rounded result; third, it ensures bit-level reproducibility even for different numbers of cores and threads. The achievable performance of the method depends on the absolute-value range of the elements of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64 operation on the Titan RTX GPU (with 130 TFlops on Tensor Cores), whereas cublasDgemm can achieve only 539 GFlops on the FP64 floating-point units. Our results reveal the possibility of utilizing hardware with limited FP32/FP64 resources and fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads.
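
The error-free transformation that the abstract refers to can be illustrated with a small sketch. This is not the authors' implementation and it does not call cublasGemmEx or any Tensor Core routine; it is a pure-Python model written under stated assumptions. The function names (slice_bits, split_vector, sliced_dot) are hypothetical, the 11-bit input and 24-bit accumulator widths stand in for FP16 and FP32 significands, and inputs are assumed to be finite and far from the underflow range. The idea shown is that each FP64 vector is split into slices narrow enough that every slice-by-slice dot product is exact, so the partial results do not depend on summation order.

```python
import math
from fractions import Fraction


def slice_bits(n, input_bits=11, acc_bits=24):
    """Widest slice (in bits) such that a length-n dot product of two slices
    still fits an acc_bits-bit significand (assumed FP16 in, FP32 out)."""
    growth = math.ceil(math.log2(n)) if n > 1 else 0
    return min(input_bits, (acc_bits - growth) // 2)


def split_vector(x, beta):
    """Split x into slices whose elementwise sum equals x exactly.

    Each slice holds values of the form k * 2**(e - beta) with |k| < 2**beta,
    where the exponent e is shared by the whole slice.
    """
    rest = list(x)
    slices = []
    while any(v != 0.0 for v in rest):
        # Shared exponent: max|rest| < 2**e.
        e = math.frexp(max(abs(v) for v in rest))[1]
        unit = math.ldexp(1.0, e - beta)  # weight of the lowest kept bit
        # Keep the leading beta bits (truncation toward zero).
        hi = [math.trunc(v / unit) * unit for v in rest]
        # Exact subtraction: only leading bits of v are removed.
        rest = [v - h for v, h in zip(rest, hi)]
        slices.append(hi)
    return slices


def sliced_dot(a, b, acc_bits=24):
    """Dot product assembled from exact slice-by-slice partial products.

    The final accumulation uses Fraction so the sketch can be checked against
    the exact result; a real implementation would sum in floating point.
    """
    beta = slice_bits(len(a), acc_bits=acc_bits)
    total = Fraction(0)
    for p in split_vector(a, beta):
        for q in split_vector(b, beta):
            # Every term is a small integer times one common power of two,
            # so this floating-point sum incurs no rounding error.
            partial = sum(u * v for u, v in zip(p, q))
            total += Fraction(partial)
    return total


a = [1.0 / 3.0, 1.0e9, -2.5e-3, 7.0]
b = [3.14159, -1.0e-9, 123456.789, 0.1]
exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
print(sliced_dot(a, b) == exact)
print(len(split_vector([1.0, 0.5], 11)), len(split_vector([1.0, 1.0e-9], 11)))
```

In this model a vector whose elements span a wide range of magnitudes needs more slices than one whose elements are close in magnitude, and the number of slice products grows with the product of the slice counts. That is consistent with the abstract's statement that performance depends on the absolute-value range of the inputs.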


An Entity of Type : fabio:Abstract, within Data Space : covidontheweb.inria.fr associated with source document(s)

Attributes	Values
type	abstract
subject	Technology companies based in the San Francisco Bay Area; Numerical software; OpenCL compute devices; Computer arithmetic; Graphics processing units; Floating point
part of	DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions
is abstract of	DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions
is hasSource of	covid:ann/target/3c992dced0c74cfd7f21d2ce8c784ce39d5273a2 covid:ann/target/d1e294f4bbb5e1568d76ded7a6a4eb8dd1db987a covid:ann/target/5944a535797cbc806d0cdc216d8de085c8eeed53 covid:ann/target/8049d6784359586993aaa68f60f405aa41a65b70 covid:ann/target/6ca833698b2897394077f4261e34f0b353bcfb50 covid:ann/target/ad23396591a6467a66493ed83ffd119917055bc7 covid:ann/target/31d5ead7e8b5967496cac83ddb31f0f343b66893 covid:ann/target/3ffd708fa44f4dea1dfb275d069a463cfb58fae5 covid:ann/target/060571d4e102fdc81d9a2bdb2df27b2f95b58e0c covid:ann/target/28ef8daa006e0fd6236d3439362b88e79fcd25c2 covid:ann/target/791461a1cd76020dd2b19cf1df155c82d6325bf1 covid:ann/target/9c2e7a933795eab1824e42be715ff842ef57fdd9 covid:ann/target/62365747f67dd526a35072ebba76d63f2355d4dc covid:ann/target/4d389aa48875d7f1f21a71cdba532d2a34acc3ef covid:ann/target/fdda3eeb213edf73d5f143959f76977c0e02935a covid:ann/target/43dfa67eac1d1e2b310cca9bb9d5f1a20e4bac88 covid:ann/target/b364105be157b73adc02dbc6fef5206b7f15d2f7 covid:ann/target/7d4ef38174305332bc39ea7767bade57a600bced covid:ann/target/8552bcee0236032f89a04bfc3e7dfcbcd8945376 covid:ann/target/8a1bc34424b414a0195e3cad106db9866fce1d16 covid:ann/target/83479b75c27566ff8f6c9518f6cad6ca6dda5832 covid:ann/target/e56b33522db5ea795e2c728b9f38a64021e52e34 covid:ann/target/19537f15b56cf629e34aa00c8245e2a368141b12 covid:ann/target/39fbd83b0473579aefb99dac779e5c9f74b255b1 covid:ann/target/7610b163a06d775822524dfdb6b515851e80fa25 covid:ann/target/0ceafdcd1fdaa6de936482636074b3e1446b2651 covid:ann/target/c53a6c68b573419acf446264d79babbbaa2dd9f1 covid:ann/target/648668fe005baf7beaa469a8dec4345ebc63d0ff covid:ann/target/e2adf7bc93ca9ffdab73ccd5d8af3640a3b68f0a covid:ann/target/5a9ce30d9fc577616a25f7029584ce8e9ff30489 covid:ann/target/f8b81f4d23623194430a1d5428167150e96c6d3f covid:ann/target/f0dbd6aec312787666ce08e09c2ec6c1829d0db5 covid:ann/target/02053fad44060e18607b46157ad881c2c6c87743 covid:ann/target/0228a755a3c844275e7883be6565898dc242a8aa 
covid:ann/target/2cd1f861ab340881881a881b9776f7ec26d132a1 covid:ann/target/4898e355e50ee116f32dc7cf1dda40983e0b7f66 covid:ann/target/7956761a4511e5a31b915f4f7f69059197f3d8c2 covid:ann/target/84bd944f2c4081edd42f551cea90b58e7ffc07df covid:ann/target/586966ccb512da6b43e2137a7fc3c5710921268e covid:ann/target/7f837084002167c54bf5cbe9c21a53d583f0192e covid:ann/target/ba8d1a856c9a0ee49702a4ce54ccc738ed3bdd74 covid:ann/target/f1565ac9bdc4f3868f6860eff751b0ef9dd19b9b covid:ann/target/667b9472ae467e65037f13a7b6c55a0ed55bbcb1 covid:ann/target/7e9e53a9c29da85340fb66e5fba974accd4ea7cc covid:ann/target/045b8238d05b7290b614efec70d96dc0a03d2e4e covid:ann/target/09f6edb1f2229c5e47efbbeac1e7d39d8d930be2 covid:ann/target/1aab09d1eb9de96080a469a1e0202d90c71c4750 covid:ann/target/6aeca7ebb196d55cab61c0cd68170a528ec42b3a covid:ann/target/c88e4e7f27bdcbebc50c7a2390227cf9037ae2fb covid:ann/target/3dd0809d2031b9ba8f6f30f281599935e1a91ac7 covid:ann/target/ee8c5d615a593b03412072304c9cd2b0bc25cb9a covid:ann/target/bd3fa5ec5dd29909ed08bf002f2aaa5b0feae6d0 covid:ann/target/72f99bffc4a508ea118f007f1d537788688149e9 covid:ann/target/94752fffc7b0b9b3081c84d5a3c9e9030368453e covid:ann/target/3b023ddb965f7183c6135f8eee9dc158208de0ca covid:ann/target/180e01112c959a118f82731fe9d3d99eb4ceb064 covid:ann/target/5b70a92b79dce3244b1980c94d670cda62bac44d covid:ann/target/3a30845483e2ca48ec7f691b4395b0db05e62d0f covid:ann/target/94b9d1c39428d07ae8766e2a48b4bedc413bebd7 covid:ann/target/baacafeba06df7f1e141e89ef226b48711e64dab covid:ann/target/11abb6861c4be3e72ac9480b7fd72287c537ba6e covid:ann/target/c80adbb3a731b9be59ee7296e91f5385a478c925 covid:ann/target/e169f606deb4766721f4eb56642c9a53a1805c88 covid:ann/target/f238753dd9eb6769ae2edd14124872a86d6f9bb7 covid:ann/target/0bdea2704371b504cfd30cce2f0b5281ceb9f62b covid:ann/target/28c69fa0d771fa1c8b926ba368a67f1c726e0107 covid:ann/target/7711fa45a7cf59052bd698ffb0c63355711a3607 covid:ann/target/cb80403758912ab958f20e0564e264ab2c1073b7 
covid:ann/target/d6a0aaf431d1dcb91c2b9840bf1b3c9520d61e8d covid:ann/target/fe3572b662e735eb406a787b2e5a1b2ad1fc0363 covid:ann/target/ba46be5de510f02b89e956f37f88bfd84713bf88 covid:ann/target/2b2436d50a562b9087bb8fb6a0fa2768d5930538 covid:ann/target/c397801f52c3fb1d5552d880e9bc2cc72e2f7abe covid:ann/target/1043907de0d918b71e94b11eeb2dce9a7c39cab8 covid:ann/target/e25ac04a5f64cdad7887bbece81f761086e8b099 covid:ann/target/25a94c88c6d26a3dd2d188bae60d045cdfda3aca covid:ann/target/f6af4a70654b3593ff799b3b27aa1287c7e9e2bd covid:ann/target/9f07ccbf29e009b76c95751576bdef4d6867b060 covid:ann/target/b8527670897b28f5b487f4acfe791af972b95219 covid:ann/target/1169d5e7dbf158f7f416bf9ed3f1e205285c782c covid:ann/target/cdd7f4a424d30aaf4d4ef82e962e9937e81efb50 covid:ann/target/2e9d45b8c23432913203af0f14e96a955e61cf9b covid:ann/target/bdd2ca15596eed94a5b3e46823233d15f2176549 covid:ann/target/60f9060578a93ccfcf2d8c5594f637fcab177404 covid:ann/target/6d1cd79e1a9e2612ace34b3b4a8e0902f25d146d covid:ann/target/ddfe95147881027395237ea96f52486da27ac11e covid:ann/target/9791e4e1a27f39a1d08cd3ff4e51f21f079b3880 covid:ann/target/bf9d6e0bdf23db742bbf93cd838884b514c31b15 covid:ann/target/6c1b43329c9ee1e62e99cc7114920e482d0def4b covid:ann/target/7c4184f533bcc35f2669afe17a2a18d85ac87f36
