Performance Profiling

We can profile your software to determine exactly where the most time is spent and where you should focus your efforts.  We’ve built custom tools to profile applications across a range of platforms and can plug these tools into your software and platform to get a bird’s-eye view of bottlenecks.
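
As a generic illustration of the idea (not one of the custom tools mentioned above), a minimal profiling sketch using Python's standard `cProfile` and `pstats` modules might look like the following. The functions `cheap_step`, `expensive_step`, and `workload` are hypothetical stand-ins for real application code.

```python
import cProfile
import io
import pstats


def cheap_step(n):
    """Hypothetical lightweight stage."""
    return sum(range(n))


def expensive_step(n):
    """Hypothetical hotspot: deliberately quadratic work."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total


def workload():
    """Hypothetical application entry point that calls both stages."""
    cheap_step(1_000)
    expensive_step(300)


def profile_report(func, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(workload))
```

In a report like this, the function with the largest cumulative time is the first candidate for optimization.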

Algorithm Porting

Port your algorithm to a smaller, faster, cheaper, or more power efficient platform.  We can analyze your algorithm and suggest a better platform, or target a platform you’ve already chosen.  We’re architecture agnostic, so we’ll find what works best for your software and hardware target.

Software Optimization

We can analyze your algorithm or application to optimize it for power and performance on your existing platform.  We’ll find and exploit the parallelism of both your algorithm and your hardware.  We have experience working in many application domains, and can help your software work better for you.

Full Implementation

Implement your idea or algorithm quickly and efficiently.  We can start with a simple C model, MATLAB code, or just a mathematical description of your problem and implement it to your performance, power, and cost specifications.  We can also compare various platforms so you can choose what best fits your needs.

Acceleration for Any Platform

Whether it’s a multi-core CPU or a deeply embedded custom design, our expertise can help accelerate your code.

Mobile SoC

Mobile SoCs can be found in smartphones, virtual reality headsets, automobiles, drones, and more.  These SoCs are inherently complex heterogeneous architectures, often combining CPU, GPU, and DSP cores on a single piece of silicon.  Fully utilizing the performance of the SoC while controlling power consumption involves careful software design choices.  We have extensive experience working with all major SoC and semiconductor vendors to optimize for their platforms.


GPU

GPUs have become ubiquitous at all levels of computing.  They can be found in the largest supercomputers in the world, and in your smartphone.  GPU programming is one of our core competencies and we have one of the largest teams in the world with knowledge in CUDA, OpenCL, C++ AMP, and other GPU-targeted APIs.  We can quickly and efficiently port your code to the GPU platform of your choice, whether it be mobile, workstation, cloud, or cluster.

Embedded, DSPs, and FPGA

Power efficiency often demands more specialized hardware and software.  Many algorithms can take advantage of the benefits of these low power platforms, but programming for them quickly becomes a challenge.  We can help port your algorithm or application down to deeply embedded architectures, DSPs, or FPGAs.  We’ve worked directly with manufacturers of these platforms to gain a deep understanding of how to use them well.

Cloud & Cluster

Computing as a service has taken off but software hasn’t caught up yet. We’ve designed applications that can scale to hundreds or thousands of compute nodes and work efficiently in GPU or FPGA-enabled cloud services.  We work directly with cloud and cluster providers to improve their back-end software and with customers who have massive compute requirements to improve their scaling, reliability, and costs.

Multi-Core CPU

CPUs have gone almost exclusively multi-core, while millions of lines of software still remain single-threaded serial code.  We can take your code, find any inherent parallelism, and specialize it for ARM, Intel or AMD x86, MIPS, or IBM Power cores by taking full advantage of special SIMD instruction sets and intrinsics.  Our experience with all these architectures can help improve your performance and your power usage.

Learn more about how we can accelerate your code for these and other platforms.

We've accelerated code using all of these platforms, languages, and APIs, and more.

Nvidia CUDA Acceleration

OpenCL Acceleration

ARM Assembly and NEON Instructions

MIPS Assembly

MATLAB Code Porting and Acceleration

C++ AMP Acceleration


OpenGL Optimization

Xilinx FPGA

Qualcomm Snapdragon and Hexagon DSP

Cadence Tensilica DSP

x86 Assembly

Don’t see your hardware or software platform here?  It might not have made the list but our expertise can certainly be applied to your platform as well.

Nvidia CUDA

MulticoreWare has one of the largest and oldest CUDA-experienced teams in the industry. Our CTO, Dr. Wen-Mei Hwu, a professor at the University of Illinois at Urbana-Champaign, worked with NVIDIA Chief Scientist Dr. David Kirk to develop the first CUDA Center of Excellence in 2008.  Our COO, Curtis Davis, was a co-Founder of AGEIA and creator of the PhysX engine.  After AGEIA was sold to NVIDIA in 2008, Curtis became VP of PhysX, leading the largest CUDA development team in the world to port PhysX to run on NVIDIA GPUs.  Curtis’s team included Dr. Lihua Zhang, MulticoreWare’s VP of China Operations, and several key engineers who joined MulticoreWare after it was founded in 2009.

Khronos OpenCL

OpenCL is the primary competitor to CUDA and is supported by the largest number of platforms.  MulticoreWare implements and licenses an OpenCL platform called MxPA to multiple semiconductor companies, serving as the default OpenCL device on their platforms.

MulticoreWare is a contributing member of the Khronos Group and develops OpenCL tools. MulticoreWare has accelerated applications in a wide variety of domains including video and image processing, computer vision, Raster Image Processing (RIP), big data (Hadoop), and neural networks.

Microsoft C++ AMP

MulticoreWare works closely with Microsoft in support of extending the C++ AMP framework, enabling cross-platform support of this powerful heterogeneous computing architecture. MulticoreWare’s extensions provide the highest developer productivity combined with the highest possible performance and cross platform support.

MulticoreWare has developed Kalmar C++, a C++ compiler implementation. Kalmar is capable of taking a program conforming to the C++ AMP 1.2 standard and transforming it into HSAIL, SPIR binary, or OpenCL C. With Kalmar, developers can write accelerated code using C++ AMP, targeting Windows, Linux, or macOS systems running a variety of GPU architectures.

MATLAB Algorithm Porting & Acceleration

MATLAB’s built-in support for vector and matrix representations of data makes it well suited to GPU-accelerated platforms.  Many of MATLAB’s built-in functions are already GPU-enabled, but further acceleration can be achieved by modifying the code to use more native GPU functions, keeping the data in GPU memory, and avoiding data copies between the GPU and CPU.
A typical method is to rewrite performance-critical functions in C/C++ and CUDA and invoke them from MATLAB through the MEX interface and the CUDA MATLAB plug-in, so that they can be called as if they were built-in MATLAB functions.  Alternatively, the entire algorithm can be ported to another language.
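
To make the point about data copies concrete, here is a small Python sketch that only counts host-to-device and device-to-host transfers. `FakeDevice` is a hypothetical stand-in that performs no real GPU work, and the stage functions are arbitrary; the sketch compares a pipeline that round-trips data after every stage with one that keeps the data resident on the device.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU; it only counts host<->device copies."""

    def __init__(self):
        self.copies = 0

    def upload(self, host_data):
        self.copies += 1
        return list(host_data)

    def download(self, device_data):
        self.copies += 1
        return list(device_data)


def run_round_trip(device, data, stages):
    """Copy data to the device and back again around every stage."""
    for stage in stages:
        on_device = device.upload(data)
        on_device = [stage(x) for x in on_device]
        data = device.download(on_device)
    return data


def run_resident(device, data, stages):
    """Upload once, run every stage on device-resident data, download once."""
    on_device = device.upload(data)
    for stage in stages:
        on_device = [stage(x) for x in on_device]
    return device.download(on_device)


if __name__ == "__main__":
    stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    naive, resident = FakeDevice(), FakeDevice()
    print(run_round_trip(naive, [1, 2, 3], stages), naive.copies)
    print(run_resident(resident, [1, 2, 3], stages), resident.copies)
```

Both versions produce the same result, but the round-trip version performs two copies per stage while the resident version performs two copies in total.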

Case Studies

Case Study 1: MEX and CUDA Interface

The MulticoreWare team helped a defense industry client achieve about 4X overall acceleration across an entire application. MATLAB was retained as the driver program, and as a first level of optimization MulticoreWare converted MATLAB function calls to MATLAB’s native GPU functions. We then identified hot spots where successive functions could operate on GPU data pointers without copying data back and forth between the CPU and GPU. MulticoreWare also wrote custom CUDA kernels and integrated them with the MATLAB framework using the MEX interface.

Case Study 2: Image-Processing Pipeline

The MulticoreWare team converted stages of an existing image processing pipeline from MATLAB to C/C++ and CUDA, and wrote other stages from the client’s specifications and documents. An input data stream was captured using high-speed cameras and processed frame by frame. The original application took 10 minutes to process one frame of data. The C/C++ pipeline created by MulticoreWare cut the processing time to one minute, a 10X speedup over the MATLAB implementation. Further optimizations enabled the data to remain entirely on the GPU between stages of the pipeline. The final per-frame processing time was less than 1 second, a 600X acceleration compared to the original MATLAB pipeline.
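
The speedup figures follow directly from the per-frame times quoted above. A short Python check, treating 1 second as an upper bound on the final per-frame time (so 600X is a lower bound on the final speedup):

```python
def speedup(baseline_seconds, optimized_seconds):
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds


matlab_seconds = 10 * 60      # original pipeline: 10 minutes per frame
cpp_seconds = 1 * 60          # C/C++ pipeline: one minute per frame
gpu_seconds_upper_bound = 1   # GPU-resident pipeline: under 1 second per frame

print(speedup(matlab_seconds, cpp_seconds))              # 10.0
print(speedup(matlab_seconds, gpu_seconds_upper_bound))  # 600.0
```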

Case Study 3: Medical X‑Ray Processing

The client approached us with a medical X-ray image processing and combination task in which multiple partial images with different focal planes required filtering and recombination.  MulticoreWare analyzed the system, which included an FPGA-controlled phased array, a pipeline to move data to a workstation, and the original MATLAB and C code, and recommended software and hardware specifications to allow real-time processing.  Compute-heavy sections of the code were rewritten as CUDA kernels and targeted to a 9-GPU system that met all the customer’s specifications in terms of power, cost, accuracy, and reliability.

FPGA Design and Development

FPGAs can offer the advantages of high performance, reconfigurability and fast development.  We’ve ported our own machine learning, computer vision and video processing libraries to FPGAs and can do the same for your applications or algorithms. MulticoreWare offers full development and design services for FPGA development.  If your FPGA supports OpenCL, we can leverage its capabilities to develop quickly and efficiently, bypassing the many man-months of effort that are typically required to enable FPGA applications. We also have RTL experts for performance-critical applications and tuning.

Xilinx Alliance Partner

MulticoreWare is an SDAccel™ development environment-certified Xilinx Alliance Member. The Xilinx Alliance Program is a worldwide ecosystem of qualified companies collaborating with Xilinx to further the development of All Programmable technologies. As a member of this alliance, MulticoreWare offers design services for Xilinx FPGAs using the SDAccel™ environment. The SDAccel™ development environment is a member of the Xilinx SDx™ family that combines the industry’s first architecturally optimizing compiler supporting any combination of OpenCL™, C, and C++ kernels, along with libraries and development boards for the first complete CPU/GPU-like development and run-time experience for FPGAs.

Open-source Projects

We’ve provided our code acceleration services for numerous application domains.  Here are just a few of the open-source projects we’ve contributed to.  Contact us about your code or project, proprietary or open-source, and we’ll lend our expertise to the problem.


MulticoreWare was the primary contributor to the OpenCL implementation of OpenCV. This includes over 600 kernels such as face detection, Canny edge detection, Haar and Hough transforms, and PyrLK/TV-L1 optical flow. We have also implemented a GPU-accelerated face detection plug-in, based on OpenCV, for the IrfanView image viewer.


HandBrake is a popular video transcoding application that provides transcoding to H.264 and H.265 using the x264 and x265 encoders. MulticoreWare accelerated image scaling and H.264 encoding in x264, and MulticoreWare’s x265 HEVC encoder is integrated into HandBrake as its core H.265 implementation.

Apache Hadoop

MulticoreWare implemented the Map-phase sort optimization and the Reduce-phase merge sort optimization on the GPU. Matrix multiplication was then implemented on top of this work and compared across OpenCL, JOCL, and Aparapi. We analyzed I/O, checksum, compression, and scheduling, compared JNI with JOCL, and implemented CRC checksum acceleration, Mahout K-Means acceleration, and new compressor logic.
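
For readers unfamiliar with the two phases, the following CPU-only Python sketch shows the general shape of the work: independent runs are sorted on the map side, and the sorted runs are combined with a k-way merge on the reduce side. It is an illustration under assumed names (`sort_runs`, `merge_runs`), not the GPU implementation described above.

```python
import heapq


def sort_runs(records, run_size):
    """Map-side step: split records into runs and sort each run independently."""
    return [
        sorted(records[i:i + run_size])
        for i in range(0, len(records), run_size)
    ]


def merge_runs(runs):
    """Reduce-side step: k-way merge of runs that are already sorted."""
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    data = [9, 4, 7, 1, 8, 2, 6, 3, 5]
    runs = sort_runs(data, run_size=3)
    print(runs)
    print(merge_runs(runs))
```

Because each run is sorted independently of the others, the map-side step is a natural candidate for parallel execution.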

VLC Media Player

VideoLAN (VLC) is the world’s most downloaded free and open-source cross-platform multimedia player and framework, and plays most types of multimedia files out of the box. MulticoreWare performed OpenCL optimization on VLC, achieving speedups of 10x to 18x on integrated GPU platforms.


7-Zip is a leading open-source data compression program that uses the LZMA and LZMA2 algorithms to compress and decompress files of various formats. MulticoreWare implemented a pipeline equivalent to the stock 7-Zip LZMA pipeline using Suffix Array (SA), Longest Common Prefix (LCP), and Range Minimum Query (RMQ) algorithms in OpenCL. MulticoreWare also implemented a 7-Zip plug-in using this OpenCL pipeline.
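
The OpenCL pipeline itself is not reproduced here. As a small illustration of two of the building blocks named above, the following CPU-only Python sketch builds a suffix array by naive sorting and computes the LCP array with Kasai's algorithm. Adjacent suffixes with a large LCP value mark repeated substrings, which is the kind of match an LZ-style compressor looks for. All function names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array: start indices of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Kasai's algorithm: lcp[k] is the common prefix length of the
    suffixes at sa[k - 1] and sa[k]; lcp[0] is 0."""
    n = len(text)
    rank = [0] * n
    for position, start in enumerate(sa):
        rank[start] = position
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def longest_repeat(text):
    """Return the longest substring that occurs at least twice."""
    if not text:
        return ""
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    k = max(range(len(text)), key=lambda idx: lcp[idx])
    return text[sa[k]:sa[k] + lcp[k]]


if __name__ == "__main__":
    print(suffix_array("banana"))
    print(lcp_array("banana", suffix_array("banana")))
    print(longest_repeat("banana"))
```

The naive construction is chosen for readability; a production implementation would use a faster suffix array construction algorithm.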


GIMP is the GNU Image Manipulation Program and a powerful Photoshop alternative, from which other well-known Linux graphics software and GTK+ are derived. MulticoreWare has parallelized GIMP’s GEGL and BABL image processing libraries with OpenCL to improve performance.  GEGL was modified to perform pixel color space conversion in parallel using GPU kernels, instead of serially using BABL.


NumPy is a numerical math package for Python. NumPy is also used in many open-source projects as a library or as reference source code. MulticoreWare implemented NumPy functions in OpenCL, providing acceleration for cases where a large amount of data has to be processed, such as matrix multiplication.

ImageMagick

ImageMagick is an open source image manipulation library and contains filters and enhancers.  ImageMagick is used in many open-source projects for image processing tasks. MulticoreWare implemented a number of ImageMagick’s image processing filters in OpenCL to enable faster processing.

x264/H.264 Encoding

x264 is the world’s leading H.264 video encoder library, used by many of the top video streaming services, social media sites, and video encoder solution providers. MulticoreWare accelerated x264’s lookahead pre-encode thread with OpenCL.  MulticoreWare has also developed x265, the leading H.265 video encoder, by porting and adapting many of x264’s compression algorithms, and is porting improvements back from x265 to x264.

Google RenderScript

RenderScript is Google’s compute language, available on all Android phones since Jelly Bean. MulticoreWare has developed hardware-accelerated implementations of RenderScript for ARM Mali GPUs and works closely with Google in support of the RenderScript heterogeneous computing framework on Android.  We also worked with Google and Imagination to accelerate VP9 decoding on their platforms.


Snort® is an open-source network Intrusion Prevention and Detection System (IDS/IPS) developed by Sourcefire. Combining the benefits of signature, protocol, and anomaly-based inspection, Snort is the most widely deployed IDS/IPS technology worldwide. MulticoreWare modified the configuration file to add an option for selecting OpenCL acceleration. We implemented the deep packet inspection kernel using a circular buffer, which shows promising results when multiple packets are batched together for the GPU kernel to process.
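
The batching idea can be sketched in a few lines of Python. `PacketRing` below is a hypothetical fixed-capacity circular buffer, and `inspect_batch` is a simple CPU stand-in for the inspection kernel; neither reflects the actual Snort or OpenCL code.

```python
class PacketRing:
    """Hypothetical fixed-capacity circular buffer, for illustration only."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0   # index of the oldest queued packet
        self._count = 0  # number of packets currently queued

    def push(self, packet):
        """Queue a packet; return False if the ring is full."""
        if self._count == len(self._slots):
            return False
        tail = (self._head + self._count) % len(self._slots)
        self._slots[tail] = packet
        self._count += 1
        return True

    def pop_batch(self, max_batch):
        """Remove and return up to max_batch packets in arrival order."""
        size = min(max_batch, self._count)
        batch = []
        for _ in range(size):
            batch.append(self._slots[self._head])
            self._slots[self._head] = None
            self._head = (self._head + 1) % len(self._slots)
        self._count -= size
        return batch


def inspect_batch(batch, signature):
    """Stand-in for the GPU kernel: flag packets containing the signature."""
    return [signature in packet for packet in batch]


if __name__ == "__main__":
    ring = PacketRing(capacity=4)
    for payload in (b"GET /index", b"attack!", b"hello"):
        ring.push(payload)
    print(inspect_batch(ring.pop_batch(8), b"attack"))
```

Handing packets over in batches means one kernel launch can be amortized over many packets instead of being paid per packet.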


Memcached is a database caching mechanism commonly used as a significant optimization infrastructure by companies such as Google and Facebook, which respond to a flood of internet-driven database queries. MulticoreWare implemented a Memcached benchmark tool, several performance optimizations in the Memcached pipeline, and optimization for use of multiple UDP reply ports. The hash lookup of Memcached was implemented in OpenCL using a circular buffer technique to reduce the latency of the “get” function.


FFmpeg is a leading multimedia framework, able to encode, decode, transcode, mux, demux, stream, filter, and play most video formats a user is likely to encounter. MulticoreWare integrated AMD hardware decoders (DXVA2 and OpenDecode), OpenCL-accelerated filters, and the AMD hardware encoder (OpenEncode) into the FFmpeg pipeline. Many of the filters used in the FFmpeg pipeline were optimized using OpenCL. Interop functionality was implemented between DXVA and OpenCL so that images stay on the GPU throughout the entire pipeline. MulticoreWare’s x265 HEVC encoder is also integrated into FFmpeg.


libjpeg-turbo is a SIMD-enabled library for JPEG compression and decompression, and versions of it are used in Chrome and Firefox.  MulticoreWare worked on parallel progressive-mode JPEG decoding and on a parallel implementation of the Huffman decoding algorithm.


Crypto++ (also known as CryptoPP) is a free and open-source C++ class library of cryptographic algorithms and schemes. It has been widely used by academia, open-source and non-commercial projects, and businesses. It is estimated that over 200 projects use Crypto++. Because some common modes have relatively low context dependency, we were able to improve its performance significantly with parallel computing. The original pipeline was optimized to use both CPU capabilities, such as AES-NI, and OpenCL GPU processing to raise overall throughput.

This is just a sample of what we've worked on. We help companies of all sizes tackle their software performance challenges, in both open-source and proprietary applications.