Thomas F. Peterson Engineering Laboratory 
550 Panama Mall
Building 550, Room 200
Palo Alto, CA 94305

Driving Directions 


Kianna Riley 
Plan 365 Inc 

Stanford Summer Series 
Abstracts and Speaker Bios


Intel® Distribution for Python:
A Scalability Story in Production Environments

Tuesday, June 28, 2016


Python use continues to grow in many domains that require interactive prototyping. Quants develop trading algorithms, data scientists build analytics models, and researchers prototype numerical simulations. All too often, scaling the prototype code to production means a developer recoding the algorithm in a language such as C++ or Java. Rewriting takes time, reduces flexibility, and can lead to errors.

In this talk we will describe the tools, techniques, and optimizations that Intel brings to the Python developer community to address this challenge. We are developing high-performance libraries and profilers, and extending support for multi-core and SIMD parallelism across the Python toolchain, so that developers can achieve near-native performance in Python without rewriting.

Our case studies will show speedups of up to 100x and more from highly optimized libraries such as NumPy/SciPy, Intel® DAAL, and scikit-learn*, and how those gains scale across multiple cores and multiple nodes. We will also show how Intel® VTune™ Amplifier enables low-overhead profiling of Python and native code to identify performance hotspots, and demonstrate how tools such as Cython* and Numba* deliver near-native performance in numerically intensive applications.
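As a hypothetical illustration of the kind of rewrite the talk describes (not an example from the talk itself; the names and sizes are our own), a minimal sketch of replacing an interpreted Python loop with a single call into an optimized, BLAS-backed NumPy routine:

```python
# Illustrative sketch: an interpreted Python loop vs. one call into
# NumPy, which delegates to an optimized (e.g., MKL-backed) BLAS routine.
import numpy as np

def dot_pure_python(a, b):
    # One bytecode dispatch per element: this is what stays slow.
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_numpy(a, b):
    # The same reduction, executed in optimized native code.
    return float(np.dot(a, b))

a = np.arange(100_000, dtype=np.float64)
b = np.ones(100_000)
# Same mathematical result, up to floating-point summation order.
assert abs(dot_pure_python(a, b) - dot_numpy(a, b)) / dot_numpy(a, b) < 1e-9
```

The speedups quoted in the abstract come from many such substitutions, applied across whole workloads rather than a single kernel.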


Sergey Maidanov leads the team of software engineers working on the optimized Intel® Distribution for Python*. He has over 15 years of experience in numerical analysis, with contributions to Intel software products such as Intel MKL, Intel IPP, and the Intel compilers. Among his recently completed projects is the Intel® Data Analytics Acceleration Library. Sergey received a master's degree in Mathematics from the State University of Nizhny Novgorod, specializing in number theory, random number generation, and their application in financial math. He was a staff member of the International Center of Studies in Financial Institutions at the State University of Nizhny Novgorod.


3 Tuning Secrets for Better OpenMP Performance
Using VTune Amplifier XE

Tuesday, July 12, 2016

Hybrid programming models that use both OpenMP and MPI for efficient parallel scalability are becoming more complex. Adding to the complexity of software development, advances in hardware design, such as Intel® Xeon Phi™ processors with many cores, multiple vector processing units (VPUs) per core, and a fast MCDRAM option, offer excellent vector performance for HPC workloads. For serial performance, workload developers need to exploit all of a core's design features, including the floating-point and integer instruction sets (SSE, AVX2, AVX-512, ...), to reach the highest FLOPS and thus the lowest execution time. Learning the best compiler options for a particular workload, as well as the memory layout of the system (e.g., NUMA), is also important.

For parallel performance tuning, the scalability of OpenMP and MPI requires detailed OpenMP performance analysis and an MPI communication profile. An OpenMP analysis may cover load imbalance across OpenMP threads, lock and wait time, and thread synchronization; an MPI communication profile can help reduce the cost of communication. The Intel Parallel Studio suite includes a comprehensive set of performance tools that can be used effectively for these tasks. In particular, the powerful Intel VTune performance analyzer is well suited to deep-dive performance characterization of HPC workloads. In this presentation, we will cover the Intel VTune performance analyzer and give a hands-on demo of its use to study HPC workload performance.
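To make the load-imbalance idea concrete, a hypothetical sketch (our own simplification, not VTune's exact metric definition) of the quantity such an OpenMP analysis reports: the fraction of a parallel region's wall time lost because faster threads wait for the slowest one.

```python
# Illustrative sketch: threads finish a parallel region at different
# times, and the region's wall time is set by the slowest thread.
def imbalance(thread_times):
    """Fraction of the region's wall time lost to threads waiting."""
    wall = max(thread_times)                       # slowest thread ends region
    busy = sum(thread_times) / len(thread_times)   # average useful work
    return (wall - busy) / wall

# Four OpenMP threads, seconds of work each; thread 0 got twice the work.
print(imbalance([2.0, 1.0, 1.0, 1.0]))  # 0.375 -> 37.5% of time spent waiting
```

Rebalancing the decomposition (or using dynamic scheduling) drives this number toward zero, which is exactly the kind of opportunity the profiler surfaces.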


Thanh Phung is a senior HPC engineer at Intel, leading HPC workload performance characterization and tuning. He joined Intel in 1992, working for the Supercomputer Systems Division (SSD) as an on-site HPC scientist at NASA Ames and Caltech. From 1998 to 2000 he developed HPC tools for optical proximity correction (OPC) lithography. From 2000 to the present, he has worked in Intel SSG/DPD/TCAR, specializing in deep-dive HPC workload performance analysis using tools such as the Intel VTune performance analyzer and ITAC for message profiling, along with vectorization tuning using SIMD/AVX2/AVX-512, and OpenMP/MPI/hybrid programming and scalability. Thanh holds a Ph.D. in Chemical Engineering from Caltech (1992), with an emphasis in CFD.


Guided Code Vectorization with Intel® Advisor XE
Tuesday, July 19, 2016

In this talk we discuss the use of an optimization tool called Intel® Advisor. The discussion is illustrated with an example workload that computes the electric potential, at a set of points in 3-D space, produced by a group of charged particles. The example workload runs on a multi-core Intel Xeon processor with Intel AVX2 instructions.

The application was originally parallelized across cores, but otherwise neither optimized nor vectorized. We discuss three performance issues that Intel Advisor detected: a vector dependence, type conversion, and an inefficient memory access pattern. For each issue, we discuss how to interpret the data presented by Intel Advisor and how to optimize the application to resolve it. After optimization, we observed a 16x performance boost over the original, non-optimized implementation.
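A minimal sketch of the example workload described in the abstract, written in the vectorization-friendly style such a tool steers you toward. This is our own reconstruction, not the talk's code: the function names are assumptions, and we fold the Coulomb constant into the units so the potential is simply the sum of charge over distance.

```python
# Illustrative sketch: electric potential at m observation points from
# n charged particles, with the pairwise loop expressed as array
# operations so the hot inner loop maps onto vector (AVX2) instructions.
import numpy as np

def potential(points, charges, positions):
    """points: (m, 3); charges: (n,); positions: (n, 3) -> (m,) potentials."""
    # Broadcasting forms all m*n displacement vectors at once, replacing
    # the scalar loop a vectorization advisor would flag.
    diff = points[:, None, :] - positions[None, :, :]   # (m, n, 3)
    r = np.sqrt((diff * diff).sum(axis=-1))             # pairwise distances
    return (charges / r).sum(axis=1)                    # V_i = sum_j q_j / r_ij

# One unit charge at the origin, observed at distance 2: V = 1/2.
v = potential(np.array([[2.0, 0.0, 0.0]]),
              np.array([1.0]),
              np.array([[0.0, 0.0, 0.0]]))
print(v)  # [0.5]
```

Keeping the data in contiguous, uniformly typed arrays also avoids the type-conversion and memory-access-pattern issues the abstract mentions.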


Ryo Asai is a Researcher at Colfax International. He develops optimization methods for scientific applications targeting emerging parallel computing platforms, computing accelerators, and interconnect technologies. Ryo holds a B.A. in Physics from the University of California, Berkeley.


Building Faster Machine Learning Applications
with Intel Performance Libraries

Tuesday, July 26, 2016

The future of many industries, as well as many aspects of our lives, is being shaped by machine learning and related technologies. Intel software technologies are being used to enable solutions in these areas. This talk focuses on two Intel performance libraries, MKL and DAAL, which offer optimized building blocks for data analytics and machine learning algorithms.

MKL is a collection of routines for linear algebra, FFT, vector math, and statistics. It is used to speed up math processing in almost every kind of technical computing application. DAAL is more focused on data applications and provides higher-level, canned solutions for supervised and unsupervised learning. This session is an overview of the capabilities and performance advantages of these libraries in the context of machine learning and deep learning.
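As a hypothetical illustration of the "building blocks" level the talk contrasts with canned algorithms (our own example, not from the session), a supervised model, ordinary least squares, assembled from one LAPACK-backed linear-algebra call, the kind of routine MKL accelerates:

```python
# Illustrative sketch: fitting a linear regression with a single
# LAPACK-backed least-squares solve instead of a prepackaged estimator.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))       # 100 samples, 2 features
true_w = np.array([3.0, -1.5])
y = X @ true_w                      # noiseless targets, for a clean check

# One call into optimized linear algebra recovers the model weights.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 6))  # approximately [ 3.  -1.5]
```

A DAAL-style canned solution wraps this same numerical core in a higher-level training/prediction interface, which is the trade-off the session walks through.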


Zhang Zhang is a Technical Consulting Engineer with the Software and Services Group at Intel. He provides technical support for Intel performance libraries, including MKL, DAAL, and IPP. He helps customers adopt Intel software tools and enjoys troubleshooting performance and usage problems in users' code. Zhang comes from a background of high-performance and parallel programming, cluster and distributed computing, and performance modeling and analysis. He holds a Ph.D. in Computer Science from Michigan Technological University.

Shaojuan Zhu is a Technical Consulting Engineer at Intel supporting Intel performance libraries: DAAL, IPP and MKL. She has ten years of experience developing and supporting media products. Her expertise and interests include biologically inspired intelligent signal processing, machine learning and media. She holds a Ph.D. in Electrical and Computer Engineering from Oregon Health and Science University.