About



My name is Yang Song (宋飏, Sòng Yáng), and I am a second-year Ph.D. student in the Computer Science Department at Stanford University. My advisor is Stefano Ermon. Prior to joining Stanford, I obtained my Bachelor's degree in Mathematics and Physics from Tsinghua University. Since my undergraduate studies, I have been extremely fortunate to work with Jun Zhu, Raquel Urtasun, Richard Zemel, and Alexander Schwing.

I am generally interested in machine learning theory and applications, especially in generative models, reinforcement learning and AI safety. You can contact me at A@B, where A = yangsong and B = cs.stanford.edu.

[My Google Scholar profile]

Experience



2016-present

Computer Science Department, Stanford University, California, USA

  • Ph.D. student in Computer Science.

Jun. - Sep. 2017

Machine Intelligence and Perception Group, Microsoft Research, Cambridge, UK

  • Research internship advised by Dr. Nate Kushman.

Aug. 2012 - Aug. 2016

Department of Physics, Tsinghua University, Beijing, China

  • B.S. in Mathematics and Physics.
  • Research assistant in Prof. Jun Zhu's group.

Jul. - Sep. 2015

Machine Learning Group, Department of Computer Science, University of Toronto, Toronto, Canada

  • Research internship advised by Prof. Raquel Urtasun and Prof. Richard Zemel.

Jul. 2014

Melbourne Graduate School of Science, University of Melbourne, Melbourne, Australia

  • A special summer camp for the interdisciplinary study of Mathematics, Physics, and Chemistry.

Publications



Accelerating Natural Gradient with Higher-Order Invariance

Yang Song, Stefano Ermon

35th International Conference on Machine Learning, Stockholm, Sweden. (ICML 2018)

An appealing property of the natural gradient is that it is invariant to arbitrary differentiable reparameterizations of the model. However, this invariance property requires infinitesimal steps and is lost in practical implementations with small but finite step sizes. In this paper, we study invariance properties from a combined perspective of Riemannian geometry and numerical differential equation solving. We define the order of invariance of a numerical method to be its convergence order to an invariant solution. We propose to use higher-order integrators and corrections based on geodesics to obtain more invariant optimization trajectories. We prove the numerical convergence properties of geodesic corrected updates and show that they can be as computationally efficient as plain natural gradient. Experimentally, we demonstrate that invariance leads to faster training and our techniques improve on traditional natural gradient in optimizing synthetic objectives as well as deep classifiers and autoencoders.
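
A rough sketch of the core update (notation is illustrative; the exact geodesic correction and its cheaper approximation are derived in the paper):

```latex
% Plain natural gradient: a forward-Euler discretization of the natural gradient flow,
% where F(\theta) is the Fisher information matrix and h the step size.
\theta_{t+1} = \theta_t - h\,F(\theta_t)^{-1}\nabla_\theta \mathcal{L}(\theta_t)

% The continuous flow is invariant to reparameterization, but this discretization is not.
% A geodesic correction adds a second-order term built from the Christoffel symbols
% \Gamma^{k}_{ij} of the Fisher metric, following the geodesic rather than a straight line:
\theta_{t+1}^{k} = \theta_t^{k} + h\,\dot\theta_t^{k}
  - \tfrac{h^2}{2}\,\Gamma^{k}_{ij}(\theta_t)\,\dot\theta_t^{i}\,\dot\theta_t^{j},
\qquad \dot\theta_t = -F(\theta_t)^{-1}\nabla_\theta \mathcal{L}(\theta_t)
```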

PixelDefend: Leveraging Generative Models to Understand and Defend against Adversarial Examples

Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, Nate Kushman

6th International Conference on Learning Representations, Vancouver, Canada. (ICLR 2018)

Adversarial perturbations of normal images are usually imperceptible to humans, but they can seriously confuse state-of-the-art machine learning models. What makes them so special in the eyes of image classifiers? In this paper, we show empirically that adversarial examples mainly lie in the low probability regions of the training distribution, regardless of attack types and targeted models. Using statistical hypothesis testing, we find that modern neural density models are surprisingly good at detecting imperceptible image perturbations. Based on this discovery, we devise PixelDefend, a new approach that purifies a maliciously perturbed image by moving it back towards the distribution seen in the training data. The purified image is then run through an unmodified classifier, making our method agnostic to both the classifier and the attacking method. As a result, PixelDefend can be used to protect already deployed models and be combined with other model-specific defenses. Experiments show that our method greatly improves resilience across a wide variety of state-of-the-art attacking methods, increasing accuracy on the strongest attack from 63% to 84% for Fashion MNIST and from 32% to 70% for CIFAR-10.
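
A minimal sketch of the greedy purification step, assuming a pretrained PixelCNN exposed through a hypothetical pixelcnn_logits interface (illustrative pseudocode, not the released implementation):

```python
import numpy as np

def purify(x, pixelcnn_logits, eps=16):
    """Greedily purify an image by moving each pixel toward high-likelihood values
    under a PixelCNN, while staying within an L-infinity ball of radius eps around
    the (possibly adversarial) input. Sketch only.

    x: uint8 image of shape (H, W, C).
    pixelcnn_logits(x, i, j, c): hypothetical helper returning the 256 logits of the
        PixelCNN conditional for channel c of pixel (i, j), given the current image.
    eps: defense radius in pixel-value units.
    """
    x = x.copy()
    H, W, C = x.shape
    for i in range(H):              # raster-scan order matches the
        for j in range(W):          # autoregressive factorization of the PixelCNN
            for c in range(C):
                logits = pixelcnn_logits(x, i, j, c)      # shape (256,)
                lo = max(0, int(x[i, j, c]) - eps)
                hi = min(255, int(x[i, j, c]) + eps)
                x[i, j, c] = lo + int(np.argmax(logits[lo:hi + 1]))
    return x
```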

Kernel Bayesian Inference with Posterior Regularization

Yang Song, Jun Zhu, Yong Ren

30th Annual Conference on Neural Information Processing Systems, Barcelona, Spain. (NIPS 2016)

We propose a vector-valued regression problem whose solution is equivalent to the reproducing kernel Hilbert space (RKHS) embedding of the Bayesian posterior distribution. This equivalence provides a new understanding of kernel Bayesian inference. Moreover, the optimization problem induces a new regularization for the posterior embedding estimator, which is faster and has comparable performance to the squared regularization in kernel Bayes' rule. This regularization coincides with a previous thresholding approach used in kernel POMDPs whose consistency remains to be established. Our theoretical work solves this open problem and provides consistency analysis in regression settings. Based on our optimization formulation, we propose a flexible Bayesian posterior regularization framework which for the first time enables us to put regularization at the distribution level. We apply this method to nonparametric state-space filtering tasks with extremely nonlinear dynamics and show performance gains over all other baselines.
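
For background only (a standard estimator, not the contribution of the paper), the RKHS embedding of a posterior or conditional distribution can be estimated from samples {(x_i, y_i)} as follows; the paper recasts estimating such embeddings as vector-valued regression, which admits alternative regularizers:

```latex
% Conditional (posterior) mean embedding and its standard sample estimator:
\mu_{Y\mid x} = \mathbb{E}\!\left[\varphi(Y)\mid X = x\right]
  \;\approx\; \Phi_Y\,(K_X + n\lambda I)^{-1}\,\mathbf{k}_X(x),
% with feature matrix \Phi_Y = [\varphi(y_1),\dots,\varphi(y_n)],
% Gram matrix (K_X)_{ij} = k(x_i, x_j), and
% \mathbf{k}_X(x) = [k(x_1, x),\dots,k(x_n, x)]^\top.
```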

Stochastic Gradient Geodesic MCMC Methods

Chang Liu, Jun Zhu, Yang Song

30th Annual Conference on Neural Information Processing Systems, Barcelona, Spain. (NIPS 2016)

We propose two stochastic gradient MCMC methods for sampling from Bayesian posterior distributions defined on Riemann manifolds with a known geodesic flow, e.g. hyperspheres. Our methods are the first scalable sampling methods on these manifolds, with the aid of stochastic gradients. Novel dynamics are conceived and second-order integrators are developed. By adopting embedding techniques and the geodesic integrator, the methods do not require a global coordinate system of the manifold and do not involve inner iterations. Synthetic experiments show the validity of the methods, and their application to the challenging inference for spherical topic models indicates practical usability and efficiency.
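
As one concrete building block, assuming the manifold is the unit hypersphere, the geodesic flow used in place of a straight-line position update has a closed form (the full samplers additionally interleave stochastic-gradient, friction, and noise updates; this sketch shows only the geodesic substep):

```python
import numpy as np

def sphere_geodesic_flow(x, v, t):
    """Exact geodesic flow on the unit hypersphere.

    x: point on the sphere, ||x|| = 1.
    v: velocity in the tangent space at x (so x @ v == 0).
    t: flow time.
    Returns the point and velocity after following the great circle for time t.
    """
    speed = np.linalg.norm(v)
    if speed < 1e-12:
        return x, v
    u = v / speed
    x_new = x * np.cos(speed * t) + u * np.sin(speed * t)
    v_new = (-x * np.sin(speed * t) + u * np.cos(speed * t)) * speed
    return x_new, v_new
```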

Training Deep Neural Networks via Direct Loss Minimization

Yang Song, Alexander Schwing, Richard Zemel, Raquel Urtasun

33rd International Conference on Machine Learning, New York City, USA. (ICML 2016)

Supervised training of deep neural nets typically relies on minimizing cross-entropy. However, in many domains, we are interested in performing well on metrics specific to the application. In this paper we propose a direct loss minimization approach to train deep neural networks, which provably minimizes the application-specific loss function. This is often non-trivial, since these functions are neither smooth nor decomposable and thus are not amenable to optimization with standard gradient-based methods. We demonstrate the effectiveness of our approach in the context of maximizing average precision for ranking problems. Towards this goal, we develop a novel dynamic programming algorithm that can efficiently compute the weight updates. Our approach proves superior to a variety of baselines in the context of action classification and object detection, especially in the presence of label noise.
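
A sketch of the direct loss gradient that this approach builds on, stated informally (see the paper for the precise conditions and the negative-perturbation variant):

```latex
% With scoring function F(x, y; w), prediction y_w = \arg\max_y F(x, y; w), and
% task loss L(y_{gt}, y), the (positive-perturbation) direct loss gradient is
\nabla_w\, \mathbb{E}\!\left[L(y_{gt}, y_w)\right]
  = \lim_{\epsilon \to 0} \frac{1}{\epsilon}\,
    \mathbb{E}\!\left[\nabla_w F(x, y_{direct}; w) - \nabla_w F(x, y_w; w)\right],
\qquad
y_{direct} = \arg\max_y \left\{ F(x, y; w) + \epsilon\, L(y_{gt}, y) \right\}.
% The dynamic program in the paper makes the loss-augmented argmax tractable
% when L is average precision over a ranking.
```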

Bayesian Matrix Completion via Adaptive Relaxed Spectral Regularization

Yang Song, Jun Zhu

30th AAAI Conference on Artificial Intelligence, Phoenix, USA. (AAAI 2016)

Bayesian matrix completion has been studied based on a low-rank matrix factorization formulation with promising results. However, little work has been done on Bayesian matrix completion based on the more direct spectral regularization formulation. We fill this gap by presenting a novel Bayesian matrix completion method based on spectral regularization. In order to circumvent the difficulties of dealing with the orthonormality constraints of singular vectors, we derive a new equivalent form with relaxed constraints, which then leads us to design an adaptive version of spectral regularization feasible for Bayesian inference. Our Bayesian method requires no parameter tuning and can infer the number of latent factors automatically. Experiments on synthetic and real datasets demonstrate encouraging results on rank recovery and collaborative filtering, with notably good results for very sparse matrices.
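
For reference, the spectral regularization formulation that the method builds on is the standard nuclear-norm objective (notation illustrative); the paper derives an adaptive, relaxed version of this penalty that is amenable to Bayesian inference:

```latex
% Recover X from entries of Y observed on the index set \Omega:
\min_{X} \;\tfrac{1}{2}\,\bigl\| P_\Omega(Y - X) \bigr\|_F^2 \;+\; \lambda\,\| X \|_*,
\qquad \| X \|_* = \sum_i \sigma_i(X),
% where P_\Omega zeroes out unobserved entries and \|X\|_* (the nuclear norm,
% the sum of singular values) is the convex surrogate for rank.
```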