Michele Covell
Senior Staff Research Scientist
I received my BS in electrical engineering from the University of Michigan and my MS and PhD from MIT in signal processing. I joined SRI International, working in the area of active acoustic-noise control, and then Interval Research Corporation, where my research covered a wide range of topics in audio, image, and video processing. In 2000, I joined YesVideo and worked on faster-than-real-time video analysis. I then moved to the Mobile Streaming Media group in HP Labs, where I was a key contributor to streaming-video services on 3G telephony networks. This work is listed as one of the top-40 accomplishments from HP Labs' 40-year history.
We propose a method for lossy image compression based on recurrent, convolutional neural networks that outperforms BPG (4:2:0), WebP, JPEG2000, and JPEG as measured by MS-SSIM. We introduce three improvements over previous research that lead to this state-of-the-art result. First, we show that training with a pixel-wise loss weighted by SSIM increases reconstruction quality according to several metrics. Second, we modify the recurrent architecture to improve spatial diffusion, which allows the network to more effectively capture and propagate image information through the network's hidden state. Finally, in addition to lossless entropy coding, we use a spatially adaptive bit allocation algorithm to more efficiently use the limited number of bits to encode visually complex image regions. We evaluate our method on the Kodak and Tecnick image sets and compare against standard codecs as well as recently published methods based on deep neural networks.
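As a rough illustration of the first improvement, a pixel-wise loss can be weighted by a local SSIM map so that training emphasizes regions where the metric scores poorly. The sketch below uses a uniform-filter SSIM approximation in NumPy/SciPy; the window size, constants, and weighting scheme are illustrative assumptions, not the trained system's exact loss.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim_map(x, y, win=7, c1=0.01**2, c2=0.03**2):
    """Local SSIM map for grayscale float images in [0, 1]."""
    mu_x = uniform_filter(x, win)
    mu_y = uniform_filter(y, win)
    var_x = uniform_filter(x * x, win) - mu_x**2
    var_y = uniform_filter(y * y, win) - mu_y**2
    cov = uniform_filter(x * y, win) - mu_x * mu_y
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
           (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))

def ssim_weighted_l1(original, reconstruction):
    """Pixel-wise L1 loss, up-weighted where local SSIM is poor."""
    w = 1.0 - ssim_map(original, reconstruction)  # high where quality is low
    return np.mean(w * np.abs(original - reconstruction))
```

A perfect reconstruction gives zero loss, while regions with structural damage contribute more than regions with the same L1 error but intact structure.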
Deep neural networks represent a powerful class of function approximators that can learn to compress and reconstruct images. Existing image compression algorithms based on neural networks learn quantized representations with a constant spatial bit rate across each image. While entropy coding introduces some spatial variation, traditional codecs have benefited significantly by explicitly adapting the bit rate based on local image complexity and visual saliency. This paper introduces an algorithm that combines deep neural networks with quality-sensitive bit rate adaptation using a tiled network. We demonstrate the importance of spatial context prediction and show improved quantitative (PSNR) and qualitative (subjective rater assessment) results compared to a non-adaptive baseline and a recently published image compression model based on fully-convolutional neural networks.
This paper presents a set of full-resolution lossy image compression methods based on neural networks. Each of the architectures we describe can provide variable compression rates during deployment without requiring retraining of the network: each network need only be trained once. All of our architectures consist of a recurrent neural network (RNN)-based encoder and decoder, a binarizer, and a neural network for entropy coding. We compare RNN types (LSTM, associative LSTM) and introduce a new hybrid of GRU and ResNet. We also study "one-shot" versus additive reconstruction architectures and introduce a new scaled-additive framework. We compare to previous work, showing improvements of 4.3%-8.8% AUC (area under the rate-distortion curve), depending on the perceptual metric used. As far as we know, this is the first neural network architecture that is able to outperform JPEG at image compression across most bitrates on the rate-distortion curve on the Kodak dataset images, with and without the aid of entropy coding.
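The additive-reconstruction idea can be sketched in a few lines: each iteration encodes the current residual, and the decoded estimates are summed, so sending more iterations' bits monotonically refines the image. The `binarize` step and the scalar gain below are toy stand-ins for the learned encoder, binarizer, and decoder networks.

```python
import numpy as np

def binarize(code):
    """Sign quantization; a stand-in for the learned binarizer."""
    return np.where(code >= 0, 1.0, -1.0)

def progressive_compress(image, steps=8, gain=0.5):
    """Additive-reconstruction sketch: each pass encodes the current
    residual, and the decoded estimates accumulate into the result."""
    reconstruction = np.zeros_like(image)
    bits = []
    for _ in range(steps):
        residual = image - reconstruction
        code = binarize(residual)              # the "bits" for this pass
        bits.append(code)
        step = gain * np.abs(residual).mean()  # toy decoder output scale
        reconstruction = reconstruction + step * code
    return reconstruction, bits
```

Truncating `bits` after any number of passes still yields a usable (coarser) reconstruction, which is the progressive property exploited by these architectures.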
A large fraction of Internet traffic is now driven by requests from mobile devices with relatively small screens and often stringent bandwidth requirements. Due to these factors, it has become the norm for modern graphics-heavy websites to transmit low-resolution, low-bytecount image previews (thumbnails) as part of the initial page load process to improve apparent page responsiveness. Increasing thumbnail compression beyond the capabilities of existing codecs is therefore a current research focus, as any byte savings will significantly enhance the experience of mobile device users. Toward this end, we propose a general framework for variable-rate image compression and a novel architecture based on convolutional and deconvolutional LSTM recurrent networks. Our models address the main issues that have prevented autoencoder neural networks from competing with existing image compression algorithms: (1) our networks only need to be trained once (not per-image), regardless of input image dimensions and the desired compression rate; (2) our networks are progressive, meaning that the more bits are sent, the more accurate the image reconstruction; and (3) the proposed architecture is at least as efficient as a standard purpose-trained autoencoder for a given number of bits. On a large-scale benchmark of 32 x 32 thumbnails, our LSTM-based approaches provide better visual quality than (headerless) JPEG, JPEG2000 and WebP, with a storage size that is reduced by 10% or more.
Like all modern internet-based video services, YouTube employs adaptive bitrate (ABR) streaming. Due to the computational expense of transcoding, the goal is to achieve a target bitrate for each ABR segment, without requiring multi-pass encoding. We extend the content-dependent model equation between bitrate and frame rate [Ma2012] to include CRF and frame size. We then attempt to estimate the content-dependent parameters used in the model equation, using simple summary features taken from the video segment and a novel neural-network layout. We show that we can estimate the correct quality-control parameter on 65% of our test cases without using a previous transcode of the video segment. If there is a previous transcode of the same segment available (using an inexpensive configuration), we increase our accuracy to 80%.
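Although the exact extended model equation is not reproduced here, the general shape of such a fit can be illustrated with a hypothetical log-linear (power-law) relation between bitrate, CRF, and frame size, solved by ordinary least squares in the log domain:

```python
import numpy as np

def fit_rate_model(crf, frame_area, bitrate):
    """Least-squares fit of log R = c0 + c1*log(crf) + c2*log(area).
    A hypothetical power-law stand-in for the content-dependent model;
    the actual equation extends [Ma2012] and is not reproduced here."""
    X = np.column_stack([np.ones_like(crf), np.log(crf), np.log(frame_area)])
    coef, *_ = np.linalg.lstsq(X, np.log(bitrate), rcond=None)
    return coef

def predict_bitrate(coef, crf, frame_area):
    """Evaluate the fitted power-law model."""
    return np.exp(coef[0] + coef[1] * np.log(crf) + coef[2] * np.log(frame_area))
```

In the paper's setting, the content-dependent coefficients are estimated from segment summary features by a neural network rather than from prior transcodes of the same segment.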
Real-time, responsive optimization of traffic flow serves to address important practical problems: reducing drivers' wasted time and improving city-wide efficiency, as well as reducing gas emissions and improving air quality. Much of the current research in traffic-light optimization relies on extending the capabilities of basic traffic lights to either communicate with each other or communicate with vehicles. However, before such capabilities become ubiquitous, opportunities exist to improve traffic lights by being more responsive to current traffic situations within the existing, deployed infrastructure. In these papers, we introduce a new approach to using local induction-loop data and show its improvement over the currently observed delays in traffic flow. For better light control, we introduce micro-auctions based on local induction-loop information; no other outside sources of information are assumed. To allow us to compare with current traffic, we create traffic-light logic for each light in our simulation which results in the best match to the observed travel times. We test our different light-control approaches on real-world data collected over a period of several weeks around the Mountain View, California area. In our simulations, the micro-auction mechanisms (based only on local sensor data) surpass longer-term planning approaches that rely on widely placed sensors and communications.
Much of the recent success of neural networks can be attributed to the deeper architectures that have become prevalent. However, the deeper architectures often yield unintelligible solutions, require enormous amounts of labeled data, and still remain brittle and easily broken. In this paper, we present a method to efficiently and intuitively discover input instances that are misclassified by well-trained neural networks. As in previous studies, we can identify instances that are so similar to previously seen examples that the transformation is visually imperceptible. Additionally, unlike in previous studies, we can also generate mistakes that are significantly different from any training sample, while, importantly, still remaining in the space of samples that the network should be able to classify correctly. This is achieved by training a basket of N "peer networks" rather than a single network. These are similarly trained networks that serve to provide consistency pressure on each other. When an example is found for which a single network, S, disagrees with all of the other N - 1 networks, which are consistent in their prediction, that example is a potential mistake for S. We present a simple method to find such examples and demonstrate it on two visual tasks. The examples discovered yield realistic images that clearly illuminate the weaknesses of the trained models, as well as provide a source of numerous, diverse, labeled-training samples.
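The disagreement test at the heart of the method can be sketched directly: given the class predictions of the N peer networks on a pool of candidate inputs, flag every input where exactly one network dissents from an otherwise unanimous vote. (The candidate-generation step, which searches for such inputs, is omitted here.)

```python
import numpy as np

def find_peer_mistakes(predictions):
    """predictions: (n_networks, n_examples) array of class labels.
    Returns (example_index, odd_network_index) pairs where exactly one
    network disagrees with the consistent prediction of all its peers."""
    n_nets, n_examples = predictions.shape
    mistakes = []
    for j in range(n_examples):
        labels = predictions[:, j]
        values, counts = np.unique(labels, return_counts=True)
        # exactly two distinct labels, one of them held by a lone dissenter
        if len(values) == 2 and counts.min() == 1:
            odd = int(np.where(labels == values[np.argmin(counts)])[0][0])
            mistakes.append((j, odd))
    return mistakes
```

Each flagged example is a potential mistake for the dissenting network, since its N - 1 peers agree on a different answer.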
Many web-based application areas must infer label
distributions starting from a small set of sparse, noisy
labels. Previous work has shown that graph-based
propagation can be very effective at finding the best label
distribution across nodes, starting from partial information
and a weighted-connection graph. In their work on video
recommendations, Baluja et al. showed high-quality results
using Adsorption, a normalized propagation process. An
important step in the original formulation of Adsorption was
re-normalization of the label vectors associated with each
node, between every propagation step. That interleaved
normalization forced computation of all label distributions,
in synchrony, in order to allow the normalization to be
correctly determined. Interleaved normalization also
prevented use of standard linear-algebra methods, like
stabilized bi-conjugate gradient descent (BiCGStab) and
Gaussian elimination. We show how to replace the
interleaved normalization with a single pre-normalization,
done once before the main propagation process starts,
allowing use of selective label computation (label slicing)
as well as large-matrix-solution methods. As a result, much
larger graphs and label sets can be handled than in the
original formulation and more accurate solutions can be
found in fewer propagation steps. We further extend that
work to handle graphs that change and expand over time. We
report results from using pre-normalized Adsorption in topic
labeling for web domains, using label slicing and BiCGStab.
We also report results from using incremental updates on
changing co-author network data. Finally, we discuss two
options for handling mixed-sign (positive and negative)
graphs and labels.
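With the interleaved normalization folded into a single pre-normalization, one label slice of the propagation becomes an ordinary sparse linear system that BiCGStab can solve. The damping factor and row-normalized connection matrix below are illustrative assumptions standing in for the paper's pre-normalized construction:

```python
import numpy as np
from scipy.sparse import csr_matrix, identity
from scipy.sparse.linalg import bicgstab

def propagate_label(W, seed, alpha=0.85):
    """Solve (I - alpha * W) x = (1 - alpha) * seed for one label slice.
    W is a sparse, row-normalized connection matrix; alpha is a
    hypothetical damping factor.  Label slicing means each label's
    vector can be solved independently like this."""
    n = W.shape[0]
    A = identity(n, format="csr") - alpha * W
    x, info = bicgstab(A, (1 - alpha) * seed)
    assert info == 0, "BiCGStab did not converge"
    return x
```

Because each slice is an independent solve, only the labels of interest need to be computed, which is what makes much larger label sets tractable than under the interleaved formulation.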
Portions of this work were also presented at University of Tokyo, November 2012.
Portions of this work were also presented at the IEEE DSP Workshop (Napa) plenary, August 2013.
Selecting a good representation of a solution space is vital to solving any search and optimization problem. In particular, once regions of high performance are found, having the property that small changes in the candidate solution correspond to searching nearby neighborhoods provides the ability to perform effective local optimization. To achieve this, it is common for stochastic search algorithms, such as stochastic hillclimbing, evolutionary algorithms (including genetic algorithms), and simulated annealing, to employ Gray Codes for encoding ordinal points or discretized real numbers. In these papers, we present novel methods to label similar and/or close points within arbitrary graphs with small Hamming distances. The resultant point labels can be seen as an approximate high-dimensional variant of Gray Codes with standard Gray Codes as a subset of the labels found. The labeling procedure is applicable to any task in which the solution requires the search algorithm to select a small subset of items out of many. Such tasks include vertex selection in graphs, knapsack-constrained item selection, bin packing, prototype selection for machine learning, and numerous scheduling problems, to name a few.
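For reference, the standard binary-reflected Gray code that these labelings generalize has a one-line construction, and adjacent ordinals always differ in exactly one bit:

```python
def gray_code(n):
    """Binary-reflected Gray code of the ordinal n."""
    return n ^ (n >> 1)

def hamming(a, b):
    """Hamming distance between two integer bit labels."""
    return bin(a ^ b).count("1")
```

The graph-labeling methods described above aim for the analogous property on arbitrary graphs: nearby nodes receive labels at small Hamming distance, so small label mutations correspond to local moves in the solution space.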
Understanding the backgrounds and interest of the people who are consuming a piece of content, such as a news story, video, or music, is vital for the content producer as well as the advertisers who rely on the content to provide a channel on which to advertise. We extend traditional search-engine query log analysis, which has primarily concentrated on analyzing either single or small groups of queries or users, to examining the complete query stream of very large groups of users -- the inhabitants of 13,377 cities across the United States. Query logs can be a good representation of the interests of a city's inhabitants and a useful characterization of the city itself. Further, we demonstrate how query logs can be effectively used to gather city-level statistics sufficient for providing insights into the similarities and differences between cities. Cities that are found to be similar through the use of query analysis correspond well to the similar cities as determined through other large-scale and time-consuming direct measurement studies, such as those undertaken by the Census Bureau. Extensive experiments are provided.
Current standard image-compression approaches rely on fairly simple predictions, using either block- or wavelet-based methods. While many more sophisticated texture-modeling approaches have been proposed, most do not provide a significant improvement in compression rate over the current standards at a workable encoding complexity level. We re-examine this area, using example-based texture prediction. We find that we can provide consistent and significant improvements over JPEG, reducing the bit rate by more than 20% for many PSNR levels. These improvements require consideration of differences between residual energy and prediction/residual compressibility when selecting a texture prediction, as well as careful control of the computational complexity in encoding.
Finding similar images in a large database is an important, but often computationally expensive, task. In this paper, we present a two-tier similar-image retrieval system with the efficiency characteristics found in simpler systems designed to recognize near-duplicates. We compare the efficiency of lookups based on random projections and learned hashes to 100-times-more-frequent exemplar sampling. Both approaches significantly improve on the results from exemplar sampling, despite having significantly lower computational costs. Learned-hash keys provide the best result, in terms of both recall and efficiency.
Several search engines now allow a hybrid search approach: starting from a text-based query, the search can be refined by picking an example image and re-ranking based on image similarity. We have found that learning distinct distance functions for different search queries can improve this re-ranking. We propose scalable solutions to learning query-specific distance functions, using the query-log data for training. This paper evaluates this approach through comprehensive human evaluation, and compares the results to Google image search.
We compare different clustering approaches for selecting a set of search-result exemplar images, similar to what is used for Image Swirl (see below). Our evaluation covers both the "correctness" and efficiency of the clustering results. We evaluate these approaches on 900 diverse queries, each associated with 1000 web images, by comparing the exemplars chosen by clustering to the top 20 images for that search term. Our results suggest that Affinity Propagation is effective in selecting exemplars that match the top search images but at high computational cost. We improve on these early results using a simple distribution-based selection filter on incomplete clustering results. This improvement allows us to use more computationally efficient approaches to clustering, such as Hierarchical Agglomerative Clustering (HAC) and Partitioning Around Medoids (PAM), while still reaching the same (or better) quality of results as were given by Affinity Propagation in the original study. The computational savings is significant since these alternatives are 7 to 27 times faster than Affinity Propagation.
Image Swirl, released in November 2009, organizes
image search results based on their visual and semantic
similarities and presents them in an intuitive exploratory
interface. It clusters the top image search results for
more than 200,000 queries and lets you explore the clusters
and the relation between images. Once you find the group of
images you're interested in, you can click on the thumbnail
and a cluster of images will "swirl" into view. You can then
further explore additional sub-groups within any cluster.
Also demonstrated at IEEE International Conference on Computer Vision and Pattern Recognition 2010.
This work addresses the problem of finding images and line-drawings in scanned pages. It is a crucial processing step in the creation of a large-scale system to detect and index images found in books and historic documents. Within the scanned pages that contain both text and images, the images are found through the use of local-feature extraction, applied across the full scanned page. This is followed by a novel learning system to categorize the local features into either text or image. The discrimination is based on using multiple classifiers trained via stochastic sampling of weak classifiers for each AdaBoost stage. The approach taken in sampling includes stochastic hill climbing across weak detectors, allowing us to reduce our classification error by as much as 25% relative to more naive stochastic sampling. Stochastic hill climbing in the weak classifier space is possible due to the manner in which we parameterize the weak classifier space. Through the use of this system, we improve image detection by finding more line-drawings, graphics, and photographs, as well as reducing the number of spurious detections due to misclassified text, discoloration, and scanning artifacts.
Locality Sensitive Hashing (LSH) is widely used for
efficient retrieval of candidate matches in very large
audio, video, and image systems. However, extremely large
reference databases necessitate a guaranteed limit on the
memory used by the table lookup itself, no matter how the
entries crowd different parts of the signature space, a
guarantee that LSH does not give. In this paper, we provide
such guaranteed limits, primarily through the design of the
LSH bands. When combined with data-adaptive bin splitting
(needed on only 0.04% of the occupied bins), this approach
provides the required guarantee on memory usage. At the
same time, it avoids the reduced recall that more extensive
use of bin splitting would give.
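A much-simplified sketch of the memory guarantee: cap each band bucket, and when a bucket overflows, split it once on an extra, data-chosen signature bit. The band layout and split policy below are illustrative assumptions; the band design described above does most of the work, so splitting is needed only rarely.

```python
import random
from collections import defaultdict

class CappedLSHTable:
    """Toy LSH band table with a hard cap on bucket size: an overfull
    bucket is split once on an extra signature bit.  A simplified
    sketch, not the paper's band-design construction."""

    def __init__(self, band_bits=8, cap=4, seed=0):
        self.band_bits = band_bits
        self.cap = cap
        self.buckets = defaultdict(list)
        self.split = set()        # band keys that have been split
        self.extra_bit = {}       # band key -> extra signature-bit index
        self.rng = random.Random(seed)

    def _key(self, sig):
        base = sig & ((1 << self.band_bits) - 1)
        if base in self.split:
            return (base, (sig >> self.extra_bit[base]) & 1)
        return base

    def insert(self, sig, item):
        key = self._key(sig)
        self.buckets[key].append((sig, item))
        if not isinstance(key, tuple) and len(self.buckets[key]) > self.cap:
            # Split: mark the band key, pick an extra bit above the band,
            # and redistribute the occupants into the two sub-buckets.
            self.split.add(key)
            self.extra_bit[key] = self.band_bits + self.rng.randrange(8)
            for s, it in self.buckets.pop(key):
                self.buckets[self._key(s)].append((s, it))

    def lookup(self, sig):
        return [item for _, item in self.buckets.get(self._key(sig), [])]
```

Lookups remain a single hash probe: the probe signature is routed through the same split logic as the insertions, so items stay retrievable after a split.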
Portions of this work were also presented at the IEEE DSP Workshop (Napa) plenary, August 2013.
Locality-sensitive hashing (LSH) is used in many
applications that require retrieval and evaluation of
approximate nearest neighbors. The uniformity of the
signature distributions has a strong effect on the number of
candidates that must be considered. The effects of this
uneven distribution are squared, since retrieval probes are
taken from the same uneven distribution that resulted in
uneven occupancy in the table itself. In this study, we
introduce the idea of permutation grouping to intelligently
design the hash functions that are used to index the LSH
tables. This reduces the inefficiencies introduced by
hashing real-world data that is noisy, structured, and most
importantly is not independently or identically
distributed. Through large-scale tests, we find that
permutation-grouping dramatically increases the efficiency
of the overall retrieval system by lowering the number of
low-probability candidates that must be examined by 30-50%.
Portions of this work were also presented at the IEEE DSP Workshop (Napa) plenary, August 2013.
We want to be able to efficiently find similar items in large, high-dimensional datasets, for things like music, image, and video retrieval. Scaling lookups for any large corpus tends to lead to tree or hash-based approaches. Correctly structuring such trees or hash functions is made more difficult than in other large-corpus domains by an imprecise definition of similarity. In this work, we found a method to learn a similarity function from only weakly labeled positive examples. We used this similarity function as the basis of a hash function to severely constrain the number of points considered for each lookup. In testing on a large real-world audio dataset, only a tiny fraction of the points (~0.27%) were ever considered for each lookup. To increase efficiency, we did no comparisons in the original high-dimensional space of points, yet still achieved nearly 99% accuracy on 5-second-long probes.
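The lookup structure can be illustrated with a generic sign-of-random-projection hash standing in for the learned similarity function: only the points sharing the probe's bucket are ever compared, so the vast majority of the corpus is never touched.

```python
import numpy as np

def build_hash_index(points, n_bits=10, seed=0):
    """Index points by the signs of random projections.  This generic
    hash is a stand-in for the learned similarity-based hash described
    above; only bucket membership is stored."""
    rng = np.random.RandomState(seed)
    proj = rng.randn(points.shape[1], n_bits)
    codes = (points @ proj > 0) @ (1 << np.arange(n_bits))
    index = {}
    for i, c in enumerate(codes):
        index.setdefault(int(c), []).append(i)
    return proj, index

def query(probe, proj, index):
    """Consider only candidates that share the probe's hash bucket."""
    bits = 1 << np.arange(proj.shape[1])
    return index.get(int((probe @ proj > 0) @ bits), [])
```

The candidate fraction quoted above (~0.27% of points per lookup) corresponds to how few index entries a well-designed hash routes each probe to.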
This paper extends the previous work on Waveprint
(see below) to handle open-set detection and identification. To do
this, we re-examine the best-ranked matches from Waveprint using
temporal-ordering-based processing. The resulting system has
excellent detection capabilities for small snippets of audio that
have been degraded in various ways, including competing noise, poor
recording quality, and cell-phone playback. The system is more
accurate than the previous state-of-the-art system while being more
efficient and flexible in memory usage and computation.
Also presented, by invitation, at International Workshop on Computer Vision, 2010.
In this paper, we introduce Waveprint, a novel
method for audio identification. Waveprint uses a
combination of computer-vision techniques and
large-scale-data-stream processing algorithms to create
compact fingerprints of audio data that can be efficiently
matched. The resulting system has excellent identification
capabilities for small snippets of audio that have been
degraded in a variety of manners, including competing noise,
poor recording quality, and cell-phone playback. We
explicitly measure the tradeoffs between performance, memory
usage, and computation through extensive experimentation.
Portions of this work were also presented at the IEEE DSP Workshop (Napa) plenary, August 2013.
We propose a method for detecting and precisely segmenting repeated sections of broadcast streams. The detection stage starts from acoustic matches and validates the hypothesized matches using the visual channel. The precise segmentation uses fine-grain acoustic match profiles to determine start and end points. The approach is both efficient and robust to broadcast noise and differences in broadcaster signals. Our final results are nearly perfect, with better than 99% precision at a recall rate of 95% for repeated advertisements.
We describe mass personalization, a framework for combining mass media with a highly personalized Web-based experience. We introduce four applications for mass personalization: personalized content layers, ad hoc social communities, real-time popularity ratings and virtual media library services. Using the ambient audio originating from the television, the four applications are available with no more effort than simple television channel surfing. Our audio identification system does not use dedicated interactive TV hardware and does not compromise the user's privacy. Feasibility tests of the proposed applications are provided both with controlled conversational interference and with "living-room" evaluations.
Streaming video to mobile devices like cellular phones is an important emerging application. However, mobile thin clients and telephony-infrastructure constraints limit the capabilities and interfaces available to the end user. This paper describes our design of an end-to-end interactive video telephony service. The challenge of providing interactive services is to remain compliant with the telco signaling and data protocols, without requiring software downloads onto the handset, while providing responsive, interactive controls. Our service provides VCR-like control of telephony video, maintains audio-video synchronization, and respects the video frame and bit-rate contracts of the telephony channel. This research was completed at HP Labs.
This paper describes a mathematical approach to deriving predictive models of streaming-server saturation (saturating/non-saturating) through hidden-variable estimation. We use a 240-dimensional measurement vector to estimate a low-dimensional vector of continuous-valued hidden variables, with each hidden-variable dimension corresponding to a "discovered" resource on which the streaming server depends. The discovery process uses labelled calibration data, along with a POCS-like alternation of constraints, to create the best (in a total-least-squares sense) predictors for server saturation, both under the current (unknown) set of client sessions and under some (proposed) increase in the client sessions. These models allow us to make admission-control decisions as well as to indicate which types of additional clients the streaming server can currently handle. (A description of the data-collection regime is available here. A longer version of these two papers is available here.) This research was completed at HP Labs.
Methods for automatically managing the performance of computing services must estimate a performance model of that service. This paper explores properties that are necessary for performance-model estimation of black-box computer systems when used together with adaptive feedback loops. It shows that the standard method of least-squares estimation often gives rise to models that make the control loop perform the opposite action of what is desired. This produces large oscillations and bad tracking performance. The paper evaluates what combination of input and output data provides models with the best properties for the control loop. Plus, it proposes three extensions to the controller that make it perform well, even when the estimated model would have degraded performance.
Our proposed techniques are evaluated with an adaptive controller that provides latency targets for workloads on black-box computer services under a variety of conditions. The techniques are evaluated on two systems: a three-tier ecommerce site and a web server. Experimental results show that our best estimation approach significantly improves the ability of the controller to meet the latency goals. With our techniques, previously oscillating workload latencies become smooth around the latency targets. This research was completed at HP Labs.
This paper presents Media Services Architecture (MSA). MSA is a flexible, general architecture for requesting, configuring, and running services that operate on streaming audio and video as it flows through the network. MSA decomposes requested media services into modular processing components that may be distributed to servers throughout the network and which intercommunicate via standard streaming protocols. Use of standard protocols also affords seamless inter-operability between MSA and media content delivery networks. MSA manages media services by monitoring the networked servers and assigning service components to them in a manner that uses available computational and network resources efficiently. We describe some implemented example services to illustrate the concepts and benefits of the architecture. This research was completed at HP Labs.
Current mobile devices and wireless access allow delivery of video streams to a wide range of cellular devices. The wide range and variability of the network, processor, and display conditions on these devices effectively require streaming video to be dynamically tailored to each client's changing constraints. Real-time compressed-domain video transcoding allows each live streaming session to be tailored to this changing environment in a practical and affordable manner. However, even with new advances in video-transcoding efficiency, the computational, bandwidth, and scheduling requirements of live video processing result in the need for managed placement of these tasks, for best use of the distributed resources available within the network. In this paper, we discuss our approach to service-location management (SLM). We compare the performance of alternative implementations of SLM resource monitoring. Finally, we present our conclusions on which of these alternate implementations is both most reliable and most extensible to the service of large numbers of mobile client requests. This research was completed at HP Labs.
This paper presents a novel, real-time, minimal-latency technique for dissolve detection which handles the widely varying camera techniques, expertise, and overall video quality seen in amateur, semi-professional, and professional video footage. We achieve 88% recall and 93% precision for dissolve detection. In contrast, on the same data set, at a similar recall rate (87%), DCD has more than 3 times the number of false positives, giving a precision of only 81% for dissolve detection. This research was completed at YesVideo, Inc.
This paper describes techniques to change the playback speed of MPEG-compressed audio, without first decompressing the audio file. There are two primary contributions in this paper. 1) We describe three techniques to perform time-scale modification in the maximally decimated domain. 2) We show how to infer the output of the auditory masking model on the new audio stream, using the information in the original file. This new FastMPEG algorithm is more than an order of magnitude more efficient than decompressing the audio, performing time-scale modification in the conventional time domain, and then recompressing. Samples of our results can be found here. This research was completed at Interval Research Corporation.
This paper defines a local image transform based on cumulative similarity measures and shows it to enable efficient correspondence and tracking near occluding boundaries. Unlike traditional methods, this transform allows correspondences to be found when the only contrast present is the occluding boundary itself and when the sign of contrast along the boundary is possibly reversed. The transform is based on the idea of a cumulative similarity measure which characterizes the shape of local image homogeneity; both the value of an image at a particular point and the shape of the region with locally similar and connected values is captured. This representation is insensitive to structure beyond an occluding boundary but is sensitive to the shape of the boundary itself, which is often an important cue. We show results comparing this method to traditional least-squares and robust correspondence matching. This research was completed at Interval Research Corporation.
This paper develops an optimal linear algorithm that finds the degree of synchronization between the audio and image recordings of a human speaker. Using canonical correlation, it finds the best direction to combine all the audio and image data, projecting them onto a single axis. FaceSync uses Pearson's correlation to measure the degree of synchronization between the audio and image data. We derive the optimal linear transform to combine the audio and visual information and describe an implementation that avoids the numerical problems caused by computing the correlation matrices. This research was completed at Interval Research Corporation.
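A numerically cautious version of the canonical-correlation computation can whiten each stream via an eigendecomposition (with a small regularizer) and read the canonical correlations off an SVD. The details below are a sketch under those assumptions, not the paper's exact implementation:

```python
import numpy as np

def first_canonical_correlation(A, V, reg=1e-6):
    """Largest canonical correlation between row-wise samples A (e.g.,
    audio features) and V (e.g., image features).  Whitening by
    regularized eigendecomposition sidesteps the conditioning problems
    of working with raw correlation matrices."""
    A = A - A.mean(axis=0)
    V = V - V.mean(axis=0)

    def whiten(X):
        C = X.T @ X / len(X) + reg * np.eye(X.shape[1])
        w, U = np.linalg.eigh(C)
        return X @ (U * w ** -0.5) @ U.T  # X @ C^(-1/2)

    Aw, Vw = whiten(A), whiten(V)
    return np.linalg.svd(Aw.T @ Vw / len(A), compute_uv=False)[0]
```

A value near 1 indicates a strong linear relationship between the two streams along some pair of projection axes, which is how synchronized audio and face motion separate from unsynchronized pairs.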
This paper explores several approaches for articulated-pose estimation, assuming that video-rate depth information is available, from either stereo cameras or other sensors. We use these depth measurements in the traditional linear brightness constraint equation, as well as in a depth constraint equation. To capture the joint constraints, we combine the brightness and depth constraints with twist mathematics. We address several important issues in the formation of the constraint equations, including updating the body rotation matrix without using a first-order matrix approximation and removing the coupling between the rotation and translation updates. The resulting constraint equations are linear on a modified parameter set. After solving these linear constraints, there is a single closed-form non-linear transformation to return the updates to the original pose parameters. We show results for tracking body pose in oblique views of synthetic walking sequences and in moving-camera views of synthetic jumping-jack sequences. We also show results for tracking body pose in side views of a real walking sequence. This research was completed at Interval Research Corporation.
Dynamic contours, or snakes, provide an effective method for tracking complex moving objects for segmentation and recognition tasks, but have difficulty tracking occluding boundaries on cluttered backgrounds. To compensate for this shortcoming, dynamic contours often rely on detailed object-shape or -motion models to distinguish between the boundary of the tracked object and other boundaries in the background. In this paper, we present a complementary approach to detailed object models: We improve the discriminative power of the local image measurements that drive the tracking process. We describe a new robust external-energy term for dynamic contours that can track occluding boundaries without detailed object models. We show how our image model improves tracking in cluttered scenes, and describe how a fine-grained image-segmentation mask is created directly from the local image measurements used for tracking. This research was completed at Interval Research Corporation.
Speech is one of the most common and richest methods that people use to communicate with one another. Our facility with this communication form makes speech a good interface for communicating with or via computers. At the same time, our familiarity with speech makes it difficult to generate synthetic but natural-sounding speech and synthetic but natural-looking lip-synced faces. One way to reduce the apparent unnaturalness of synthetic audible and visual speech is to modify natural (human-produced) speech. This approach relies on examples of natural speech and on simple models of how to take those examples apart and to put them back together to create new utterances.
We discuss two such techniques in depth. The first technique, Mach1, changes the overall timing of an utterance, with little loss in comprehensibility and with no change in the wording of or emphasis within what was said or in the identity of the voice. This ability to speed up (or slow down) speech will make speech a more malleable channel of communication. It gives the listener control over the amount of time that she spends listening to a given oration, even if the presentation of that material is prerecorded. The second technique, Video Rewrite, synthesizes images of faces, lip synced to a given utterance. This tool could be useful for reducing the data rate for video conferencing [Rao98], as well as for providing photorealistic avatars. This research was completed at Interval Research Corporation.
This paper describes Mach1, a new approach to nonuniform time compression for speech. Mach1 was designed to mimic the natural timing of fast speech. At identical overall compression rates, listener comprehension for Mach1-compressed speech increased between 5 and 31 percentage points over that for linearly compressed speech, and response times dropped by 15%. For rates between 2.5 and 4.2 times real time, there was no significant comprehension loss with increasing Mach1 compression rates. In A-B preference tests, Mach1-compressed speech was chosen 95% of the time. This paper describes the Mach1 technique and our listener-test results. Audio examples are provided on the web page referenced above. This research was completed at Interval Research Corporation.
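The flavor of nonuniform compression can be conveyed with a toy frame selector. This is an illustration only, not the Mach1 algorithm: the energy-driven local rates and the factor of 2 below are assumptions chosen for the sketch:

```python
def compress_frames(frames, energies, overall_rate):
    """Drop frames nonuniformly: quiet frames (pauses) are compressed
    up to twice as hard as the overall rate, loud frames much less,
    with local rates rescaled so their mean equals overall_rate."""
    e_max = max(energies)
    raw = [overall_rate * (2.0 - e / e_max) for e in energies]
    scale = overall_rate * len(raw) / sum(raw)
    rates = [r * scale for r in raw]
    out, acc = [], 0.0
    for f, r in zip(frames, rates):
        acc += 1.0 / r          # each kept frame "covers" r input frames
        if acc >= 1.0:
            out.append(f)
            acc -= 1.0
    return out

frames = list(range(100))
energies = [i % 2 for i in range(100)]      # alternating quiet/loud
short = compress_frames(frames, energies, 2.0)
```

Compared with uniform (linear) compression at the same overall rate, this style of scheme spends more of the time budget on the high-energy regions that carry the intelligibility.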
Video Rewrite uses existing footage to automatically create new video of a person mouthing words that she did not speak in the original footage. This technique is useful in movie dubbing, for example, where the movie sequence can be modified to sync the actors' lip motions to the new soundtrack.
Video Rewrite uses computer-vision techniques to track points on the speaker's mouth in the training footage, and morphing techniques to combine these mouth gestures into the final video sequence. The new video combines the dynamics of the original actor's articulations with the mannerisms and setting dictated by the background footage. Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack. This research was completed at Interval Research Corporation.
Eigen-points estimates the image-plane locations of fiduciary points on an object. By estimating multiple locations simultaneously, eigen-points exploits the inter-dependence between these locations. This is done by associating neighboring, inter-dependent control-points with a model of the local appearance. The model of local appearance is used to find the feature in new unlabeled images. Control-point locations are then estimated from the appearance of this feature in the unlabeled image. The estimation is done using an affine manifold model of the coupling between the local appearance and the local shape.
Eigen-points uses models aimed specifically at recovering shape from image appearance. The estimation equations are solved non-iteratively, in a way that accounts for noise in the training data and the unlabeled images and that accounts for uncertainty in the distribution and dependencies within these noise sources. This research was completed at Interval Research Corporation.
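A scalar toy version of the appearance-to-shape coupling looks like this (a sketch only: the paper fits a full affine manifold over many coupled control points and models the noise sources, and the feature values below are made up):

```python
def fit_affine(appearances, shapes):
    """Least-squares fit of shape = a * appearance + b from labeled
    training pairs; once fit, it is applied non-iteratively to the
    appearance measured in a new, unlabeled image."""
    n = len(appearances)
    mx = sum(appearances) / n
    my = sum(shapes) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(appearances, shapes))
         / sum((x - mx) ** 2 for x in appearances))
    return a, my - a * mx

# hypothetical training data: appearance feature -> control-point x-coord
train_feat = [0.0, 1.0, 2.0, 3.0]
train_loc = [10.0, 12.0, 14.0, 16.0]
a, b = fit_affine(train_feat, train_loc)
print(a * 1.5 + b)   # 13.0: predicted location for a new appearance value
```

The non-iterative character shown here is the point: given the learned coupling, a new image yields control-point estimates in a single linear step.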
This paper describes techniques to automatically morph from one sound to another. Audio morphing is accomplished by representing the sound in a multi-dimensional space that is warped or modified to produce a desired result. The multi-dimensional space encodes the spectral shape and pitch on orthogonal axes. After matching components of the sound, a morph smoothly interpolates the amplitudes to describe a new sound in the same perceptual space. Finally, the representation is inverted to produce a sound. This paper describes representations for morphing, techniques for matching, and algorithms for interpolating and morphing each sound component. Spectrographic images of a complete morph are shown at the end of the paper. This research was completed at Interval Research Corporation.
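The interpolation step can be sketched for the amplitude axis alone. Interpolating in the log domain is an assumption made for this illustration, not necessarily the paper's exact scheme:

```python
import math

def morph_magnitudes(mag_a, mag_b, t):
    """Interpolate two matched magnitude spectra at morph position
    t in [0, 1], working in the log domain so that amplitude ratios
    (rather than differences) blend smoothly."""
    return [math.exp((1 - t) * math.log(a) + t * math.log(b))
            for a, b in zip(mag_a, mag_b)]

spec_a = [1.0, 4.0, 2.0]
spec_b = [1.0, 1.0, 8.0]
print(morph_magnitudes(spec_a, spec_b, 0.5))   # ~[1.0, 2.0, 4.0]
```

A full morph applies the same idea along each orthogonal axis of the perceptual space (spectral shape and pitch) after the components have been matched.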
These papers address the problem of matching distinct but related objects across images, for applications such as view-based model capture, virtual presence, and morphing special effects. A new approach to match estimation is presented for estimating correspondences between distinct objects. This approach minimizes the effects of lighting variations and non-rigid deformations. A sparse set of features is used for alignment, followed by image-plane warping, or morphing, based on these constraints. This research was completed at Interval Research Corporation.
Motion estimation is an important and well-studied method for determining correspondences between images in video processing. In contrast with motion estimation, the use of image warping on image sequences is a new and developing field. To provide image warping effects, morphing algorithms rely on mesh points to align two images. The selection of these mesh points is a labor-intensive, manual process. This paper presents one approach to automating the selection of mesh points for morphing by recognizing this problem as one of determining image correspondences. Applications include video compression, animation sequence generation and high-quality time dilation of video. This research was completed at Interval Research Corporation.
Traditional signal processing by computer has relied on numerical methods. This chapter explores how symbolic representation and manipulation methods may be incorporated in the numerical processing of signals and how these methods may be used in the design and analysis of signal processing systems. Using symbolic manipulation, a high-level compiler was developed to exploit the underlying mathematics of the DSP systems. This compiler uncovered a new, unconventional approach to implementing a modulated filter bank. The details of this newly discovered algorithm were published independently (see the next paper). This chapter also describes one approach to recording and propagating computational cost within a DSP system. This research was completed at MIT, with funding from the National Science Foundation Fellowship Program, from the Advanced Research Projects Agency, and from Sanders Associates, Inc.
An implementation of a modulated filter bank is presented which, under some operating conditions, reduces the computational complexity from O(N log N) to O(N). The new structure is efficient when the application requires dense temporal (or spatial) sampling of the STFT output. One example of such an application is code-division, multiple-beam sonar. The computational complexity of this new approach is slightly lower than that of the extended Goertzel algorithm. Unlike the extended Goertzel algorithm, which is only marginally stable, the new structure is unconditionally stable. This research was completed at MIT, with funding from the National Science Foundation Fellowship Program, from the Advanced Research Projects Agency, and from Sanders Associates, Inc.
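For context, a classical recursive structure in this family, the sliding DFT (closely related to the Goertzel approach), updates each STFT bin with O(1) work per output sample, which is why recursive structures win when the STFT output is densely sampled. Its pole sits on the unit circle, which is exactly the marginal stability the paper's new structure avoids. The sketch below is that classical recursion, not the paper's filter bank:

```python
import cmath

def sliding_dft_bin(x, k, N):
    """Track bin k of an N-point DFT over a sliding window using
    S[n] = W * (S[n-1] + x[n] - x[n-N]),  W = exp(2j*pi*k/N).
    The recursion's pole at W (|W| = 1) lies on the unit circle,
    so rounding errors never decay: it is only marginally stable."""
    W = cmath.exp(2j * cmath.pi * k / N)
    S, out = 0.0, []
    for n, xn in enumerate(x):
        oldest = x[n - N] if n >= N else 0.0
        S = W * (S + xn - oldest)
        out.append(S)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
bins = sliding_dft_bin(x, k=0, N=4)
print(bins[-1].real)   # 18.0: the sum of the last 4 samples (k = 0)
```

Each new input sample costs one complex multiply and two additions per bin, versus O(log N) per bin for an FFT recomputed at every sample position.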
A system is presented which enforces very low data rates, ensures a low maximum system delay, and allows for inexpensive decoders. To achieve this, the gray-scale video conferencing sequences were converted to binary (black/white) images. Spatial and temporal differencing and statistical coding were then used for compression, as were "focus-of-attention" methods of partial update. In this manner, faces and scenes were successfully encoded and decoded using simple processors at data rates below 19.2 kbaud (hard maximum) with no more than two frames of delay. This research was completed at MIT, with funding from the National Science Foundation Fellowship Program.
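In its simplest form, the differencing-plus-coding pipeline reduces to XOR-ing consecutive binary frames and run-length coding the sparse result (a minimal sketch with made-up frames; the paper's actual statistical coder and focus-of-attention logic are not reproduced):

```python
def frame_diff(prev, cur):
    """Temporal difference of two binary frames: 1 marks changed pixels."""
    return [a ^ b for a, b in zip(prev, cur)]

def run_length(bits):
    """Run-length encode a bit list as (value, run) pairs; on mostly
    unchanged frames, a few long zero runs replace many raw bits."""
    runs, val, count = [], bits[0], 0
    for b in bits:
        if b == val:
            count += 1
        else:
            runs.append((val, count))
            val, count = b, 1
    runs.append((val, count))
    return runs

prev = [0, 0, 1, 1, 0, 0, 0, 0]
cur  = [0, 0, 1, 1, 1, 1, 0, 0]   # a small region changed
print(run_length(frame_diff(prev, cur)))   # [(0, 4), (1, 2), (0, 2)]
```

Because consecutive video-conferencing frames are nearly identical, the difference frames are mostly zero, which is what lets simple statistical codes reach very low bit rates with almost no decoder complexity.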
An imaging and analysis regime is presented for automatically estimating MRI quality measures, such as slice thickness, separation, orientation, and focus. Results of the automated analysis for MRI system examples are in good agreement with expectations from theory and with more manual tests. This research was completed at U. Michigan Department of Radiological Physics and Engineering.