Predicting Saliency and Aesthetics in Images: A Bottom-up Perspective
PhD thesis from Universitat Autònoma de Barcelona
Advisors: Xavier Otazu, Maria Vanrell
Dec 2012
This dissertation investigates two different aspects of how an observer experiences a natural image: (i)
where we look, namely, where attention is guided, and (ii) what we like, that is, whether or not the
image is aesthetically pleasing. These two experiences are the subjects of increasing research efforts
in computer vision. The ability to predict visual attention has wide applications, from object recognition
to marketing. Aesthetic quality prediction is becoming increasingly important for organizing and
navigating the ever-expanding volume of visual content available online and elsewhere. Both visual
attention and visual aesthetics can be modeled as a consequence of multiple interacting mechanisms,
some bottom-up or involuntary, and others top-down or task-driven. In this work we focus on a bottom-up
perspective, using low-level visual mechanisms and features, as it is here that the links between
aesthetics and attention may be more obvious and/or easily studied.
In Part 1 of the dissertation, we hypothesize that salient and non-salient image regions correspond to
the regions which are enhanced or assimilated, respectively, in standard low-level color image representations.
We test this hypothesis by adapting a low-level model of color perception into a saliency
estimation model. This model shares the three main steps found in many successful models for predicting
attention in a scene: convolution with a set of filters, a center-surround mechanism and spatial
pooling to construct a saliency map. For such models, integrating spatial information and justifying
the choice of various parameter values remain open problems. Our saliency model inherits a principled
selection of parameters as well as an innate spatial pooling mechanism from the perception model
on which it is based. This pooling mechanism has been fitted using psychophysical data acquired in
color-luminance setting experiments. The proposed model outperforms the state of the art at the task of
predicting eye-fixations on two datasets. After demonstrating the effectiveness of our basic saliency
model, we introduce an improved image representation, based on geometrical grouplets, that enhances
complex low-level visual features such as corners and terminations, and suppresses relatively simpler
features such as edges. With this improved image representation, the performance of our saliency model
in predicting eye-fixations increases for both datasets.
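The generic three-step pipeline shared by such models (filtering, center-surround differencing, spatial pooling) can be sketched in a few lines. The following is a minimal illustration using simple box filters and arbitrary scale values; it is not the perception-based model or the psychophysically fitted parameters of the thesis:

```python
import numpy as np

def box_blur(img, k):
    # Mean filter over a (2k+1)x(2k+1) window, with edge padding.
    pad = np.pad(img, k, mode="edge")
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for dr in range(2 * k + 1):
        for dc in range(2 * k + 1):
            out += pad[dr:dr + h, dc:dc + w]
    return out / (2 * k + 1) ** 2

def saliency_map(img, scales=(1, 2, 4), surround_factor=3):
    # Step 1: filter the image at several scales.
    # Step 2: center-surround differencing at each scale.
    # Step 3: pool the responses across scales into one saliency map.
    smap = np.zeros_like(img, dtype=float)
    for k in scales:
        center = box_blur(img, k)
        surround = box_blur(img, surround_factor * k)
        smap += np.abs(center - surround)
    return smap / (smap.max() + 1e-12)  # normalize to [0, 1]

# Toy usage: an isolated bright patch on a uniform background
# should produce its strongest responses in and around the patch.
img = np.zeros((64, 64))
img[28:36, 28:36] = 1.0
s = saliency_map(img)
```

The open problems the thesis addresses sit precisely in the choices this sketch makes arbitrarily: the filter family, the scale and surround parameters, and how responses are weighted during pooling.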
In Part 2 of the dissertation, we investigate the problem of aesthetic visual analysis. While a great
deal of research has been conducted on hand-crafting image descriptors for aesthetics, little attention so
far has been dedicated to the collection, annotation and distribution of ground truth data. Because image
aesthetics is complex and subjective, existing datasets, which have few images and few annotations,
have significant limitations. To address these limitations, we have introduced a new large-scale database
for conducting Aesthetic Visual Analysis, which we call AVA. AVA contains more than 250,000 images,
along with a rich variety of annotations. We investigate how the wealth of data in AVA can be used to
tackle the challenge of understanding and assessing visual aesthetics by looking into several problems
relevant for aesthetic analysis. We demonstrate that by leveraging the data in AVA, and using generic
low-level features such as SIFT and color histograms, we can exceed state-of-the-art performance in
aesthetic quality prediction tasks.
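As a concrete example of a generic low-level descriptor of the kind mentioned above, a joint color histogram can be computed as follows. This is an illustrative sketch only; the bin count is arbitrary and the SIFT-based bag-of-words features also used in the thesis are not shown:

```python
import numpy as np

def color_histogram(image, bins=4):
    # image: H x W x 3 array with channel values in [0, 1].
    # Quantize each channel into `bins` levels, then count joint
    # (R, G, B) occurrences, giving a bins**3-dimensional vector.
    q = np.clip((image * bins).astype(int), 0, bins - 1)
    idx = q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3).astype(float)
    return hist / hist.sum()  # L1-normalize

# Toy usage: a random image yields a 64-dimensional feature vector,
# which could then be fed to any standard classifier.
rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
f = color_histogram(img)
```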
Finally, we entertain the hypothesis that low-level visual information in our saliency model can also
be used to predict visual aesthetics by capturing local image characteristics such as feature contrast, grouping and isolation, characteristics thought to be related to universal aesthetic laws. We use the
weighted center-surround responses that form the basis of our saliency model to create a feature vector
that describes aesthetics. We also introduce a novel color space for fine-grained color representation.
We then demonstrate that the resultant features achieve state-of-the-art performance on aesthetic quality
classification.
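One rough way to turn spatially varying responses into a fixed-length descriptor, as this part of the thesis does, is to pool summary statistics from each response map. The sketch below assumes precomputed center-surround response maps and uses simple mean/std pooling; the actual thesis features are built from its weighted perception-model responses and a fine-grained color space, neither of which is reproduced here:

```python
import numpy as np

def pooled_descriptor(response_maps):
    # Pool each response map into its mean and standard deviation,
    # coarse summaries of local feature contrast; concatenating them
    # yields a fixed-length feature vector usable by a classifier.
    return np.array([stat for m in response_maps
                     for stat in (m.mean(), m.std())])

# Toy usage with three random "response maps": the descriptor has
# two entries (mean, std) per map, i.e. 6 dimensions here.
rng = np.random.default_rng(0)
maps = [rng.random((16, 16)) for _ in range(3)]
feat = pooled_descriptor(maps)
```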
A promising contribution of this dissertation is thus to show that several visual experiences,
namely low-level color perception, visual saliency and visual aesthetics estimation, may be successfully
modeled within a unified framework. This suggests a similar architecture in area V1 for both color
perception and saliency and adds evidence to the hypothesis that visual aesthetics appreciation is driven
in part by low-level cues.
BibTeX references
@PhdThesis{Mur2012,
  author   = "Naila Murray",
  title    = "Predicting Saliency and Aesthetics in Images: A Bottom-up Perspective",
  school   = "Universitat Aut\`onoma de Barcelona",
  month    = "Dec",
  year     = "2012",
  advisor1 = "3",
  advisor2 = "1",
  url      = "http://cat.cvc.uab.es/Public/Publications/2012/Mur2012"
}