Learning Dense Correspondence via 3D-guided Cycle Consistency
Paper: CVPR 2016 (Oral)
Link: https://arxiv.org/abs/1604.05383
Approach
- Goal: predict a dense flow/correspondence field between two images
- flow: per-pixel relative offset from source to target
- matchability: 1 if a correspondence exists, 0 if not
Cycle-consistency
- training quartet: two real images (r1, r2) plus a 3D CAD model rendered into two synthetic views (s1, s2) with matching viewpoints
- learn to predict flows F(s1→r1), F(r1→r2), F(r2→s2)
- F(s1→s2) serves as ground truth, provided by the rendering engine
Learning Dense Correspondence
- minimize objective function: truncated Euclidean loss between the composed flow F~(s1→s2) and the rendered ground truth F*(s1→s2)
- Transitive flow composition: F~(s1→s2) = F(r2→s2) ∘ F(r1→r2) ∘ F(s1→r1), i.e. the per-pixel offsets are chained along the cycle
- Truncated Euclidean loss: L(p) = min(||F~(s1→s2)(p) − F*(s1→s2)(p)||², T²)
- In experiments, T = 15 pixels;
- Why a truncated loss: to be more robust to spurious outliers during training, especially in the early stage when the network output tends to be highly noisy.
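The transitive composition and truncated loss above can be sketched in NumPy (a minimal illustration under stated assumptions, not the paper's implementation; the function names and nearest-neighbor sampling are my choices):

```python
import numpy as np

def compose_flow(flow_ab, flow_bc):
    """Chain two dense flow fields of relative offsets:
    flow_ac(p) = flow_ab(p) + flow_bc(p + flow_ab(p)).
    Flows are (H, W, 2) arrays of (dy, dx) offsets; nearest-neighbor
    sampling with border clipping is used here for simplicity."""
    H, W, _ = flow_ab.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Landing coordinates in image b, clipped to stay inside the grid.
    yb = np.clip(np.round(ys + flow_ab[..., 0]).astype(int), 0, H - 1)
    xb = np.clip(np.round(xs + flow_ab[..., 1]).astype(int), 0, W - 1)
    return flow_ab + flow_bc[yb, xb]

def truncated_euclidean_loss(flow_pred, flow_gt, T=15.0):
    """Mean of min(||pred - gt||^2, T^2) over all pixels, so any single
    pixel can contribute at most T^2 to the loss."""
    sq_err = np.sum((flow_pred - flow_gt) ** 2, axis=-1)
    return np.mean(np.minimum(sq_err, T ** 2))
```

Composing s1→r1 with r1→r2, then the result with r2→s2, gives the predicted F~(s1→s2) that is compared against the rendered ground truth.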
Learning Dense Matchability
- objective function: per-pixel cross-entropy loss between the composed matchability m~(s1→s2) and the ground truth
- m*(s1→s2): ground-truth matchability map, provided by the rendering engine
- Matchability map composition (multiplicative along the composed flow): m~(s1→s2)(p) = m(s1→r1)(p) · m(r1→r2)(p') · m(r2→s2)(p'')
- fix m(s1→r1) and m(r2→s2), and only train the CNN to infer m(r1→r2) (due to the multiplicative nature of the matchability composition)
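A minimal NumPy sketch of the multiplicative composition and the cross-entropy objective (function names, nearest-neighbor sampling, and the clipping epsilon are my assumptions, not the paper's code):

```python
import numpy as np

def compose_matchability(m_ab, m_bc, flow_ab):
    """Multiplicative matchability composition along a flow:
    m_ac(p) = m_ab(p) * m_bc(p + flow_ab(p)).
    m_* are (H, W) probability maps; flow_ab is (H, W, 2) of (dy, dx)."""
    H, W = m_ab.shape
    ys, xs = np.mgrid[0:H, 0:W]
    yb = np.clip(np.round(ys + flow_ab[..., 0]).astype(int), 0, H - 1)
    xb = np.clip(np.round(xs + flow_ab[..., 1]).astype(int), 0, W - 1)
    return m_ab * m_bc[yb, xb]

def matchability_loss(m_pred, m_gt, eps=1e-7):
    """Per-pixel binary cross-entropy between predicted and ground-truth
    matchability maps, averaged over all pixels."""
    p = np.clip(m_pred, eps, 1 - eps)
    return -np.mean(m_gt * np.log(p) + (1 - m_gt) * np.log(1 - p))
```

Chaining the composition twice (sampling the third map with the accumulated flow) yields the three-term product m(s1→r1) · m(r1→r2) · m(r2→s2); since the product zeros out gradients through the fixed terms, only the m(r1→r2) branch is trained.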
Network
- feature encoder of 8 convolution layers that extracts relevant features from both input images with shared network weights;
- flow decoder of 9 fractionally-strided/up-sampling convolution (uconv) layers that assembles features from both input images and outputs a dense flow field;
- matchability decoder of 9 uconv layers that assembles features from both input images, and outputs a probability map indicating whether each pixel in the source image has a correspondence in the target.
- each conv/uconv is followed by a ReLU (except the last uconv of each decoder)
- 3×3 kernels
- no pooling; stride = 2 whenever the spatial dimension is decreased (conv) or increased (uconv)
- the matchability decoder output passes through a sigmoid for normalization
- training: the same network (shared encoder, two decoders) is trained for both flow and matchability
Experiments
Training set
- real images: PASCAL3D+ dataset
- cropped from the object bounding box;
- rescaled to 128×128
- 3D CAD models: ShapeNet database
- render 3D models from the same viewpoints as the real images
- choose the K = 20 nearest models by Euclidean distance between HOG features
- valid training quartets per category: 80,000
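The model-retrieval step can be sketched as a simple nearest-neighbor search over precomputed HOG descriptors (the function name and the assumption that descriptors are precomputed into a matrix are mine):

```python
import numpy as np

def nearest_models(real_hog, model_hogs, K=20):
    """Return indices of the K rendered CAD model views whose HOG
    descriptors are closest (Euclidean distance) to the real image's
    descriptor. real_hog: (D,) vector; model_hogs: (N, D) matrix with
    one row per rendered view."""
    dists = np.linalg.norm(model_hogs - real_hog, axis=1)
    return np.argsort(dists)[:K]
```

In practice the descriptors would come from a HOG extractor run on the 128×128 crops and the same-viewpoint renderings; here they are just arrays.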
Network training
- Initialization:
- feature encoder + flow decoder pathway: mimic SIFT flow by randomly sampling image pairs from the training quartets and training the network to minimize the Euclidean loss between the network prediction and the SIFT flow output on the sampled pair
- other initialization strategies were tried (e.g. predicting ground-truth flows between synthetic images); initializing with SIFT flow output works the best.
- Parameters:
- ADAM solver with β1 = 0.9, β2 = 0.999, lr = 0.001, step size of 50k, step multiplier of 0.5, for 200k iterations.
- batch size = 40 during initialization and 10 quartets during fine-tuning.
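The quoted step schedule (halve the learning rate every 50k iterations, starting from 0.001) can be written out explicitly; this is a generic step-decay helper, not the authors' code:

```python
def learning_rate(iteration, base_lr=1e-3, step_size=50_000, gamma=0.5):
    """Step decay: multiply the base learning rate by gamma once per
    completed step_size iterations (matching the schedule in the notes)."""
    return base_lr * gamma ** (iteration // step_size)
```

So over the 200k training iterations the learning rate takes the values 1e-3, 5e-4, 2.5e-4, and 1.25e-4.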
Feature embedding
- the learned embedding layout appears to be viewpoint-sensitive (the network might implicitly learn that viewpoint is an important cue for the correspondence/matchability tasks through the consistency training)
Keypoint transfer task
Evaluate the quality of the correspondence output
- For each category, exhaustively sample image pairs from the val split (not seen during training), and determine whether a keypoint in the source image is transferred correctly by measuring the Euclidean distance between the correspondence prediction and the annotated ground truth (if it exists) in the target image.
- A correct transfer: the prediction falls within α · max(H, W) pixels of the ground truth, with H and W being the image height and width, respectively (both are 128 pixels in this case)
- Metric: the percentage of correct keypoint transfers (PCK)
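The PCK metric can be sketched as follows (a minimal NumPy version; the default α = 0.1 is a common choice for this metric and is my assumption, not a value stated in the notes):

```python
import numpy as np

def pck(pred_kp, gt_kp, H=128, W=128, alpha=0.1):
    """Percentage of correct keypoint transfers: a transfer counts as
    correct if the predicted location falls within alpha * max(H, W)
    pixels of the ground truth. pred_kp, gt_kp: (N, 2) arrays of
    (x, y) coordinates for keypoints annotated in both images."""
    thresh = alpha * max(H, W)  # 12.8 pixels for 128x128 images
    dists = np.linalg.norm(pred_kp - gt_kp, axis=1)
    return float(np.mean(dists <= thresh))
```

Keypoints without a ground-truth annotation in the target image would be excluded before calling this.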
Matchability prediction
- PASCAL-Part dataset (provides human-annotated part segment labels)