Color Constancy: <<A Multi-Hypothesis Approach to Color Constancy>>

Color Constancy

Published: 2020-10-13

Paper walkthrough

Paper: <<A Multi-Hypothesis Approach to Color Constancy>>

Title&Abstract&Conclusion

Multi-Hypothesis?

What are the hypotheses in "multi-hypothesis"?

Under the prevalent assumption that the scene is illuminated by a single or dominant light source, the observed pixels of an image are typically modelled using the physical model of Lambertian image formation captured under a trichromatic photosensor:
we assume that the color of the light and the surface reflectance are independent.
the function modelling the prior also depends on factors such as the environment (indoor / outdoor), the time of day, ISO etc. However, the size of currently available datasets prevent us from modelling more complex proxies.

Our likelihood estimator learns to answer a camera-agnostic question and thus enables effective multi-camera training by disentangling illuminant estimation from the supervised learning task.

learning from image samples that were captured by multiple cameras

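How exactly are camera-agnosticism and multi-camera training achieved?

Only the likelihood estimator is camera-agnostic. Each camera yields its own dataset; K-means is run on each dataset to find candidate illuminants, and all of them are fed to the network. That is how training on images from multiple cameras is achieved.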

Figure

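(d) use an illuminant candidate set per camera. [r/g, b/g]

Does each camera get its own candidate set? How is training finally done? How is the candidate set of each camera selected and partitioned?

Each camera has its own photo collection. The illuminants of those photos are clustered, so each camera ends up with one candidate set. Training presumably just feeds in all of these candidate sets.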
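The per-camera candidate selection can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `kmeans_candidates`, the camera names, the number of candidates and the random "ground-truth" illuminants are all assumptions made for the example, and the K-means here is a plain Lloyd iteration written in NumPy.

```python
import numpy as np

def kmeans_candidates(illuminants, n_candidates, n_iters=50, seed=0):
    """Cluster illuminants of shape (N, 3); return centroids of shape (n_candidates, 3)."""
    rng = np.random.default_rng(seed)
    illuminants = np.asarray(illuminants, dtype=float)
    # Initialise the centroids from distinct training illuminants.
    idx = rng.choice(len(illuminants), size=n_candidates, replace=False)
    centroids = illuminants[idx].copy()
    for _ in range(n_iters):
        # Assign every illuminant to its nearest centroid.
        dists = np.linalg.norm(illuminants[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_candidates):
            members = illuminants[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster becomes empty
                centroids[k] = members.mean(axis=0)
    return centroids

# Hypothetical stand-ins for the annotated illuminants of two cameras.
rng = np.random.default_rng(1)
gt_by_camera = {
    "camera_a": rng.uniform(0.2, 1.0, size=(200, 3)),
    "camera_b": rng.uniform(0.2, 1.0, size=(150, 3)),
}
# One candidate set per camera, as in panel (d) of the figure.
candidates_by_camera = {
    name: kmeans_candidates(gt, n_candidates=10) for name, gt in gt_by_camera.items()
}
```

Each camera keeps its own set of cluster centres; only the network that scores the corrected images is shared.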

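How should the [r/g, b/g] plot be read?

Each point on the plot should represent one illuminant. The ratios are presumably used to reduce three dimensions to two (?), and the plot shows the distribution of the illuminants.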
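A small sketch of how an RGB illuminant could be mapped to a point on such a plot. The helper name `rg_bg_chromaticity` is made up for this note. Dividing by the green channel discards overall brightness, which is why two illuminants that differ only in intensity land on the same point.

```python
import numpy as np

def rg_bg_chromaticity(illuminants):
    """Map RGB illuminants of shape (..., 3) to [r/g, b/g] points of shape (..., 2)."""
    illuminants = np.asarray(illuminants, dtype=float)
    r, g, b = illuminants[..., 0], illuminants[..., 1], illuminants[..., 2]
    return np.stack([r / g, b / g], axis=-1)

dim = rg_bg_chromaticity([[0.5, 1.0, 0.25]])
bright = rg_bg_chromaticity([[1.0, 2.0, 0.5]])  # same colour, twice the intensity
```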

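Given a photo to be corrected, how is it corrected? Do n candidate illuminants have to be generated every time? How are they generated?

The n candidate illuminants are generated beforehand. What the network learns is how much weight each candidate should get. For a photo to be corrected the procedure is therefore the same as in training: correct the image with every candidate illuminant, feed each corrected image to the network, and obtain the weights.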

Introduction

$$
\rho_{k}(X)=\int_{\Omega}E(\lambda)S(\lambda,X)C_{k}(\lambda)d\lambda\quad\quad\quad k\in\{R,G,B\}
$$

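Why integrate?

Take green as an example: the green channel response is produced jointly by a whole band of wavelengths (roughly 500 to 570 nm), not by a single wavelength, so the contributions have to be integrated over wavelength.
$$
\rho_{k}^E=\int_{\Omega}E(\lambda)C_{k}(\lambda)d\lambda\quad\quad\quad k\in\{R,G,B\}
$$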

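The goal of computational CC then becomes estimation of the global illumination color $\rho_k^E$.

Why does it take this form?

For the color of an imaged object, $S(\lambda,X)$ captures the influence of the object itself and $\rho_k^E$ captures the influence of the light source. From here on, "the illuminant" refers to $\rho_k^E$ as a whole.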
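The two integrals can be approximated numerically by sampling the spectra and summing. The sketch below uses made-up spectra (a tilted illuminant, Gaussian sensor sensitivities, a spectrally flat reflectance of 0.4); none of these values come from the paper. With a flat reflectance the pixel response is just the reflectance times $\rho_k^E$, which is the factorisation the note relies on.

```python
import numpy as np

wavelengths = np.linspace(400.0, 700.0, 301)  # nm, 1 nm spacing
step = wavelengths[1] - wavelengths[0]

def gaussian(center, width):
    return np.exp(-0.5 * ((wavelengths - center) / width) ** 2)

# Made-up spectra, for illustration only.
E = 1.0 + 0.002 * (wavelengths - 550.0)           # illuminant power E(lambda)
C = np.stack([gaussian(600.0, 30.0),              # C_R(lambda)
              gaussian(540.0, 30.0),              # C_G(lambda)
              gaussian(460.0, 30.0)])             # C_B(lambda)
S = np.full_like(wavelengths, 0.4)                # flat reflectance S(lambda, X)

# Riemann-sum approximations of Eq. (1) and Eq. (2).
rho = (E * S * C).sum(axis=1) * step              # pixel response rho_k(X)
rho_E = (E * C).sum(axis=1) * step                # illuminant colour rho_k^E
```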

due to the ill-posed nature of the problem, multiple illuminant solutions are often possible with varying probability.

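What does ill-posed mean?

A well-posed problem is one whose solution satisfies three requirements: (1) a solution exists; (2) the solution is unique; (3) the solution depends continuously on the problem data, i.e. it is stable. If any one of the three is violated, the problem is called ill-posed.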

avoid distribution shift and resulting domain gap problems [1, 41, 22], associated with camera specific training, and propose a well-founded strategy to leverage multiple data.

What are distribution shift and domain gap?

distribution shift: https://zh.d2l.ai/

domain gap problem:https://zhuanlan.zhihu.com/p/195704051

Principled combination of datasets is of high value for learning based color constancy given the typically small nature of individual color constancy datasets (on the order of only hundreds of images).

What is this sentence saying?

We provide a training-free model adaptation strategy for new cameras.

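When a new camera is added, how is the model adapted?

When a new camera is added, the trained network can be used directly as long as the candidate illuminants of that camera are known, so no retraining or fine-tuning is needed.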

Bayesian framework

They model the prior of the illuminant and the surface reflectance as a truncated multivariate normal distribution on the weights of a linear model

What is a truncated multivariate normal distribution on the weights of a linear model?

Truncated normal distribution: a normal distribution whose variable x is restricted to a range, for example 0 < x < 50.

Bayesian works [44, 23], discretise the illuminant space and model the surface reflectance priors by learning real world histogram frequencies;

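They discretise the illuminant space and model the surface reflectance prior by learning real-world histogram frequencies. It may be worth looking at how the histogram frequencies are learned, with a view to applying the idea to skin-tone grading.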

in [44] the prior is modelled as a uniform distribution over a subset of illuminants while [23] uses the empirical distribution of the training illuminants.

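For the illuminant prior, [44] and [23] take two different approaches: modelling it directly as a uniform distribution, and using the empirical distribution of the training illuminants.

Empirical distribution function: https://zh.wikipedia.org/zh-hans/%E7%BB%8F%E9%AA%8C%E5%88%86%E5%B8%83%E5%87%BD%E6%95%B0#:~:text=%E7%BB%8F%E9%AA%8C%E5%88%86%E5%B8%83%E5%87%BD%E6%95%B0%EF%BC%88%E8%8B%B1%E8%AA%9E%EF%BC%9Aempirical,%E6%A0%B7%E6%9C%AC%E6%89%80%E5%8D%A0%E7%9A%84%E6%AF%94%E4%BE%8B%E3%80%82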

Fully supervised methods

frame color constancy as a classification problem: CCC and FFCC, using a color space that identifies image re-illumination with a histogram shift.

CCC and FFCC: still to read.

Multi-device training

[37] affords fast adaptation to previously unseen cameras, and robustness to changes in capture device by leveraging annotated samples across different cameras and datasets in a meta-learning framework

meta-learning?

A recent approach [8], makes an assumption that sRGB images collected from the web are well white balanced, therefore, they apply a simple de-gamma correction to approximate an inverse tone mapping and then find achromatic pixels with a CNN to predict the illuminant.

de-gamma correction? inverse tone mapping?

Method

Let y = (yr, yg, yb) be a pixel from an input image Y in linear RGB space.

Linear RGB space?

https://www.cnblogs.com/guanzz/p/7416821.html

Gamma correction maps the linear color space to a non-linear one.
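As a concrete illustration of going between the two spaces, here is a sketch that uses the standard sRGB transfer function. That choice is an assumption for the example: the paper works on linear RGB data, and a real camera pipeline may apply a different tone curve. The helper names are made up for this note.

```python
import numpy as np

def srgb_to_linear(c):
    """Undo the sRGB gamma curve ("de-gamma"); c is in [0, 1]."""
    c = np.asarray(c, dtype=float)
    return np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)

def linear_to_srgb(c):
    """Apply the sRGB gamma curve to linear values in [0, 1]."""
    c = np.asarray(c, dtype=float)
    return np.where(c <= 0.0031308, c * 12.92, 1.055 * c ** (1 / 2.4) - 0.055)

encoded = np.linspace(0.0, 1.0, 11)
linear = srgb_to_linear(encoded)
round_trip = linear_to_srgb(linear)
```

The per-channel model $y_k=r_k\cdot l_k$ only makes sense on linear values, which is why the gamma curve has to be undone before an illuminant is divided out.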

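We model the global illumination, Eq. (2), with the standard linear model [51] such that each pixel y is the product of the surface reflectance r = (rr, rg, rb) and a global illuminant l = (lr, lg, lb) shared by all pixels such that
$$
y_k=r_k\cdot l_k\quad\quad k\in\{R,G,B\}
$$

The standard linear model?

Perhaps it simply means that multiplying the three functions together yields a linear model?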

we propose to frame the CC problem with a probabilistic generative model with unknown surface reflectances and illuminant

Probabilistic generative model?

Derivation

$$
P(l|Y)=\frac{P(Y|l)P(l)}{P(Y)}
$$

$$
P(Y|l)=\int_rP(Y|l,R=r)P(R=r)dr
$$

Eq. (4) uses the law of total probability.
$$
\int_rP(Y|l,R=r)P(R=r)dr=P(R=diag(l)^{-1}Y)
$$
Eq. (5): since $y_k=r_k\cdot l_k\quad\quad k\in\{R,G,B\}$, Y can be generated if and only if $R=diag(l)^{-1}Y$. Hence $P(Y|l,R=diag(l)^{-1}Y)=1$ and $P(Y|l,R=r)=0$ for every other $r$, so only the single term $P(R=diag(l)^{-1}Y)$ remains.

We highlight that learned affine transformation parameters are training camera-dependent and provide further discussion on camera agnostic considerations in Section

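Why are these parameters camera-dependent?

Because $B_l$ is the estimate of the illuminant prior, and by Eq. (2) the global illuminant is determined by both the illuminant power $E(\lambda)$ and the sensor response function $C_k(\lambda)$. The prior is therefore camera-dependent.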

In order to estimate the illuminant l*, we optimise the quadratic cost (minimum MSE Bayesian estimator), minimised by the mean of the posterior distribution:
$$
l^*=\int_l l\cdot P(l|Y)dl
$$

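Why this formula?

We now have n candidate illuminants $l_1,l_2,\cdots,l_n$ and n probabilities $p_1,p_2,\cdots,p_n$. How do we determine the optimal illuminant $l^*$? The paper simply minimises the MSE, and the MSE is minimal when $l^*$ is the expectation. If you have forgotten why, write out the quadratic and differentiate!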
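For completeness, a short worked version of that quadratic argument, written for the discrete case with $p_i=P(l_i|Y)$ and assuming the $p_i$ sum to one. The symbols $J$ and $\hat l$ are introduced here only for the derivation.

$$
\begin{aligned}
J(\hat l)&=\sum_{i=1}^n p_i\,\lVert \hat l-l_i\rVert^2\\
\frac{\partial J}{\partial \hat l}&=2\sum_{i=1}^n p_i(\hat l-l_i)=2\hat l-2\sum_{i=1}^n p_i\,l_i=0\\
\Rightarrow\quad l^*&=\sum_{i=1}^n p_i\,l_i
\end{aligned}
$$

Since $J$ is a convex quadratic in $\hat l$, this stationary point is the minimum, and it is exactly the posterior mean.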

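We require a differentiable method in order to train our model end-to-end, and therefore the use of a simple Maximum a Posteriori (MAP) inference strategy is not possible. Therefore to estimate the illuminant l*, we use the minimum mean square error Bayesian estimator, which is minimised by the posterior mean of l (c.f. Eq. (6))

Why does MAP not work?

Back-propagation needs gradients. MAP picks the single candidate with the highest posterior, i.e. it takes an argmax over the network outputs, and an argmax gives no useful gradient with respect to those outputs, so the loss could not be propagated back into the network. We therefore need a way of obtaining $l^*$ that stays differentiable in the network outputs. The simplest choice is to minimise the MSE, which makes $l^*$ the expectation over the candidate illuminants.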
$$
\begin{aligned}
l^*&=\sum_{i=1}^n l_i\cdot softmax(log(P(l_i|Y)))\\
&=\frac{1}{\sum_{j=1}^n e^{log(P(l_j|Y))}}\sum_{i=1}^n l_i\cdot e^{log(P(l_i|Y))}\\
&=\frac{1}{\sum_{j=1}^n P(l_j|Y)}\sum_{i=1}^n l_i\cdot P(l_i|Y)
\end{aligned}
$$

The resulting vector $l^*$ is l2-normalised.
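The softmax-weighted combination followed by the L2 normalisation can be sketched in a few lines. The candidate values and log-posteriors below are invented for illustration, and `posterior_mean_illuminant` is a name made up for this note. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the softmax result.

```python
import numpy as np

def posterior_mean_illuminant(candidates, log_posteriors):
    """candidates: (n, 3); log_posteriors: (n,), possibly unnormalised."""
    candidates = np.asarray(candidates, dtype=float)
    log_posteriors = np.asarray(log_posteriors, dtype=float)
    # Softmax over the candidates (shifted by the max for stability).
    weights = np.exp(log_posteriors - log_posteriors.max())
    weights /= weights.sum()
    # Posterior mean: probability-weighted sum of the candidates.
    l_star = weights @ candidates
    # L2-normalise so that only the colour, not the magnitude, remains.
    return l_star / np.linalg.norm(l_star), weights

candidates = np.array([[0.8, 1.0, 0.6],
                       [0.5, 1.0, 0.9],
                       [0.6, 1.0, 0.7]])
log_post = np.log(np.array([0.7, 0.2, 0.1]))
l_star, weights = posterior_mean_illuminant(candidates, log_post)
```

Every step is differentiable with respect to the log-posteriors, which is the property the MAP argmax lacks.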

l2-normalised?

https://blog.cweihang.io/ml/trick/l2_normalize

???

Results

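Gehler-Shi dataset: known to have inconsistent ground truth. Two cameras, Canon 1D and Canon 5D. A mix of indoor and outdoor scenes. Images are stored in Canon RAW format, TIFF versions are also provided, and the coordinates of the color checker are included. Because the bundled software applies non-linear processing, Dcraw is used for the conversion to TIFF; only the two G values of the RGGB pattern are averaged, no demosaicing is applied, and the data is 12-bit.

NUS: 8 cameras.

Cube+: Canon 550D, mostly outdoor.

NUS and Gehler-Shi both use 3-fold cross validation with the splits provided by previous work. Cube+ provides no splits, so all of its images are used for training and the challenge dataset is used for testing; the results are also compared with the challenge results.

For NUS a multi-camera setting is added, with splits made by the authors themselves.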

Trimean?

The trimean:
$$
TM=\frac{Q_1+2Q_2+Q_3}{4}
$$
Q1 and Q3 are the first and third quartiles of the data, and Q2 is the median.
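A quick sketch of the trimean in NumPy. The error values are invented, and `np.percentile` with its default linear interpolation is just one of several quartile conventions, so exact values on real data may differ slightly from other implementations.

```python
import numpy as np

def trimean(values):
    """Trimean: (Q1 + 2 * Q2 + Q3) / 4."""
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    return (q1 + 2 * q2 + q3) / 4

errors = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
errors_with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

tm_plain = trimean(errors)
tm_outlier = trimean(errors_with_outlier)
mean_outlier = errors_with_outlier.mean()
```

The single large value moves the mean a lot but leaves the trimean unchanged in this example, which is why it is a popular summary of per-image errors.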

Appendix

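1x1 Conv?

Also known from Network in Network. Together with the activation that follows it, it adds a non-linear operation, and it can be used to reduce or increase the number of channels.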
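A 1x1 convolution is the same small linear map applied at every pixel, so it can be written as a matrix product over the channel axis. The sketch below is a NumPy illustration with made-up shapes and a ReLU as the non-linearity; it is not the layer configuration used in the paper.

```python
import numpy as np

def conv1x1(feature_map, weight, bias):
    """feature_map: (H, W, C_in); weight: (C_in, C_out); bias: (C_out,)."""
    out = feature_map @ weight + bias   # identical channel mixing at every pixel
    return np.maximum(out, 0.0)         # ReLU supplies the non-linearity

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5, 8))          # 8 input channels
w = rng.normal(size=(8, 2))             # compress to 2 output channels
b = np.zeros(2)
y = conv1x1(x, w, b)
```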

Towards reproducibility, and fair comparison, our supplementary material provides the cross validation splits, used in the main paper, for multi-device training

Cross validation?

Cross validation: https://zhuanlan.zhihu.com/p/24825503
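A minimal sketch of how k-fold splits can be generated. Note the assumption: the paper uses splits provided by previous work or its supplementary material, whereas this example simply shuffles indices at random. `k_fold_indices` is a name made up for this note.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=3, seed=0):
    """Return a list of (train_idx, test_idx) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_folds)
    splits = []
    for i in range(n_folds):
        test_idx = folds[i]
        # Everything outside the held-out fold is used for training.
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        splits.append((train_idx, test_idx))
    return splits

splits = k_fold_indices(10, n_folds=3)
```

Each image is held out exactly once, and the reported error is aggregated over the folds.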

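My own understanding

Contributions

Provides multiple possibilities for the best illuminant
Uses classification rather than regression
The network is designed to be camera-agnostic, so it can be trained on multi-device datasets and generalises well to new devices

Multiple hypotheses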

Under the prevalent assumption that the scene is illuminated by a single or dominant light source, the observed pixels of an image are typically modelled using the physical model of Lambertian image formation captured under a trichromatic photosensor:
we assume that the color of the light and the surface reflectance are independent.
the function modelling the prior also depends on factors such as the environment (indoor / outdoor), the time of day, ISO etc. However, the size of currently available datasets prevent us from modelling more complex proxies.

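Network architecture

The image formation formula can be simplified; the original expression becomes $S(X)\int_\Omega E(\lambda)C_k(\lambda)d\lambda\quad k\in\{R,G,B\}$

Two justifications for the simplification

1: The camera's R, G, B spectral sensitivity functions are Dirac delta functions, i.e. each of the three channels responds to a single wavelength only.

2: The spectra sensed by R, G and B form a partition of the visible spectrum, where the support $Sup(Rc)$ denotes the spectrum that Rc can sense. Within each support the reflectance is assumed to be independent of wavelength.

We write the formula in the following form: $y_k=r_k\cdot l_k \quad k\in\{R,G,B\}$

One quantity is known, $y_k$, which is the photo we already have. Two quantities are unknown, $r_k$ and $l_k$, which are the contributions of the object and of the illumination to the image.

One known and two unknowns: the problem is under-constrained.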

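Classic assumptions of the Lambertian model:

1: For a fixed scene shot by a fixed camera, object colors can only change through a change of illumination.

2: The intrinsic reflectance image can be obtained by filtering out the illumination color.

Filtering out the illumination color means dividing by $l_k$, i.e. $r_k=\frac{y_k}{l_k}$. Once $r_k$ is obtained, multiplying it by a canonical illuminant gives the white-balanced image. So all we need to do is estimate the illuminant $l_k$.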
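The division can be sketched directly in NumPy. The tiny grey "scene" and the illuminant values are invented, and the canonical illuminant is assumed to be white, (1, 1, 1), so the final multiplication is a no-op and is left out.

```python
import numpy as np

def white_balance(image, illuminant):
    """Divide every pixel of a linear RGB image (H, W, 3) by the illuminant (3,)."""
    illuminant = np.asarray(illuminant, dtype=float)
    return np.asarray(image, dtype=float) / illuminant

reflectance = np.full((2, 2, 3), 0.5)      # a grey surface
l = np.array([0.9, 1.0, 0.6])              # warm-looking illuminant
observed = reflectance * l                 # y_k = r_k * l_k
recovered = white_balance(observed, l)     # r_k = y_k / l_k
```

With the true illuminant the grey surface comes back grey; with a wrong estimate a colour cast remains, which is why the whole problem reduces to estimating $l_k$.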

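What used to confuse me was $l_k=E\cdot C_k$: can the factors be reordered? I later realised that when we actually correct colors we process each pixel separately, so every E and C is a scalar and the factors can of course be reordered.

Earlier regression methods use a network to learn $l_k$ directly, which gives a point estimate. But color constancy is ill-posed: several values of $l_k$ may be consistent with the image, each with a different probability.

So the authors' idea is to cluster the illuminants of the image dataset with K-means. The resulting cluster centres are the candidate illuminants, i.e. the multiple possibilities, which addresses the single-point-estimate limitation of the regression methods above. The concrete procedure is shown in the figure below: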

$$
P(l|Y)=\frac{P(Y|l)P(l)}{P(Y)}
$$

$$
P(Y|l)=\int_rP(Y|l,R=r)P(R=r)dr=P(R=diag(l)^{-1}Y)
$$

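The first step applies the law of total probability.

The second step: since $y_k=r_k\cdot l_k\quad\quad k\in\{R,G,B\}$, Y can be generated if and only if $R=diag(l)^{-1}Y$. Hence $P(Y|l,R=diag(l)^{-1}Y)=1$ and $P(Y|l,R=r)=0$ for every other $r$, so only the single term $P(R=diag(l)^{-1}Y)$ remains.

So, writing our CNN as $f^W$, we have $log(P(Y|l))=log(P(R=diag(l)^{-1}Y))=f^W(diag(l)^{-1}Y)$, i.e. the log-likelihood that each candidate illuminant is the scene illuminant.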
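The scoring loop can be sketched as follows. The important caveat: `toy_log_likelihood` is a hand-written stand-in for $f^W$ that simply rewards corrected images whose mean colour is grey; it is not the paper's CNN, and the scene, candidates and scale factor are invented so that the example is self-contained.

```python
import numpy as np

def toy_log_likelihood(corrected_image):
    """Hypothetical stand-in for f^W: higher when the mean colour is closer to grey."""
    mean_rgb = corrected_image.reshape(-1, 3).mean(axis=0)
    chroma = mean_rgb / mean_rgb.sum()
    return -100.0 * np.sum((chroma - 1.0 / 3.0) ** 2)

def score_candidates(image, candidates, log_likelihood_fn):
    # diag(l)^{-1} Y is a per-channel division of every pixel by l.
    return np.array([log_likelihood_fn(image / l) for l in candidates])

# Synthetic scene whose three channels share the same mean reflectance.
rng = np.random.default_rng(0)
base = rng.uniform(0.1, 0.9, size=(16, 16))
reflectance = np.stack([base, base[::-1], base.T], axis=-1)

true_l = np.array([0.9, 1.0, 0.5])
image = reflectance * true_l
candidates = np.array([[0.9, 1.0, 0.5],
                       [0.5, 1.0, 0.9],
                       [1.0, 1.0, 1.0]])
scores = score_candidates(image, candidates, toy_log_likelihood)
best = int(np.argmax(scores))  # for inspection only; training uses the soft combination
```

The same network weights score every candidate, so the only camera-specific ingredient in this step is the candidate list itself.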

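In addition, in real scenes different candidate illuminants occur with different probabilities, i.e. $P(l)$ differs between candidates. For this reason two parameters $G_l$ and $B_l$ are added: a gain and $log(P(l))$ respectively.
$$
log(P(l|Y))=G_l\cdot log(P(Y|l))+B_l
$$
But introducing these two parameters causes a problem!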

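Without $log(P(l))$ our network is camera-agnostic. Why camera-agnostic?

Each camera yields its own dataset; K-means is run on each dataset to find candidate illuminants, and all of them are fed to the network, which is how multi-camera training is achieved. Suppose we now add a new camera and obtain a dataset for it. All we have to do is run K-means on that camera's illuminants. At test time an image is then corrected with the new candidate illuminants in addition to the earlier ones, and each corrected image yields a probability. No retraining or fine-tuning is needed.

Once $log(P(l))$ is introduced, the earlier formula shows that the global illuminant is determined by the illuminant power and the sensor response function, so camera dependence inevitably comes in.

So for multi-device training these two parameters are left out. This reduces flexibility, since there are two fewer learnable parameters, but more datasets become usable and the larger amount of data compensates.

We now have n candidate illuminants $l_1,l_2,\cdots,l_n$ and n probabilities $p_1,p_2,\cdots,p_n$. How do we determine the optimal illuminant $l^*$?

First, a simple MAP does not work. Back-propagation needs gradients, and MAP takes an argmax over the network outputs, which gives no useful gradient with respect to those outputs.

So we need a way of obtaining $l^*$ that stays differentiable in the network outputs. The paper obtains $l^*$ as a simple linear combination of the candidates that minimises the MSE, and the MSE is minimal when $l^*$ is the expectation. If you have forgotten why, write out the quadratic and differentiate!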
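$$
\begin{aligned}
l^*&=\sum_{i=1}^n l_i\cdot softmax(log(P(l_i|Y)))\\
&=\frac{1}{\sum_{j=1}^n e^{log(P(l_j|Y))}}\sum_{i=1}^n l_i\cdot e^{log(P(l_i|Y))}\\
&=\frac{1}{\sum_{j=1}^n P(l_j|Y)}\sum_{i=1}^n l_i\cdot P(l_i|Y)
\end{aligned}
$$
The softmax in the expression above normalises the probabilities so that they sum to one.

CNN structure: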