1 Introduction and Motivation
Supervised learning algorithms often make the implicit assumption that the test data is drawn from the same distribution as the training data. These algorithms become ineffective when such assumptions regarding the test data are violated. Transfer learning
techniques are applied to address these kinds of problems. Transfer learning involves extracting knowledge from one or more tasks or domains and utilizing (transferring) that knowledge to design a solution for a new task or domain
[2]. Domain adaptation (DA) is a special case of transfer learning where we handle data from different, yet correlated, distributions. DA techniques transfer knowledge from the source domain (distribution) to the target domain (distribution), in the form of learned models and efficient feature representations, to learn effective classifiers on the target domain. In this work we consider the problem of supervised DA, where we use labeled samples from the source domain along with a limited number of labeled samples from the target domain to learn a classifier for the target domain. We propose a Coupled linear Support Vector Machine (CSVM) model that simultaneously estimates linear SVM decision boundaries $\mathbf{w}_s$ and $\mathbf{w}_t$ for the source and target training data respectively.
Using a technique termed instance matching, researchers sample source data points such that the difference between the means of the sampled source and target data is minimized [4], [10]. Our intuition behind the CSVM is along similar lines: we penalize the difference between $\mathbf{w}_s$ and $\mathbf{w}_t$. Since the SVM decision boundaries are linear combinations of the data points, penalizing the difference between $\mathbf{w}_s$ and $\mathbf{w}_t$ can be viewed as penalizing the difference between the weighted means of the source and target data points. Figure 1(a) illustrates standard SVM-based DA, where $\mathbf{w}_s$ is first learned on the source and is subsequently perturbed to obtain the target boundary $\mathbf{w}_t$. The perturbed SVM could be very different from $\mathbf{w}_s$ and can overfit the target training data. Figure 1(b) depicts the CSVM, where $\mathbf{w}_s$ and $\mathbf{w}_t$ are learned simultaneously. The source SVM $\mathbf{w}_s$ provides an anchor for the target SVM $\mathbf{w}_t$, and the difference between $\mathbf{w}_s$ and $\mathbf{w}_t$ is modeled based on the difference between the source and target domains. In addition, the CSVM trades training error for generalization, as illustrated in Figure 1(c). In this paper, we formulate a coupled SVM problem to estimate $\mathbf{w}_s$ and $\mathbf{w}_t$ and reduce it to a single SVM problem that can be solved with standard quadratic optimization. We test our model and report recognition accuracies on datasets of objects, handwritten digits, facial expressions and activities.
2 Related Work and Our Method
In this section we discuss some of the SVM-based DA techniques closely related to the CSVM. Support Vector Machines have been used extensively for DA in the past. Daumé III [3] modeled augmented features with a heuristic kernel. Bruzzone and Marconcini [2] proposed an unsupervised method (DASVM) to iteratively adapt an SVM learned on the source domain to the unlabeled data in the target domain. AdaptSVM is another technique closely related to our method, where Yang et al. [17] and Li [9] learn an SVM on the target by minimizing the classification error on the target data while also reducing the discrepancy between the source and target SVMs. We differ from this method by learning the source and target SVMs simultaneously. Aytar and Zisserman [1] extend this framework to the Projective Model Transfer SVM, which relaxes the transfer induced by the AdaptSVM. Hoffman et al. (MMDT) [6] learn a single SVM model for the source and the transformed target data; the target data is transformed by a transformation matrix that is learned in an optimization framework along with the SVM. Duan et al. (AMKL) [4] implement a multiple kernel method where multiple base kernel classifiers are combined with a pre-learned average classifier obtained by fusing multiple nonlinear SVMs. Widmer et al. [16] use a similar approach to solve multitask problems, using graph Laplacians to model task similarity; in the CSVM, by contrast, the similarity between source and target is learned by the model. We believe the CSVM holds a unique position in this wide array of SVM solutions for DA: it trains a linear SVM for both the source and target domains simultaneously, thereby minimizing the chances of overfitting, especially when there are very few labeled samples from the target domain.
3 Problem Specification
We outline the problem as follows. Let $\mathcal{D}_s = \{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$, where $\mathcal{D}_s$ is the source domain, $\mathbf{x}_i^s \in \mathbb{R}^d$ are data points and $y_i^s \in \{-1, +1\}$ are their labels. Along similar lines, $\mathcal{D}_t = \{(\mathbf{x}_j^t, y_j^t)\}_{j=1}^{n_t}$, where $\mathcal{D}_t$ is the target domain.
3.1 Coupled SVM Model
The goal is to learn a target classifier $\mathbf{w}_t$ that generalizes to a larger subset of $\mathcal{D}_t$ and does not overfit the target training data. The catch here is that the number of labeled target data points is small, with $n_t \ll n_s$. We therefore include the source data and learn the source classifier $\mathbf{w}_s$ to provide an anchor point for $\mathbf{w}_t$. The source and target SVM decision boundaries are $\mathbf{w}_s^\top\mathbf{x} + b_s = 0$ and $\mathbf{w}_t^\top\mathbf{x} + b_t = 0$ respectively. To simplify notation we redefine $\mathbf{w}_s := [\mathbf{w}_s^\top, b_s]^\top$ and $\mathbf{w}_t := [\mathbf{w}_t^\top, b_t]^\top$, and account for the bias by redefining $\mathbf{x}_i^s := [\mathbf{x}_i^{s\top}, 1]^\top$ and $\mathbf{x}_j^t := [\mathbf{x}_j^{t\top}, 1]^\top$. Incorporating these definitions, the Coupled SVM can be detailed as follows,

$$\min_{\mathbf{w}_s, \mathbf{w}_t, \xi} \;\; \frac{\lambda}{2}\|\mathbf{w}_s - \mathbf{w}_t\|^2 + \frac{1}{2}\|\mathbf{w}_s\|^2 + \frac{1}{2}\|\mathbf{w}_t\|^2 + C_s\sum_{i=1}^{n_s}\xi_i^s + C_t\sum_{j=1}^{n_t}\xi_j^t \quad (1)$$
$$\text{s.t.} \quad y_i^s\,\mathbf{w}_s^\top\mathbf{x}_i^s \geq 1 - \xi_i^s, \;\; \xi_i^s \geq 0 \;\; \forall i, \qquad y_j^t\,\mathbf{w}_t^\top\mathbf{x}_j^t \geq 1 - \xi_j^t, \;\; \xi_j^t \geq 0 \;\; \forall j.$$
Equation (1) is a variation of a standard linear SVM with two decision boundaries and an additional term relating the two boundaries. The first term captures the similarity (or dissimilarity) between the source and target domains as the difference between the decision boundaries; $\lambda$ controls the importance of this difference. The second and third terms are the SVM regularizers. The fourth and fifth terms capture the training loss, where $C_s$ and $C_t$ control the importance of the source and target misclassification respectively.
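To make the roles of the five terms concrete, the primal objective can be evaluated directly. The following is a minimal NumPy sketch on assumed toy data (the function name `csvm_objective` and the data are ours, not the authors' implementation), using the hinge-loss form of the slack penalties:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data; the bias is absorbed by appending a constant-1 feature.
Xs = np.hstack([rng.normal(size=(6, 2)), np.ones((6, 1))])        # source points
ys = np.array([1, 1, 1, -1, -1, -1])
Xt = np.hstack([rng.normal(size=(4, 2)) + 0.5, np.ones((4, 1))])  # shifted target points
yt = np.array([1, 1, -1, -1])

def csvm_objective(ws, wt, lam=1.0, Cs=1.0, Ct=1.0):
    """Primal CSVM value: coupling term + regularizers + hinge losses."""
    coupling = 0.5 * lam * np.sum((ws - wt) ** 2)        # penalizes ws far from wt
    reg = 0.5 * np.sum(ws ** 2) + 0.5 * np.sum(wt ** 2)  # the two SVM regularizers
    loss_s = Cs * np.sum(np.maximum(0.0, 1.0 - ys * (Xs @ ws)))  # source hinge loss
    loss_t = Ct * np.sum(np.maximum(0.0, 1.0 - yt * (Xt @ wt)))  # target hinge loss
    return coupling + reg + loss_s + loss_t

w = rng.normal(size=3)
# With identical boundaries the coupling term vanishes; with different
# boundaries, a larger lam pulls the objective up.
identical = csvm_objective(w, w, lam=5.0)
different = csvm_objective(w, w + 2.0, lam=5.0)
```

Minimizing this value jointly over the two boundaries is what Equation (1) formalizes, with the hinge losses expressed through the slack variables and margin constraints.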
3.2 Solution
To simplify notation, we define a new set of variables based on the earlier ones. We concatenate the two SVM boundaries into a single variable, defined as $\mathbf{w} := [\mathbf{w}_s^\top, \mathbf{w}_t^\top]^\top$. The individual SVMs $\mathbf{w}_s$ and $\mathbf{w}_t$ can be extracted from $\mathbf{w}$ using permutation matrices $P_s$ and $P_t$, where $P_s$ and $P_t$ are binary matrices such that $\mathbf{w}_s = P_s\mathbf{w}$ and $\mathbf{w}_t = P_t\mathbf{w}$. For example, let $\mathbf{w}_s = [w^s_1, w^s_2]^\top$ and $\mathbf{w}_t = [w^t_1, w^t_2]^\top$, so that $\mathbf{w} = [w^s_1, w^s_2, w^t_1, w^t_2]^\top$. Then the permutation matrices $P_s$ and $P_t$ such that $\mathbf{w}_s = P_s\mathbf{w}$ and $\mathbf{w}_t = P_t\mathbf{w}$ are given by

$$P_s = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \qquad P_t = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

We also define new data points $\hat{\mathbf{x}}$, where $\hat{\mathbf{x}}_i^s$ and $\hat{\mathbf{x}}_j^t$ are the new (zero-padded) data points and

$$\hat{\mathbf{x}}_i^s = [\mathbf{x}_i^{s\top}, \mathbf{0}^\top]^\top, \quad (2)$$

where $\mathbf{0}$ is a vector of zeros. Similarly,

$$\hat{\mathbf{x}}_j^t = [\mathbf{0}^\top, \mathbf{x}_j^{t\top}]^\top. \quad (3)$$
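This bookkeeping is easy to verify mechanically. A short NumPy sketch (with an assumed per-domain dimension; the variable names are ours) checks that the selector matrices and the zero-padded points of Equations (2) and (3) make the stacked boundary behave exactly like the per-domain SVMs:

```python
import numpy as np

d = 3  # assumed per-domain dimension (bias already absorbed)

# Binary selector matrices: ws = Ps @ w and wt = Pt @ w for w = [ws; wt].
Ps = np.hstack([np.eye(d), np.zeros((d, d))])
Pt = np.hstack([np.zeros((d, d)), np.eye(d)])

def augment(x, domain):
    """Pad a point with zeros in the other domain's slot (Equations (2)-(3))."""
    z = np.zeros(d)
    return np.concatenate([x, z]) if domain == "source" else np.concatenate([z, x])

rng = np.random.default_rng(1)
ws, wt = rng.normal(size=d), rng.normal(size=d)
w = np.concatenate([ws, wt])   # stacked boundary
x = rng.normal(size=d)

# The augmented point reproduces the per-domain decision value:
same_s = np.isclose(w @ augment(x, "source"), ws @ x)
same_t = np.isclose(w @ augment(x, "target"), wt @ x)
```

This identity is what allows the two-boundary problem to be rewritten over the single stacked variable in the reformulation that follows.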
For ease of derivation, we consider the linearly separable SVM and drop the slack variables $\xi$ (we will reintroduce them later). Pooling the augmented source and target points into a single training set $\{(\hat{\mathbf{x}}_k, y_k)\}_{k=1}^{n_s+n_t}$, and noting that $\mathbf{w}^\top\hat{\mathbf{x}}_i^s = \mathbf{w}_s^\top\mathbf{x}_i^s$ and $\mathbf{w}^\top\hat{\mathbf{x}}_j^t = \mathbf{w}_t^\top\mathbf{x}_j^t$, the minimization problem in Equation (1) can now be reformulated as,

$$\min_{\mathbf{w}} \;\; \frac{1}{2}\mathbf{w}^\top\Omega\,\mathbf{w} \qquad \text{s.t.} \quad y_k\,\mathbf{w}^\top\hat{\mathbf{x}}_k \geq 1, \quad k = 1, \ldots, n_s + n_t, \quad (4)$$

where we have defined $\Omega := I + \lambda(P_s - P_t)^\top(P_s - P_t)$ and used $\mathbf{w}_s - \mathbf{w}_t = (P_s - P_t)\mathbf{w}$ for the first term. For the second term, we have used $P_s^\top P_s + P_t^\top P_t = I$. We introduce Lagrangian variables $\alpha_k \geq 0$ to solve the problem,

$$L(\mathbf{w}, \alpha) = \frac{1}{2}\mathbf{w}^\top\Omega\,\mathbf{w} - \sum_{k=1}^{n_s+n_t} \alpha_k\left(y_k\,\mathbf{w}^\top\hat{\mathbf{x}}_k - 1\right). \quad (5)$$
We need to minimize the Lagrangian w.r.t. $\mathbf{w}$ and maximize it w.r.t. $\alpha$. We optimize first w.r.t. $\mathbf{w}$ by setting the derivative to zero, and get $\mathbf{w} = \Omega^{-1}\sum_k \alpha_k y_k \hat{\mathbf{x}}_k$, where $\Omega = I + \lambda(P_s - P_t)^\top(P_s - P_t)$ and $I$ is an identity matrix. By the nature of our permutation matrices $P_s$ and $P_t$, $\Omega$ is full rank and therefore $\Omega^{-1}$ exists. We substitute this expression for $\mathbf{w}$ in Equation (5) to arrive at the SVM dual form, which we need to maximize,

$$\max_{\alpha \geq 0} \;\; \mathbf{1}^\top\alpha - \frac{1}{2}\alpha^\top Q\,\alpha. \quad (6)$$

Equation (6) is the standard SVM dual, where $Q_{kl} = y_k y_l\,\hat{\mathbf{x}}_k^\top\Omega^{-1}\hat{\mathbf{x}}_l$ and $\mathbf{1}$ is a vector of ones. To use any of the standard SVM libraries, we can set $\tilde{\mathbf{x}}_k := \Omega^{-1/2}\hat{\mathbf{x}}_k$. Then $Q_{kl} = y_k y_l\,\tilde{\mathbf{x}}_k^\top\tilde{\mathbf{x}}_l$. The decision boundary in the space of $\tilde{\mathbf{x}}$ is given by $\tilde{\mathbf{w}} = \sum_k \alpha_k y_k \tilde{\mathbf{x}}_k$. The decision boundary in the space of $\hat{\mathbf{x}}$ is given by $\mathbf{w} = \Omega^{-1}\sum_k \alpha_k y_k \hat{\mathbf{x}}_k$. Therefore, $\mathbf{w} = \Omega^{-1/2}\tilde{\mathbf{w}}$. We reintroduce the slack variables through the box constraints $0 \leq \alpha_k \leq C_s$ for source points and $0 \leq \alpha_k \leq C_t$ for target points. We can easily extend the algorithm to the multiclass setting using one-vs-one or one-vs-all strategies. Once $\mathbf{w}$ is estimated, $\mathbf{w}_s = P_s\mathbf{w}$ and $\mathbf{w}_t = P_t\mathbf{w}$ give the source and target SVMs.
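The full reduction can be prototyped end to end. The sketch below assumes toy data and substitutes a plain subgradient solver for the standard SVM library mentioned above; all names, dimensions and hyperparameter values are ours, not the authors':

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3      # per-domain dimension (bias absorbed via a constant-1 feature)
lam = 1.0  # coupling weight between the source and target boundaries

# Selector matrices for the stacked variable w = [ws; wt].
Ps = np.hstack([np.eye(d), np.zeros((d, d))])
Pt = np.hstack([np.zeros((d, d)), np.eye(d)])

# Omega = I + lam * (Ps - Pt)^T (Ps - Pt) is full rank, so Omega^{-1/2} exists.
Omega = np.eye(2 * d) + lam * (Ps - Pt).T @ (Ps - Pt)
evals, evecs = np.linalg.eigh(Omega)
Omega_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T  # symmetric inverse sqrt

# Toy source/target data, both labeled by the sign of the first coordinate.
Xs = np.hstack([rng.normal(size=(20, 2)), np.ones((20, 1))])
ys = np.sign(Xs[:, 0] + 0.1)
Xt = np.hstack([rng.normal(size=(5, 2)) + 0.3, np.ones((5, 1))])
yt = np.sign(Xt[:, 0] + 0.1)

# Zero-pad the points, pool them, and map into the space where the problem
# becomes a *standard* linear SVM: x_tilde = Omega^{-1/2} x_hat.
Xhat = np.vstack([np.hstack([Xs, np.zeros_like(Xs)]),
                  np.hstack([np.zeros_like(Xt), Xt])])
y = np.concatenate([ys, yt])
Xtil = Xhat @ Omega_inv_sqrt  # Omega^{-1/2} is symmetric

# Stand-in for an off-the-shelf solver: subgradient descent on the hinge primal.
C = 1.0
wtil = np.zeros(2 * d)
for t in range(2000):
    viol = y * (Xtil @ wtil) < 1                       # margin violators
    grad = wtil - C * (y[viol][:, None] * Xtil[viol]).sum(axis=0)
    wtil -= grad / (1.0 + t)                           # decreasing step size

w = Omega_inv_sqrt @ wtil   # back to the original space: w = Omega^{-1/2} w_tilde
ws, wt = Ps @ w, Pt @ w     # recover the source and target boundaries
```

For real experiments one would instead hand the transformed data and labels to a standard linear SVM package, applying the per-domain box constraints on the dual variables described above.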
4 Experiments
In this section we discuss the extensive experiments we conducted to study the CSVM model. We first outline the different datasets and their domains. We then outline the DA algorithms we compare against. Finally, we report the experimental details and our results.
4.1 Data Preparation
For our experiments, we consider multiple datasets from different applications and also test the CSVM with different kinds of features. For all the experiments (except Office-Caltech) we use the following setting. For the training data, we sample a fixed number of examples per category from the source domain and a fixed number per category from the target domain. The test data is the remaining examples in the target domain not used for training.
Office-Caltech datasets: For this experiment we borrow the dataset and the experimental setup outlined in [5]. The Office dataset consists of three domains, Amazon, Dslr and Webcam. The Caltech-256 dataset provides one domain, Caltech. All the domains share a set of common categories, viz. {backpack, bike, calculator, headphones, computer keyboard, laptop, monitor, computer mouse, coffee mug, video projector}. We use the SURF-BoW features provided by [5] for our experiments and follow the experimental setup outlined there. For the training data, we sample examples from the source domain (with a different number for Amazon) and examples from the target domain.
MNIST-USPS datasets: The MNIST and USPS datasets are benchmark datasets for handwritten digit recognition. These datasets contain gray scale images of digits from 0 to 9. For our experiments, we have considered a subset of these datasets based on [10]. We refer to these domains as MNIST and USPS respectively. The images are resized to 16x16 pixels and represented as vectors of length 256.
CKPlus-MMI dataset: The CKPlus [11] and MMI [12] are benchmark facial expression recognition datasets. We select 6 categories, viz. {anger, disgust, fear, happy, sad, surprise}, from the frames with the most intense expression (peak frames) in every facial expression video sequence, to get around 1500 images for each dataset, with around 250 images per category. We refer to these domains as CKPlus and MMI. We extract deep convolutional neural network based generic features, which have shown astounding results across multiple applications [13]. We therefore use the 'off-the-shelf' feature extractor developed by Simonyan and Zisserman [15]: the output of the first fully connected layer of the 16-weight-layer model gives 4096-dimensional features, which we reduce to 100 dimensions using PCA.
HMDB51-UCF50 dataset: We pooled the common activity categories of HMDB51 [8] and UCF50 [14]. The categories from UCF50 are {BaseballPitch (throw), Basketball (shoot_ball), Biking (ride_bike), Diving (dive), Fencing (fencing), GolfSwing (golf), HorseRiding (ride_horse), PullUps (pullup), PushUps (pushup), Punch (punch), WalkingWithDog (walk)}, where the corresponding HMDB51 category names are in parentheses. We refer to these domains as HMDB51 and UCF50. We extract state-of-the-art HOG, HOF, MBHx and MBHy descriptors from the videos according to [7], pool the descriptors over a single 1x1x1 grid, and estimate Fisher Vectors, whose dimensionality we then reduce with PCA.
Table 1: Recognition accuracies (%) averaged over data splits. (A = Amazon, W = Webcam, D = Dslr, C = Caltech, M = MNIST, U = USPS, K = CKPlus, I = MMI, H = HMDB51, F = UCF50.)

Expt. | SVM(T) | SVM(S) | SVM(S+T) | MMDT | AMKL | CSVM
A→W (1) | 56.06±0.95 | 37.36±1.19 | 51.26±1.19 | 64.87±1.26 | 67.85±1.06 | 66.40±1.09
A→D (2) | 43.15±0.78 | 37.64±0.96 | 47.56±0.99 | 54.41±1.00 | 56.22±0.89 | 57.13±0.98
W→A (3) | 44.39±1.18 | 32.03±0.90 | 44.87±0.59 | 50.54±0.82 | 52.96±0.57 | 53.97±0.42
W→D (4) | 45.20±1.34 | 61.06±0.86 | 65.39±0.89 | 62.48±0.98 | 75.95±0.94 | 68.27±0.86
D→A (5) | 42.17±1.03 | 31.48±0.65 | 46.17±0.44 | 50.45±0.75 | 52.36±0.57 | 54.10±0.55
D→W (6) | 54.91±0.80 | 69.81±1.06 | 76.19±0.64 | 74.34±0.66 | 85.94±0.44 | 77.17±0.46
A→C (7) | 26.62±0.60 | 38.61±0.50 | 42.46±0.39 | 39.67±0.50 | 44.92±0.46 | 44.74±0.57
W→C (8) | 25.82±0.78 | 26.67±0.59 | 34.53±0.76 | 34.86±0.79 | 39.20±0.57 | 39.77±0.59
D→C (9) | 26.88±0.74 | 25.74±0.47 | 34.68±0.67 | 35.82±0.75 | 41.12±0.44 | 41.27±0.51
C→A (10) | 43.52±1.07 | 36.22±0.82 | 47.75±0.60 | 51.10±0.76 | 55.98±0.58 | 55.56±0.76
C→W (11) | 55.49±1.02 | 29.72±1.54 | 51.28±1.23 | 62.94±1.11 | 68.70±1.07 | 67.74±1.05
C→D (12) | 43.07±1.47 | 32.56±1.03 | 47.68±1.17 | 52.56±0.97 | 58.82±0.83 | 59.72±1.01
M→U (13) | 70.73±0.41 | 38.89±0.61 | 64.36±0.41 | 68.96±0.43 | 79.56±0.30 | 76.02±0.34
U→M (14) | 58.23±0.39 | 21.67±0.33 | 38.43±0.36 | 48.29±0.32 | 63.80±0.32 | 63.25±0.31
K→I (15) | 33.31±0.27 | 13.30±0.15 | 25.83±0.31 | 18.28±0.42 | 31.87±0.29 | 33.10±0.29
I→K (16) | 45.65±0.47 | 19.47±0.55 | 25.63±0.35 | 21.33±0.81 | 43.59±0.50 | 48.54±0.47
H→F (17) | 28.94±0.26 | 17.45±0.17 | 23.00±0.19 | 29.05±0.23 | 33.06±0.23 | 35.89±0.25
F→H (18) | 18.64±0.19 | 16.99±0.16 | 19.58±0.17 | 22.28±0.18 | 24.28±0.16 | 24.41±0.19
4.2 Existing Methods
We compare our method with existing supervised DA techniques based on SVMs: SVM(T) (linear SVM with training data from the target domain), SVM(S) (linear SVM with training data from the source domain), SVM(S+T) (linear SVM with the union of source and target domain training data), MMDT (the Max-Margin Domain Transform [6]), AMKL (the Adaptive Multiple Kernel Learning [4]), and CSVM (our Coupled SVM algorithm).
4.3 Experimental Details and Results
We conducted experiments with different combinations of datasets. Table 1 depicts the results comparing the algorithms; the results are averaged across multiple random splits of the data. The results for SVM(S) demonstrate that although the datasets consist of the same categories, the domains have different distributions of data points. This is also highlighted by the success of SVM(T) even with few labeled training data points. The naive union of the source and target training data is beneficial in some cases but not always, as illustrated by SVM(S+T). Amongst the algorithms we compare with, AMKL is on par with the CSVM; there is little to choose between the two in terms of accuracy. However, the CSVM is the simpler solution, since it reduces to a standard linear SVM, unlike AMKL, which is a multiple kernel based method.
In all of these experiments we apply leave-one-out cross validation over the target training data to determine the best values of the model parameters (the coupling weight and the two misclassification penalties). We also studied the CSVM by varying the number of samples available for training, dropping the Webcam and Dslr datasets as they have fewer data points. Figure 2(a) illustrates that increasing the number of source training data points does not affect the test accuracies: the SVM relies on support vectors to estimate the source decision boundary, and additional source training data does not modify the source boundary by much. The effect of additional target training data is comparatively more pronounced in Figure 2(b), which is intuitive. By far the most interesting is Figure 2(c): increasing both source and target training data is nearly comparable to increasing only the target training data. Source training data does not contribute to the target SVM beyond a threshold number of training data points.
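The cross-validation loop itself is simple to sketch. In the snippet below, `loocv_select` and the `ridge_sign` stand-in classifier are hypothetical illustrations; in the actual experiments the CSVM would be retrained for each held-out target point and each candidate parameter setting:

```python
import numpy as np

def loocv_select(fit_predict, X, y, grid):
    """Return the hyperparameter setting in `grid` with the best
    leave-one-out accuracy over the (small) labeled target training set."""
    best, best_acc = None, -1.0
    for params in grid:
        hits = 0
        for i in range(len(y)):                 # hold out one target point
            keep = np.arange(len(y)) != i
            pred = fit_predict(X[keep], y[keep], X[i], **params)
            hits += (pred == y[i])
        acc = hits / len(y)
        if acc > best_acc:
            best, best_acc = params, acc
    return best, best_acc

# Hypothetical stand-in classifier: sign of a ridge-regularized least-squares
# fit; `lam` plays the role of the regularization weight being tuned.
def ridge_sign(X, y, x_new, lam):
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return np.sign(x_new @ w)

rng = np.random.default_rng(0)
Xt = rng.normal(size=(8, 3))
yt = np.sign(Xt[:, 0])                     # toy labels: sign of first coordinate
grid = [{"lam": v} for v in (0.01, 0.1, 1.0, 10.0)]
best, best_acc = loocv_select(ridge_sign, Xt, yt, grid)
```

Leave-one-out is a natural choice here because the labeled target set is tiny: every target point serves as a validation point exactly once, so no labeled data is wasted on a fixed held-out split.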
5 Conclusions
The CSVM is elegant, efficient and easy to implement. We plan to extend this work to nonlinear adaptations, modeling classifier similarity in an infinite dimensional (kernel) space, and to explore the idea of unsupervised DA.
6 Acknowledgments
This material is based upon work supported by the National Science Foundation (NSF) under Grant No:1116360. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.
References
 [1] Y. Aytar and A. Zisserman. Tabula rasa: Model transfer for object category detection. In IEEE ICCV, 2011.
 [2] L. Bruzzone and M. Marconcini. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE PAMI, 32(5):770–787, 2010.
 [3] H. Daumé III. Frustratingly easy domain adaptation. In Association for Computational Linguistics, 2007.
 [4] L. Duan, I. W. Tsang, and D. Xu. Domain transfer multiple kernel learning. IEEE PAMI, 34(3):465–479, 2012.
 [5] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In IEEE CVPR, 2012.
 [6] J. Hoffman, E. Rodner, J. Donahue, K. Saenko, and T. Darrell. Efficient learning of domain-invariant image representations. In Int'l Conference on Learning Representations (ICLR), 2013.

 [7] V. Kantorov and I. Laptev. Efficient feature extraction, encoding, and classification for action recognition. In IEEE CVPR, 2014.
 [8] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In IEEE ICCV, 2011.
 [9] X. Li. Regularized adaptation: Theory, algorithms and applications. PhD thesis, University of Washington, 2007.
 [10] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer joint matching for unsupervised domain adaptation. In IEEE CVPR, 2014.
 [11] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In IEEE CVPRW, 2010.
 [12] M. Pantic, M. Valstar, R. Rademaker, and L. Maat. Web-based database for facial expression analysis. In IEEE ICME, 2005.
 [13] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In IEEE CVPRW, pages 512–519, 2014.
 [14] K. K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981, 2013.
 [15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [16] C. Widmer, M. Kloft, N. Görnitz, G. Rätsch, P. Flach, T. De Bie, and N. Cristianini. Efficient training of graph-regularized multitask SVMs. In ECML, 2012.
 [17] J. Yang, R. Yan, and A. G. Hauptmann. Adapting SVM classifiers to data with shifted distributions. In Data Mining Workshops, IEEE ICDM, pages 69–76, 2007.