### Matlab Code

*Note: The following code is provided "as is" without support or guarantees. Most of it is not optimized and was developed for research purposes only. Feel free to contact me with questions or suggestions.*

#### **Random Forests**

**Random Forest Clustering library:** download

**Description:** The library contains routines for Random Forest Clustering. Random Forest clustering, in its basic version [Shi and Horvath 2006], learns a Random Forest and derives from it a similarity between all pairs of objects, to be used within a classic distance-based clustering method (such as spectral clustering). The library contains different RF-distances: the original one of Breiman (2001) together with more recent extensions (Zhu et al CVPR 2014, Ting et al KDD 2016, Aryal et al Data Mining & Knowledge Discovery 2020). It also contains the RatioRF measure we published in IEEE TKDE in 2023.
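As a rough illustration of the basic forest-derived similarity (the fraction of trees in which two objects end up in the same leaf), here is a minimal Python sketch. It is not the Matlab library's code or API: the function name `forest_similarity` and the input layout `leaf_ids[t][i]` (leaf reached by object `i` in tree `t`) are assumptions made for this example, and the forest is assumed to have been trained elsewhere.

```python
def forest_similarity(leaf_ids):
    """Fraction of trees in which each pair of objects shares a leaf.

    leaf_ids[t][i] is the identifier of the leaf reached by object i
    in tree t (hypothetical input layout, assumed for this sketch).
    """
    n_trees = len(leaf_ids)
    n_objects = len(leaf_ids[0])
    counts = [[0] * n_objects for _ in range(n_objects)]
    for leaves in leaf_ids:
        for i in range(n_objects):
            for j in range(n_objects):
                if leaves[i] == leaves[j]:
                    counts[i][j] += 1
    # Normalise the co-occurrence counts by the number of trees.
    return [[c / n_trees for c in row] for row in counts]


# Toy forest: 3 trees, 4 objects.
leaf_ids = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 3, 3],
]
S = forest_similarity(leaf_ids)
print(S[0][1], S[2][3], S[0][2])
```

The resulting matrix can be passed (possibly after conversion to a distance) to any distance-based clustering method; the more recent measures listed above replace this co-occurrence count with richer per-tree comparisons.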

**Related papers:**

- M. Bicego, F. Cicalese, A. Mensi: "RatioRF: a novel measure for Random Forest clustering based on the Tversky's Ratio model", IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 1, pp. 830--841, (2023)

- M. Bicego, F. Escolano: "On Learning Random Forests for Random Forest Clustering", Proc. Int. Conf. on Pattern Recognition (ICPR2020), pp. 3451--3458, (2020)

**Constrained Random Forest Clustering:** download

**Description:** This is a simple extension of the Random Forest Clustering scheme which works in the presence of partition-level constraints, i.e. constraints which are given in terms of a subset of labels for the objects.

**Related papers:**

- M. Bicego, H.A. Hassan: "An extension of Random Forest-Clustering schemes which works with partition-level constraints", Proc. Int. Conf. on Pattern Recognition (ICPR2024), (2024)

**Dissimilarity Random Forest Clustering:** download

**Description:** This is a Random Forest approach to clustering that can be used when the input is given in the form of pairwise distances. It belongs to the family of distance-based clustering approaches, such as hierarchical clustering, spectral clustering, affinity propagation, and dominant set clustering (to cite just a few). It is particularly suited to clustering non-vectorial objects such as sequences, graphs and strings, for which many different distance measures have been proposed in the literature.

**Related papers:**

- M. Bicego: "Dissimilarity Random Forest Clustering", Proc. Int. Conf. on Data Mining (ICDM2020), (2020)

- M. Bicego: "DisRFC: a dissimilarity-based Random Forest Clustering approach", Pattern Recognition, Volume 133: 109036, (2023)

**Proximity Isolation Forests:** link (Code by Antonella Mensi)

**Description:** This is the code implementing Proximity Isolation Forests, an extension of Isolation Forests (Random Forests for anomaly detection) that can be used when the input is given in the form of pairwise distances. It is particularly suited to anomaly detection on non-vectorial objects such as sequences, graphs and strings, for which many different distance measures have been proposed in the literature.

**Related papers:**

- A. Mensi, M. Bicego, D. Tax: "Proximity Isolation Forests", Proc. Int. Conf. on Pattern Recognition (ICPR2020), pp. 8021--8028, (2020)

- A. Mensi, D. Tax, M. Bicego: "Detecting Outliers from Pairwise Proximities: Proximity Isolation Forests", Pattern Recognition, Volume 138: 109334, (2023)

#### **Clustering and Biclustering**

**Dominant Set Biclustering:** download

**Description:** This code implements the biclustering extension of the well-known Dominant Set Clustering method of [Pavan&Pelillo ICCV 2003]. The algorithm is suitable when the background has low values with respect to the biclusters: in principle, the background is zero and the biclusters are positive (larger than zero). The algorithm returns the largest bicluster; to extract more biclusters, as often done in the literature, one can mask the obtained bicluster (e.g. by inserting zeros in the corresponding positions of the matrix) and then search for the next one.
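The masking step mentioned above can be sketched in a few lines of Python. This is only an illustration of the zero-masking idea, not part of the Matlab code: the function name `mask_bicluster` is hypothetical, and the row and column indices of the bicluster are assumed to come from whatever biclustering routine was run beforehand.

```python
def mask_bicluster(matrix, rows, cols):
    """Return a copy of the matrix with one bicluster set to zero.

    rows and cols are the row and column indices of the bicluster that
    was just extracted (assumed to be provided by the biclustering step).
    """
    masked = [list(row) for row in matrix]  # copy, the input is left untouched
    for i in rows:
        for j in cols:
            masked[i][j] = 0
    return masked


# Toy data matrix with two blocks on a zero background.
data = [
    [5, 5, 0],
    [5, 5, 0],
    [0, 0, 3],
]
# Suppose the first run returned rows {0, 1} and columns {0, 1}.
masked = mask_bicluster(data, rows=[0, 1], cols=[0, 1])
print(masked)
```

Running the biclustering algorithm again on `masked` would then search for the next bicluster, since the first one has been reduced to background.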

**Related papers:**

- M. Denitto, M. Bicego, A. Farinelli, M. Pelillo: "Dominant Set Biclustering", Proc. EMMCVPR, LNCS 10746, pp. 49–61, (2017)

- M. Denitto, M. Bicego, A. Farinelli, S. Vascon, M. Pelillo: "Biclustering with Dominant Sets", Pattern Recognition, vol 104, pp. 107318, (2020)

**Spike and Slab Biclustering:** download (Code by Matteo Denitto)

**Description:** This is a probabilistic approach to biclustering, based on a novel generative model which approaches biclustering from a sparse low-rank matrix factorization perspective. The main idea is to design a probabilistic model describing the factorization of a given data matrix into two other matrices, from which information about the rows and columns belonging to the sought biclusters can be obtained. One crucial ingredient of the proposed model is the use of a spike and slab sparsity-inducing prior; hence we term the approach spike and slab biclustering (SSBi). The code contains both the original version (described in the Pattern Recognition paper) and the version that allows priors such as spatial proximity to be included (described in the ICCV paper).

**Related papers:**

- M. Denitto, M. Bicego, A. Farinelli, M.A.T. Figueiredo: "Spike and Slab Biclustering", Pattern Recognition, vol 72, pp. 186--195, (2017)

- M. Denitto, S. Melzi, M. Bicego, U. Castellani, A. Farinelli, M.A.T. Figueiredo, Y. Kleiman, M. Ovsjanikov: "Region-based Correspondence Between 3D Shapes via Spatially Smooth Biclustering", Proc. Int. Conf. on Computer Vision (ICCV2017), (2017)

**C-Link, a hierarchical clustering approach to large-scale coalition formation:** download

**Description:** The code implements Coalition Linkage (C-Link), a coalition formation algorithm inspired by the well-known class of hierarchical agglomerative clustering techniques (Linkage algorithms). This heuristic method is able to scale to thousands of agents while still providing high-quality solutions.
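To give a flavour of the agglomerative idea, the following Python sketch starts from singleton coalitions and repeatedly merges the pair with the largest gain in value. It is a simplified illustration under stated assumptions, not the C-Link implementation: the gain-based merge criterion, the stopping rule (stop when no merge has positive gain), the function name `greedy_coalition_linkage` and the toy value function are all choices made for this example, and the published algorithm may differ.

```python
from itertools import combinations


def greedy_coalition_linkage(agents, value):
    """Greedy agglomerative coalition formation (illustrative sketch).

    value(coalition) returns the worth of a frozenset of agents.
    """
    coalitions = [frozenset([a]) for a in agents]
    while len(coalitions) > 1:
        best_gain, best_pair = 0.0, None
        for c1, c2 in combinations(coalitions, 2):
            # Gain obtained by merging the two coalitions.
            gain = value(c1 | c2) - value(c1) - value(c2)
            if gain > best_gain:
                best_gain, best_pair = gain, (c1, c2)
        if best_pair is None:  # no merge improves the total value
            break
        c1, c2 = best_pair
        coalitions = [c for c in coalitions if c not in (c1, c2)]
        coalitions.append(c1 | c2)
    return coalitions


# Toy value function: sum of pairwise synergies inside a coalition.
SYNERGY = {
    frozenset({0, 1}): 3.0, frozenset({2, 3}): 2.0,
    frozenset({0, 2}): -5.0, frozenset({0, 3}): -5.0,
    frozenset({1, 2}): -5.0, frozenset({1, 3}): -5.0,
}


def toy_value(coalition):
    return sum(SYNERGY[frozenset(p)] for p in combinations(sorted(coalition), 2))


structure = greedy_coalition_linkage(range(4), toy_value)
print(structure)
```

Each iteration only evaluates pairs of current coalitions, which is what keeps this family of heuristics tractable compared with searching over all coalition structures.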

**Related papers:**

- A. Farinelli, M. Bicego, S.D. Ramchurn, M. Zucchelli: "C-Link: A Hierarchical Clustering Approach to Large-scale Near-optimal Coalition Formation", Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI2013), (2013)

- A. Farinelli, M. Bicego, F. Bistaffa, S.D. Ramchurn: "A hierarchical clustering approach to large-scale near-optimal coalition formation with quality guarantees", Engineering Applications of Artificial Intelligence, vol. 59, pp. 170-185, (2017)

**Weighted One Class Support Vector Machine:** download

**Description:** The code implements a weighted version of the One Class Support Vector Machine (OCSVM), in particular of the variant of [Tax&Duin PRL 99] called Support Vector Data/Domain Description (SVDD). The method makes it possible to train an OCSVM when each input point comes with an associated weight, possibly indicating its importance. The method is the basis of the soft kernel clustering approach described in the Pattern Recognition paper, in which each cluster is modelled by a weighted one-class support vector machine and the final clustering is obtained with an EM-like iterative procedure. NOTE: You need to install the dd_tools library of David Tax.

**Related papers:**

- M. Bicego, M.A.T. Figueiredo: "Soft Clustering using Weighted One Class Support Vector Machines", Pattern Recognition, vol. 42(1), pp. 27-32, (2009)

#### **Other methods in Statistical Pattern Recognition**

**NIR (No/Null Information Rate) test:** download

**Description:** The code implements a statistical test which can be used to answer the following question: "Is the accuracy obtained by my classifier high enough?", or better, "Can we say, with statistically significant confidence, that our classification system is able to solve the problem?" This natural question arises in many research contexts, especially in the biomedical field.
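One common way to formalize this kind of question is a one-sided exact binomial test of the observed accuracy against the no-information rate, taken here as the relative frequency of the largest class. The Python sketch below follows that reading; it is an assumption about the general idea rather than a transcription of the test in the paper or of the Matlab code, and the function names are hypothetical.

```python
from collections import Counter
from math import comb


def no_information_rate(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def nir_p_value(n_correct, n_total, nir):
    """One-sided exact binomial p-value.

    Probability of at least n_correct successes out of n_total trials
    when each trial succeeds with probability nir.
    """
    return sum(
        comb(n_total, k) * nir**k * (1.0 - nir) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


# Toy example: 100 test objects, 60 of them in the largest class,
# and a classifier that labels 70 of them correctly.
labels = ["healthy"] * 60 + ["diseased"] * 40
nir = no_information_rate(labels)
p_value = nir_p_value(70, len(labels), nir)
print(f"NIR = {nir:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests that the accuracy is unlikely to be explained by always guessing the majority class; the significance level is left to the user.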

**Related papers:**

- M. Bicego, A. Mensi: "Null/No Information Rate (NIR): a statistical test to assess if a classification accuracy is significant for a given problem", arXiv preprint arXiv:2306.06140, (2023)

**Componential Counting Grids:** download (Code by Alessandro Perina)

**Description:** The code implements the Componential Counting Grid model, a probabilistic model for documents which starts from a Bag of Words representation. The model extends the Counting Grid method by adding to it the componential nature of topic models, resulting in a much more flexible description. NOTE: You may also be interested in the code of the original Counting Grid model (N. Jojic, A. Perina: "Multidimensional counting grids: Inferring word order from disordered bags of words", UAI 2011): download

**Related papers:**

- A. Perina, N. Jojic, M. Bicego, A. Truski: "Documents as multiple overlapping windows into grids of counts", Proc. Advances in Neural Information Processing Systems (NIPS2013), (2013)

**Pruning Model Selection for Hidden Markov Models:** download

**Description:** The code implements a sequential strategy for selecting the most appropriate number of states in a Hidden Markov Model. The basic idea is to perform a decreasing learning, starting each training session from a nearly good starting point, derived from the result of the previous training session by pruning the least probable state of the model.

**Related papers:**

- M. Bicego, V. Murino, M.A.T. Figueiredo: "A sequential pruning strategy for the selection of the number of states in Hidden Markov Models", Pattern Recognition Letters, vol. 24(9-10), pp. 1395-1407, (2003)

#### **Protein Remote Homology Detection**

**Soft Ngram representation and models for Protein Remote Homology Detection:** link (Code by Pietro Lovato)

**Description:** The code implements the Soft Bag of Words and the soft Probabilistic Latent Semantic Analysis (PLSA), a novel representation and model for Protein Remote Homology Detection (PRHD). The tools make it possible to characterize scenarios, such as PRHD, in which words are equipped with weights indicating their importance.
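The idea of a bag of words in which each occurrence carries a weight can be sketched in Python as follows. This is only a minimal illustration under assumptions: each occurrence contributes its weight instead of a unit count, the weights are taken as given (how they are derived for protein sequences is not covered here), and the function name is hypothetical rather than taken from the released code.

```python
from collections import defaultdict


def soft_bag_of_words(weighted_words):
    """Accumulate word weights instead of unit counts.

    weighted_words is an iterable of (word, weight) pairs; a standard
    bag of words corresponds to all weights being equal to 1.
    """
    counts = defaultdict(float)
    for word, weight in weighted_words:
        counts[word] += weight
    return dict(counts)


# Toy sequence of 2-grams with hypothetical importance weights.
occurrences = [("AL", 0.5), ("LG", 0.25), ("AL", 0.25)]
bag = soft_bag_of_words(occurrences)
print(bag)
```

With all weights set to 1 the function reduces to ordinary n-gram counting, which makes the soft representation a direct generalization of the standard one.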

**Related papers:**

- P. Lovato, M. Cristani, M. Bicego: "Soft Ngram representation and modeling for protein remote homology detection", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 14(6), pp. 1482-1488, (2017)

**A multimodal approach for Protein Remote Homology Detection:** download (Code by Pietro Lovato)

**Description:** This is a multimodal approach to Protein Remote Homology Detection. The approach, based on topic models, is able to improve a description learned from a set of sequences by using knowledge derived from a partial set of corresponding 3D structures.

**Related papers:**

- P. Lovato, A. Giorgetti, M. Bicego: "A multimodal approach to protein remote homology detection", poster at the European Conference on Computational Biology (ECCB2014), (2014)

- P. Lovato, A. Giorgetti, M. Bicego: "A multimodal approach for protein remote homology detection", IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 12(5), pp. 1193-1198, (2015)

**Dataset based on SCOP 2.04 for Protein Remote Homology Detection:** download

**Description:** This is a novel superfamily benchmark for Protein Remote Homology Detection, created from the most recent and updated SCOP 2.04. The dataset has been used in the papers listed below.

**Related papers:**

- P. Lovato, M. Cristani, M. Bicego: "Soft Ngram representation and modeling for protein remote homology detection", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 14(6), pp. 1482-1488, (2017)

- A. Mensi, M. Bicego, P. Lovato, M. Loog, D.M.J. Tax: "Protein Remote Homology Detection Using Dissimilarity-Based Multiple Instance Learning", Proc. Int. W. on S+SSPR, pp. 119-129, (2018)

- A. Mensi, M. Bicego, P. Lovato, M. Loog, D.M.J. Tax: "A dissimilarity-based Multiple Instance Learning approach for Protein Remote Homology Detection", Pattern Recognition Letters, vol. 128, pp. 231--236, (2019)

#### **Expression data**

**Gene Expression modeling with counting grids:** link (Code by Pietro Lovato)

**Description:** The code implements a mining approach for gene expression data. In particular, the approach aims at extracting an informative representation of gene expression profiles based on a generative model called the Counting Grid (CG), which is useful to distill information and derive compact, interpretable representations of the statistical patterns present in the gene expression data.

**Related papers:**

- P. Lovato, M. Bicego, M. Cristani, N. Jojic, A. Perina: "Feature selection using Counting Grids: application to microarray data", Proc. Int. Workshop on Statistical Techniques in Pattern Recognition (SPR2012), (2012)

- A. Perina, M. Kesa, M. Bicego: "Expression Microarray data classification using Counting Grids and Fisher Kernel", Proc. Int. Conf. on Pattern Recognition (ICPR2014), pp. 1770-1775, (2014)

- P. Lovato, M. Bicego, M. Kesa, N. Jojic, V. Murino, A. Perina: "Traveling on discrete embeddings of gene expression", Artificial Intelligence in Medicine, vol. 70, pp. 1-11, (2016)

**Gene Expression classification with topic models and IT kernels:** link (Code by Pietro Lovato)

**Description:** The code implements a classification approach for gene expression data. The method is a hybrid generative-discriminative approach which extracts, via the so-called generative embeddings, a set of discriminative features from topic models (such as PLSA or LDA) learnt from data. The obtained features are then classified using advanced Information Theoretic (IT) kernels, such as the Jensen-Tsallis (JT) kernel of [Martins et al JMLR 2009].

**Related papers:**

- M. Bicego, P. Lovato, B. Oliboni, A. Perina: "Expression microarray classification using topic models", Proc. of ACM SAC - Bioinformatics and Computational Biology track (SAC-BIO2010), pp. 1516-1520, (2010)

- M. Bicego, P. Lovato, A. Perina, M. Fasoli, M. Delledonne, M. Pezzotti, A. Polverari, V. Murino: "Investigating topic models' capabilities in expression microarray data classification", IEEE/ACM Trans. on Computational Biology and Bioinformatics, vol. 9(6), pp. 1831-1836, (2012)

**Bag of Peaks Representation for NMR spectrometry:** download

**Description:** The code implements the Bag of Peaks representation, a novel approach which adapts the bag of words paradigm to the NMR spectrometry case. Its main features are interpretability and high discriminative power, as shown in the Bioinformatics paper.

**Related papers:**

- M. Bicego, G. Brelstaff, N. Culeddu, M. Chessa: "Bag of Peaks: Interpretation of NMR Spectroscopy", ECCB (2008)

- G. Brelstaff, M. Bicego, N. Culeddu, M. Chessa: "Bag of Peaks: interpretation of NMR spectrometry", Bioinformatics, vol. 25, pp. 258-264, (2009)