Exact methods for variable selection in linear regression with subsets: analysis of different tools and strategies

  1. JOAQUÍN ANTONIO PACHECO BONROSTRO 1
  2. SILVIA CASADO YUSTA 1
1 Universidad de Burgos, Burgos, Spain (ROR: https://ror.org/049da5t36)

Journal:
Rect@: Revista Electrónica de Comunicaciones y Trabajos de ASEPUMA

ISSN: 1575-605X

Year of publication: 2017

Volume: 18

Issue: 1

Pages: 71-92

Type: Article

DOI: 10.24309/RECTA.2017.18.1.05


Abstract

This paper analyses a variable selection problem for linear regression in which the set of independent variables is partitioned into disjoint groups. The problem consists of selecting variables subject to the constraint that the selected set must contain at least one variable from each group. This problem has many applications, notably the design of synthetic indicators in different fields (sociology and economics, among others): the groups of variables correspond to the different facets of the problem under study, so the indicators must contain variables from every group. To solve this problem, a Branch & Bound method that obtains exact solutions is proposed. In addition, different strategies for reducing the computing time of this method are proposed and analysed. Computational experiments show the good results of both strategies, separately and combined: they markedly reduce the computing time of the Branch & Bound method and make it possible to solve larger problems.
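The Branch & Bound scheme described above can be sketched in code. This is not the authors' implementation — a minimal illustration assuming an RSS (residual sum of squares) selection criterion and the classical bound that adding variables never increases RSS, so the fit on a partial selection together with all still-undecided variables lower-bounds every completion of that selection. The function names, the group encoding as a dict, and the toy data in the usage example are invented for illustration.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit using the given columns."""
    if not cols:
        return float(np.sum((y - y.mean()) ** 2))
    A = X[:, sorted(cols)]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def best_subset_bnb(X, y, groups, k):
    """Exact selection of k variables minimising RSS, with at least one
    variable from every group.  `groups` maps variable index -> group id.
    Returns (best_rss, best_subset)."""
    p = X.shape[1]
    group_ids = set(groups.values())
    best = [np.inf, None]

    def feasible(chosen, undecided):
        # Prune if some group can no longer be covered, or size k is
        # no longer reachable from this partial selection.
        covered = {groups[j] for j in chosen} | {groups[j] for j in undecided}
        return covered == group_ids and len(chosen) <= k <= len(chosen) + len(undecided)

    def recurse(i, chosen):
        undecided = set(range(i, p))
        if not feasible(chosen, undecided):
            return
        # Bound: RSS of (chosen + all undecided) <= RSS of any completion.
        if rss(X, y, chosen | undecided) >= best[0]:
            return
        if len(chosen) == k:
            if {groups[j] for j in chosen} == group_ids:
                val = rss(X, y, chosen)
                if val < best[0]:
                    best[0], best[1] = val, frozenset(chosen)
            return
        recurse(i + 1, chosen | {i})   # branch: include variable i
        recurse(i + 1, chosen)         # branch: exclude variable i

    recurse(0, set())
    return best[0], best[1]
```

For example, with two groups of three variables each and a response built exactly from one variable of each group, the method recovers that pair when asked for k = 2, since every completion of a pruned node is provably no better than the incumbent.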

Funding information

This work was partially supported by FEDER funds and the Spanish Ministry of Economy and Competitiveness (Project ECO2013-47129-C4-3-R), the Regional Government of "Castilla y León", Spain (Project BU329U14), and the Regional Government of "Castilla y León" and FEDER funds (Project BU062U16). This support is gratefully acknowledged.

References

  • A. Alfons, C. Croux and S. Gelper, Sparse least trimmed squares regression for analyzing high dimensional large data sets. Annals of Applied Statistics 7, 1 (2013), 226-248.
  • O. Arslan, Weighted LAD–LASSO method for robust parameter estimation and variable selection in regression. Computational Statistics & Data Analysis 56, 6 (2012), 1952-1965.
  • R. Bandura, A Survey of composite indices measuring country performance: 2008 Update. Office of Development Studies. United Nations Development Programme, Working Paper (2008).
  • F.J. Blancas Peral, M. Gonzalez Lozano, F.M. Guerrero Casas and M. Lozano Oyola, Indicadores Sintéticos de Turismo Sostenible: Una aplicación para los destinos turísticos de Andalucia. Revista Electrónica de Comunicaciones y Trabajos de ASEPUMA, Rect@ 11 (2010), 85-118.
  • C. Bouveyron and J. Jacques, Adaptive linear models for regression: improving prediction when population has changed. Pattern Recognition Letters, 31, 14 (2010), 2237-2247.
  • L. Breiman, Better subset regression using the nonnegative garrote. Technometrics 37, 4 (1995), 373-384.
  • M.J. Brusco, A comparison of simulated annealing algorithms for variable selection in principal component analysis and discriminant analysis. Computational Statistics & Data Analysis 77 (2014), 38-53.
  • M.J. Brusco, R. Singh and D. Steinley, Variable neighborhood search heuristics for selecting a subset of variables in principal component analysis. Psychometrika 74 (2009), 705-726.
  • M.J. Brusco and D. Steinley, Exact and approximate algorithms for variable selection in linear discriminant analysis. Computational Statistics & Data Analysis 55, 1 (2011), 123-131.
  • M. Bujosa, A. García-Ferrer and A. de Juan, Predicting Recessions with Factor Linear Dynamic Harmonic Regressions. Journal of Forecasting 32 (2013), 481–499.
  • Y.K. Chan, C.C.A. Kwan and T.L.D. Shek, Quality of life in Hong Kong: the CUHK Hong Kong quality of life index. Social Indicators Research, 71 (2005), 259-289.
  • C. Cotta, C. Sloper and P. Moscato, Evolutionary search of thresholds for robust feature set selection: Application to the analysis of microarray data. Lecture Notes in Computer Science 3005 (2004), 21-30.
  • B. Efron, T. Hastie, I. Johnstone and R. Tibshirani, Least angle regression. Annals of Statistics 32, 2 (2004), 407-499.
  • J. Fan and R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 (2001), 1348-1360.
  • I.E. Frank and J.H. Friedman, A statistical view of some chemometrics regression tools. Technometrics 35, 2 (1993), 109-135.
  • W.J. Fu, Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics 7, 3 (1998), 397-416.
  • C. Gatu and E.J. Kontoghiorghes, Branch-and-bound algorithms for computing the best-subset regression models. Journal of Computational and Graphical Statistics 15 (2006), 139-156.
  • C. Gatu, P. Yanev and E.J. Kontoghiorghes, A graph approach to generate all possible regression submodels. Computational Statistics & Data Analysis 52, 2 (2007), 799-815.
  • R. Genuer, J.M. Poggi and C. Tuleau-Malot, Variable selection using random forests. Pattern Recognition Letters, 31, 14 (2010), 2225-2236.
  • I. Gijbels and I. Vrinssen, Robust nonnegative garrote variable selection in linear regression. Computational Statistics & Data Analysis 85 (2015), 1-22.
  • C. Hans, A. Dobra and M. West, Shotgun stochastic search for “large p” regression. Journal of the American Statistical Association, 102, 478 (2007), 507-516.
  • M.A. Hasan, M.K. Hasan and M.A. Mottalib, Linear regression-based feature selection for microarray data classification. International Journal of Data Mining and Bioinformatics 11, 2 (2015), 167-179.
  • R. Hocking, The analysis and selection of variables in linear regression. The Annals of Statistics 32, 1 (1976), 1-49.
  • J.A. Khan, S. Van Aelst and R.H. Zamar, Robust linear model selection based on least angle regression. Journal of the American Statistical Association 102, 480 (2007), 1289-1299.
  • B.K. Kilinc, B. Asikgil, A. Erar and B. Yazici, Variable selection with genetic algorithm and multivariate adaptive regression splines in the presence of multicollinearity. International Journal of Advanced and Applied Sciences, 3, 12 (2016), 26-31.
  • K. Kim and J.S. Hong, A hybrid decision tree algorithm for mixed numeric and categorical data in regression analysis. Pattern Recognition Letters, 98 (2017), 39-45.
  • A.M. López-García and R.B. Castro-Núñez, Valoración de la actividad económica regional de España a través de indicadores sintéticos. Estudios de Economía Aplicada 22, 3 (2004), 1-21.
  • S. Luo and S. Ghosal, Forward selection and estimation in high dimensional single index models. Statistical Methodology, 33 (2016), 172-179.
  • J.H. Ma, Y. Leung and J.C. Luo, A highly robust estimator for regression models. Pattern Recognition Letters, 27, 1 (2006), 29-36.
  • R. Meiri and J. Zahavi, Using simulated annealing to optimize the feature selection problem in marketing applications. European Journal of Operational Research 171 (2006), 842-858.
  • R. Mundry and C.L. Nunn, Stepwise model fitting and statistical inference: Turning noise into signal pollution. The American Naturalist 173, 1 (2009), 119-123.
  • M. Nardo, M. Saisana, A. Saltelli, S. Tarantola, A. Hoffman and E. Giovannini, Handbook on constructing composite indicators: methodology and user guide. OECD Statistics, Working Paper 2005/3 (2005a).
  • M. Nardo, M. Saisana, A. Saltelli and S. Tarantola, Tools for composite indicators building. European Commission. Joint Research Centre. Working Paper 21682 (2005b).
  • T. Naylor, Técnicas de simulación en computadoras. Limusa (1977).
  • A.B. Owen, A robust hybrid of Lasso and Ridge regression. Technical Report. Department of Statistics, (Stanford University, 2006).
  • J. Pacheco, S. Casado and S. Porras, Exact methods for variable selection in principal component analysis: Guide functions and pre-selection. Computational Statistics & Data Analysis 57 (2013), 95-111.
  • J. Pacheco, S. Casado and L. Núñez, A variable selection method based on tabu search for logistic regression models. European Journal of Operational Research 199, 2 (2009), 506-511.
  • S.E. Parada Rico, E. Fiallo Leal and O. Blasco-Blasco, Construcción de indicadores sintéticos basados en juicio experto: aplicación a una medida integral de excelencia académica. Revista Electrónica de Comunicaciones y Trabajos de ASEPUMA, Rect@, 16 (2015), 51-67.
  • J. Ramajo-Hernández and M.A. Márquez-Paniagua, Indicadores sintéticos de actividad económica: el caso de Extremadura. Análisis regional: el proyecto Hispalink (Cabrer-Borrás, coord.). Mundi Prensa (2001), 301-312.
  • R. Y. Rubinstein, Simulation and the Monte Carlo method, Wiley (1981).
  • B. Seijo-Pardo, I. Porto-Díaz, V. Bolón-Canedo and A. Alonso-Betanzos, Ensemble feature selection: Homogeneous and heterogeneous approaches. Knowledge-Based Systems, 118 (2017), 124-139.
  • S. Sun, Q. Peng and X. Zhang, Global feature selection from microarray data using Lagrange multipliers. Knowledge-Based Systems, 110 (2016), 267-274.
  • A. Tangian, Analysis of the third European survey on working conditions with composite indicators. European Journal of Operational Research 181, 1 (2007), 468-499.
  • R. Tibshirani, Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B 58, 1 (1996), 267-278.
  • H. Wang, G. Li and G. Jiang, Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Business and Economic Statistics 25, 3 (2007), 347-355.
  • L. Wang and R. Li, Weighted Wilcoxon-type smoothly clipped absolute deviation method. Biometrics 65, 2 (2009), 564-571.
  • Y. Zhu, J. Liang, J. Chen and Z. Ming, An improved NSGA-III algorithm for feature selection used in intrusion detection. Knowledge-Based Systems, 116 (2017), 74-85.