A thesis submitted to
The Council of the College of Administration
and Economics at the University of Kerbala
in partial fulfillment of the requirements
for the Ph.D. in Statistics
By
Saif Hosam Raheem
Supervised by
Prof. Dr. Jassim Nassir Hussain
Assist. Prof. Dr. Enas Abid Alhafidh Mohamed
Abstract
The theory of sufficient dimension reduction (SDR) offers an essential tool for understanding and analyzing high-dimensional data. It aims to represent the data in fewer dimensions without losing any information needed about the distribution of the data. What distinguishes SDR is its ability to replace the high-dimensional predictors with a standardized, lower-dimensional projection of the data, which makes the data easier to analyze and understand. This is accomplished by finding the central subspace, which contains the most information about the significant changes and patterns in the data. When the number of independent variables exceeds the sample size, regression analysis becomes difficult. The model becomes complex, and a challenge known as the “curse of dimensionality” (CD) arises, which makes it hard to handle the data and extract meaningful information from it.
In some cases, when the number of independent variables is large compared to the sample size, excess variance or strong correlation can occur among the independent variables, which affects our ability to infer accurate statistical relationships between the variables. The analysis of such data is complicated, and traditional statistical methods cannot be used because they give inaccurate results. To overcome these problems, several techniques have been developed. One such technique is dimensionality reduction, in which the original independent variables are transformed into a lower-dimensional space. There are two ways to reduce dimensions, namely the Variable Selection (V.S) method and the Variable Extraction method.
In this thesis, we propose combining one variable selection method with two dimension reduction methods. Specifically, we employ the Reciprocal Lasso method with the Minimum Average Variance Estimation (MAVE) method, leading to a new method called Sparse MAVE Reciprocal Lasso (SMAVE-Rlasso). In addition, employing the Reciprocal Lasso method with the Sliced Inverse Regression (SIR) approach leads to a second proposed method, Sparse Sliced Inverse Regression Reciprocal Lasso (SSIR-Rlasso).
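As an illustration of the kind of dimension reduction step named above, the following is a minimal sketch of ordinary (non-sparse) Sliced Inverse Regression in Python with NumPy. It is not the proposed SSIR-Rlasso method and includes no Reciprocal Lasso penalty; the function name `sir_directions` and its parameters are hypothetical and chosen only for this example.

```python
import numpy as np

def sir_directions(X, y, n_slices=5, n_dirs=1):
    """Plain SIR sketch (hypothetical helper, no sparsity penalty).

    Returns a (p, n_dirs) matrix whose columns are unit-norm
    estimates of directions spanning the central subspace.
    """
    n, p = X.shape

    # Standardize the predictors: Z = (X - mean) @ cov^{-1/2}.
    # This assumes n > p so the sample covariance is invertible.
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ inv_sqrt

    # Slice the observations by the sorted response.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)

    # Weighted covariance of the slice means of Z.
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    # Leading eigenvectors of M, mapped back to the original X scale.
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_dirs]      # keep the largest n_dirs
    B = inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)
```

Note that this plain version requires more observations than predictors, which is exactly the restriction that motivates sparse variants when the number of independent variables is large relative to the sample size.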
To verify the accuracy of the proposed methods, they were compared with a group of existing methods using two criteria: the mean of the mean squared error (Mean of MSE) and the average number of zero coefficients (Ave0’s). The first proposed method (SMAVE-Rlasso) was compared with the methods (SMAVE), (SMAVE-EN), and (SMAVE-ADEN). The second method (SSIR-Rlasso) was compared with (SSIR), (SMAVE-EN) and (SMAVE-AL). The proposed methods performed best, achieving the lowest value of the (Mean of MSE) criterion and the highest value of the (Ave0’s) criterion. The two proposed methods were then applied to real data from a sample of 130 Coronavirus patients hospitalized in Karbala. The dependent variable Y is the duration of hospitalization, measured in hours, and 47 independent variables were used, collected through a form prepared for this purpos