Utilizing Big Data to Predict Users' Behavior in Social Networks through Logistic Regression

A Thesis Submitted to the

Council of the College of Administration and Economics at the University of Karbala, in partial fulfillment of the requirements for a master's degree in Statistics

Written by

Zahraa Hilal Hamoud
Supervised by
Prof. Dr. Mushtaq Karim Abdul Rahim

Abstract

Big data has become essential to prediction and decision-making. The term refers to large, complex datasets gathered from many sources, and it is commonly characterized by volume, velocity, and variety: it spans structured, semi-structured, and unstructured data, drawn from sources such as medical information and opinion-related data. This study uses big data to estimate the parameters of the logistic regression model and to predict the behavior of social media users. The binary logistic regression model, one of the most important non-linear models, is employed. The numerical procedures behind conventional estimation methods sometimes fail to reach an optimal solution for its parameters, so the conventional methods are improved using the genetic algorithm, and all estimation methods are then compared to select the best one for the binary logistic regression model parameters. Simulations across various sample sizes and large datasets showed that the improved maximum likelihood method was the best of the methods enhanced by the genetic algorithm, and that the conventional maximum likelihood method was the best of the conventional methods; each achieved the lowest mean squared error (MSE) within its group.

On the practical side, real data from the social media platform Instagram were used, covering 58,000 users, from which a random sample of 50,000 users was drawn and modeled. The results showed that the binary logistic regression model suits these data, with a correct classification rate of 84%; that is, the model correctly classified 84% of accounts as either real or fake. The value reported for the Receiver Operating Characteristic (ROC) curve was 0.08, which the study interprets as the probability with which the test distinguishes between positive and negative outcomes.
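As a rough illustration of the workflow the abstract describes, the Python sketch below simulates data from a binary logistic model, estimates the parameters by maximum likelihood, refines the estimate with a small genetic algorithm, and reports the MSE of the estimates together with the classification rate and the area under the ROC curve. Everything in it is an assumption made for illustration, not the thesis's implementation: the true coefficients, sample size, and number of replications are hypothetical; plain gradient ascent stands in for whatever numerical routine the study used; and seeding the genetic algorithm with the conventional estimate is only one possible reading of "improved using the genetic algorithm".

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def linear(beta, row):
    """Linear predictor beta' x for one observation."""
    return sum(b * v for b, v in zip(beta, row))

def simulate(n, beta, rng):
    """Draw n rows (1, x1, x2) and Bernoulli labels from a logistic model."""
    X = [[1.0, rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    y = [1 if rng.random() < sigmoid(linear(beta, row)) else 0 for row in X]
    return X, y

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood: sum of y*z - log(1 + exp(z))."""
    total = 0.0
    for row, yi in zip(X, y):
        z = linear(beta, row)
        total += yi * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total

def fit_mle(X, y, steps=200, lr=1.0):
    """Maximize the log-likelihood by plain gradient ascent (an assumed routine)."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, yi in zip(X, y):
            err = yi - sigmoid(linear(beta, row))
            for j, v in enumerate(row):
                grad[j] += err * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def refine_ga(start, X, y, rng, pop_size=20, generations=15, sigma=0.05):
    """Toy genetic algorithm seeded at `start`; elitism keeps the best vector."""
    def fitness(b):
        return log_likelihood(b, X, y)
    pop = [list(start)]
    pop += [[b + rng.gauss(0, sigma) for b in start] for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]            # uniform crossover
            children.append([c + rng.gauss(0, sigma) for c in child])   # Gaussian mutation
        pop = parents + children
    return max(pop, key=fitness)

def accuracy(beta, X, y, cutoff=0.5):
    """Share of observations classified correctly at the given cutoff."""
    hits = sum((sigmoid(linear(beta, r)) >= cutoff) == (yi == 1) for r, yi in zip(X, y))
    return hits / len(y)

def auc(beta, X, y):
    """Area under the ROC curve as P(score of a positive > score of a negative)."""
    pos = [linear(beta, r) for r, yi in zip(X, y) if yi == 1]
    neg = [linear(beta, r) for r, yi in zip(X, y) if yi == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def squared_error(est, truth):
    """Mean squared deviation of an estimate from the true coefficients."""
    return sum((e - t) ** 2 for e, t in zip(est, truth)) / len(truth)

rng = random.Random(0)
true_beta = [-0.5, 1.0, -1.5]   # hypothetical coefficients for the simulation
replications = 5                # kept small so the sketch runs quickly

mse_mle = mse_ga = 0.0
for _ in range(replications):
    X, y = simulate(400, true_beta, rng)
    b_mle = fit_mle(X, y)
    b_ga = refine_ga(b_mle, X, y, rng)
    mse_mle += squared_error(b_mle, true_beta) / replications
    mse_ga += squared_error(b_ga, true_beta) / replications

print(f"MSE of estimates: conventional ML = {mse_mle:.4f}, GA-refined ML = {mse_ga:.4f}")
print(f"Last sample: classification rate = {accuracy(b_mle, X, y):.3f}, "
      f"AUC = {auc(b_mle, X, y):.3f}")
```

Because this genetic algorithm keeps its best candidate in every generation, the refined estimate never has a lower log-likelihood than its starting point. Whether that translates into a lower MSE depends on the data and is not something this sketch settles.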

Moreover, the study found the following factors to be significant in the model: the number of people or pages the user follows; the length of the user's bio; whether the account has a profile picture; the availability of a link; the percentage of non-image media on the account (ranging from 0.0 to 1.0; Instagram includes three types of media: images, videos, and carousel posts); engagement, which resembles the interaction rate but is specific to comments; and the percentage of hashtags used. These factors significantly influenced the classification in the model.
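For orientation, a binary logistic regression with the seven factors listed above can be written in the standard form below. The notation is an assumption and does not come from the thesis: \(Y_i\) is the account-type indicator for account \(i\) (which of "real" or "fake" is coded as 1 is not stated in the abstract), \(x_{i1}, \dots, x_{i7}\) are the seven factors in the order given, and \(\beta_0, \dots, \beta_7\) are the parameters to be estimated.

```latex
\[
\pi_i = \Pr(Y_i = 1 \mid x_{i1}, \dots, x_{i7})
      = \frac{\exp\!\left(\beta_0 + \sum_{j=1}^{7} \beta_j x_{ij}\right)}
             {1 + \exp\!\left(\beta_0 + \sum_{j=1}^{7} \beta_j x_{ij}\right)},
\qquad
\log\frac{\pi_i}{1 - \pi_i} = \beta_0 + \sum_{j=1}^{7} \beta_j x_{ij}.
\]
```

Under this form, each \(\beta_j\) is the change in the log-odds of \(Y_i = 1\) for a one-unit increase in factor \(j\), holding the other factors fixed.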
