Abstract

Probabilistic graphical models and machine learning are powerful data-driven tools for extracting useful knowledge from historical data; this knowledge can facilitate improved decision-making. Despite the prevailing efforts to combat the coronavirus disease 2019 (COVID-19) pandemic, uncertainties remain about its spread, future impact, and resurgence. In this paper, a data-driven approach is adopted to distill the hidden information about COVID-19 and its symptoms. This paper proposes: a Bayesian network which encodes the causal relationships among COVID-19 symptoms, an unsupervised machine learning algorithm that learns symptom patterns in a COVID-19 dataset, a deep neural network which predicts the symptom class of patients based on clustering experience with a 99.47% testing accuracy, and a time-series forecasting model that predicts the trend of COVID-19. The results from the experiments show the capability of data-driven methods in addressing the concerns of society and government in understanding the uncertainties about the virus, providing insights for developing policies, and reducing the spread of the virus.

Introduction

The unexpected havoc caused by the coronavirus disease 2019 (COVID-19) has led society and government to prioritize health and wellness more than ever before. The current COVID-19 pandemic, which was first identified in Wuhan, China in 2019 (Adetiba et al. 2022), is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus; since its emergence, lives have been lost and the economies of nations have been greatly affected. Sadly, the wide spread of the disease has influenced the social well-being of people, because the virus spreads between people who are in close proximity, especially in poorly ventilated rooms (Saadat, Rawtani, and Hussain 2020). People can also get infected when they touch their eyes, nose, or mouth after contact with a contaminated surface. We can therefore assert that it is difficult for individuals to take strict actions to protect themselves against the virus, and even more challenging for governments to accurately monitor, predict (or forecast) and prevent the spread.

Efforts have been made to curb the spread of the virus, and such efforts include providing safety guidelines on social distancing, use of face masks, and frequent hand washing. Most importantly, vaccination measures have proven to be effective in slowing the spread of the COVID-19 disease. Considering the COVID-19 mitigation strategies employed by the United States (U.S.) government, there is a need to critically study the virus cases, especially the health symptoms associated with them and their progression to respiratory failure or death (Amon 2020). According to the U.S. Centers for Disease Control and Prevention (CDC) (“CDC Covid Data Tracker,” n.d.), as of December 3, 2022, there were over 98.92 million confirmed cases and 1.08 million deaths caused by COVID-19 in the U.S.; the irregular rise and fall in the daily recorded cases further shows the importance of accurately forecasting the spread of the virus alongside studying its symptoms. Illinois, which is ranked among the top 10 states in the U.S. by population, had recorded over 3.89 million confirmed cases and 40,330 deaths as of December 3, 2022. Among the 102 counties in the State of Illinois, Cook County is the most populated. The CDC statistics show that there were over 1.45 million confirmed cases and 15,510 deaths in Cook County as of December 3, 2022; this accounts for 37% and 38% of the total confirmed cases and deaths due to COVID-19 in Illinois respectively. The statistical data on the global, continental, national, state, and county spread of the virus is available on the CDC repository and website (“CDC Covid Data Tracker,” n.d.).

The severity of the impact of COVID-19 on the health and social behavior of people has recently continued to increase, even though between 2021 and 2022 there was a reduction in the spread of the initial variants through contact tracing and the administering of vaccines (“Covid-19 Vaccines,” n.d.): Pfizer-BioNTech, Moderna, Novavax, and Johnson & Johnson’s. The recent increase in the spread of the virus stems from the emergence of new variants, some of which pose greater threats than the early ones; this is because it takes time to study each variant, its symptoms, its diagnosis, how the human respiratory system reacts to it, and how to develop a potent vaccine against it. The new variants have caused health agencies (e.g., CDC, NIH and WHO) to advocate for booster doses of the available vaccines; this requires that people get multiple doses of the COVID-19 vaccines to enjoy better health in the face of the growing spread of the new variants.

It is possible that if society and individuals understand the COVID-19 disease, its symptoms, and how it spreads, they can better plan their lives and become resilient to the threat it poses. However, such a public awareness strategy relies on the availability of, and access to, historical data and knowledge about the disease. In most cases, only the health agencies and government are privy to the data, while society is served with “second-hand” information about the projection (forecasting) and the safety measures for curtailing the disease. Since there are several symptoms associated with COVID-19, it is possible to study the relationship between the symptoms and the disease. Such information can guide the government, health agencies and society in policy making, vaccine development and responsiveness to safety respectively.

This paper addresses these problems in three steps: 1) develop a probabilistic graphical model from COVID-19 data, where the nodes represent symptoms (i.e., variables or features), and the edges represent the causal relationships between them. The idea is that, since several features are collected for each patient sample, it is natural to study the (in)dependence mapping between these features; 2) develop an unsupervised machine learning (ML) algorithm that learns (and extracts) the symptom similarity between the COVID-19 cases and groups them into appropriate classes (or clusters). The task of the ML algorithm is unsupervised, as it requires no human intervention to label the patients’ samples according to some prior knowledge; 3) develop a supervised learning algorithm which can accurately predict the trend of the virus spread using COVID-19 time-series data.

This paper makes the following contributions: 1) vaccines can be improved through knowledge from the conditional probability distribution of variables in the COVID-19 data; for example, if certain symptoms have a high probability of causing COVID-19 deaths, such symptoms can be further studied to understand their nature and strength; 2) knowledge from clustering the patients’ data can help all stakeholders, such as the health agencies, society, and government, to know how the symptoms contribute to COVID-19 in certain demographics. For example, we can observe whether certain demographics have symptom patterns different from others.

The size of the data required to implement the proposed methodology at a national or state level is significantly large; however, within the socially responsible modeling, computation, and design (SoReMo) research context, the dataset for Cook County in Illinois is used in forecasting COVID-19 spread and developing a probabilistic graphical model on patients’ data. The rest of this paper is organized as follows. The next section covers the related works, followed by the methodology. Next come the experimental settings, followed by the results and discussions. Lastly, we present the conclusion. For the reader’s convenience, the acronyms used in this paper are summarized in Table 1.

Table 1. List of acronyms

| Acronym | Meaning |
|---|---|
| ARIMA | Auto-Regressive Integrated Moving Average |
| BN | Bayesian Network |
| CDC | Centers for Disease Control and Prevention |
| COVID-19 | Coronavirus Disease 2019 |
| CPD | Conditional Probability Distribution |
| DAG | Directed Acyclic Graph |
| ML | Machine Learning |
| MLE | Maximum Likelihood Estimation |
| PGM | Probabilistic Graphical Model |
| RMSE | Root Mean Square Error |
| SoReMo | Socially Responsible Modeling, Computation, and Design |

Methodology

In this section, the proposed methods for the data-driven modeling of COVID-19 are presented. The section begins with the definition of probabilistic graphical models (PGMs), continues with the clustering of data with ML, and ends with the time-series forecasting of the disease. It is worth noting that the same dataset is used for the PGM and clustering, while a separate time-series dataset is used for the forecasting, as described in context.

Probabilistic Graphical Models

Probability theory provides the framework for assessing multiple variables and their likelihood; this makes the modeling of complex reasoning more faithful to reality. The reasoning task is to estimate the probability of one or more variables given the observation of other variables. To do so, the joint probability distribution over the space of possible states (where each variable has one or multiple states) for the set of random variables \(\mathcal{V}\) needs to be constructed. Computing the joint distribution over a set of dozens or hundreds of random variables is a daunting task, but PGMs can be used to simplify the construction of such complex distributions. PGMs, as the name implies, use a graph-based representation to encode the conditional (in)dependencies between random variables. The graph depicts the causal relationships that exist between variables in the distribution. Consider a distribution modeled as a graph \(\mathcal{G}(\mathcal{V}, \mathcal{E})\) where \(\mathcal{V}\) is the set of variables and \(\mathcal{E}\) is the set of edges; the variable interactions can be expressed in the form \((A \perp B \mid C)\), where \(A, B, C \in \mathcal{V}\) and \((C,A) \in \mathcal{E}\).

The joint probability distribution over a set of variables \(\mathcal{V} = \{V_1, \ldots, V_n\}\) can be expressed as \(P(V_1 = v^k_1, \ldots, V_n = v^k_n)\) where \(v^k_i \in \mathcal{S}_i\) and \(\mathcal{S}_i\) is the set of states for variable \(V_i\). With the graph structure, the joint distribution is decomposed into smaller factors based on the conditional independencies among variables. The joint distribution can be defined as a product of the factors as:

\[\begin{equation} P(V_1, \ldots, V_n) = \prod_{i=1}^{n}P(V_i \mid Pa(\hat{V_i})) \end{equation}\] where \(n\) is the total number of variables in \(\mathcal{G}\), while \(V_i\) and \(Pa(\hat{V_i})\) represent a variable in \(\mathcal{V}\) and the ‘parents’ of the variable respectively. Each factor \(P(V_i \mid Pa(\hat{V_i}))\) is referred to as the conditional probability distribution of the variable \(V_i\).

Bayesian Networks

Two main classes of PGMs are used to compute probability distributions from the graph structure. One is the Bayesian network, in which the edges are directed as parent \(\rightarrow\) child, and the other is the Markov network, in which the edges are undirected. The two networks differ in the set of independencies they encode and in the factorization of the distribution that they induce. Figure 1 shows a three-variable Bayesian network example with the independence assertion \((A \perp B \mid C)\), where \(A\) is a child of \(C\) and is independent of \(B\) given \(C\). This can be interpreted through the Markov condition (Tsagris 2019), which states that each variable is independent of its non-descendants given its parent(s). In this example, \(B\) is not a descendant of \(A\), and the edge \(C \rightarrow A\) exists if \(C\) is a direct cause of \(A\). Probabilistically, the conditional independence holds if \(P(A,B \mid C)=P(A \mid C)P(B \mid C)\).
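The factorization and the independence assertion can be checked numerically on a small example. The sketch below assumes a three-node network with edges \(C \rightarrow A\) and \(C \rightarrow B\) (one structure that is consistent with the assertion, not necessarily the one drawn in Figure 1), and all probability values are made up for illustration.

```python
from itertools import product

# Hypothetical CPDs for a network C -> A, C -> B with binary states 0/1.
# The numbers are illustrative only; they do not come from the COVID-19 data.
p_c = {0: 0.7, 1: 0.3}                    # P(C)
p_a_given_c = {0: {0: 0.9, 1: 0.1},       # P(A | C=0)
               1: {0: 0.4, 1: 0.6}}       # P(A | C=1)
p_b_given_c = {0: {0: 0.8, 1: 0.2},       # P(B | C=0)
               1: {0: 0.5, 1: 0.5}}       # P(B | C=1)


def joint(a, b, c):
    """P(A=a, B=b, C=c) as the product of each node's CPD given its parents."""
    return p_c[c] * p_a_given_c[c][a] * p_b_given_c[c][b]


def conditional_ab_given_c(a, b, c):
    """P(A=a, B=b | C=c), obtained by normalising the joint over A and B."""
    norm = sum(joint(x, y, c) for x, y in product((0, 1), repeat=2))
    return joint(a, b, c) / norm


# The factorised joint must sum to one over all state combinations.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))

# Check P(A, B | C) = P(A | C) P(B | C) for every state combination.
independent = all(
    abs(conditional_ab_given_c(a, b, c)
        - p_a_given_c[c][a] * p_b_given_c[c][b]) < 1e-12
    for a, b, c in product((0, 1), repeat=3)
)
```

Under these assumptions, `total` is one up to rounding and `independent` is true, which is the conditional independence implied by the graph structure.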

Formally, a Bayesian network, or BN for short (Pearl 1988), consists of a directed acyclic graph (DAG) \(\mathcal{G}\) over variables \(V_i \in \mathcal{V}\) and a joint probability distribution \(P\). The graph is acyclic because no directed cycles are allowed. From a BN, any probability of the form \(P(V_i \mid Pa(\hat{V_i}))\) can be determined, i.e., the conditional probability of each variable can be computed. Note that if a variable has no parent, only the marginal probability \(P(V_i)\) is computed. In practice, a BN is constructed from a dataset of samples, and once a feasible structure is found, we can run inference on the dataset.

Fig. 1 An example of a Bayesian network structure illustrating conditional independence

Constructing a BN requires learning the conditional independencies in the dataset. Several structure learning algorithms search for the best graphical structure given the dataset. There are two broad structure learning techniques: score-based structure learning and constraint-based structure learning. A score-based algorithm traverses the search space of possible DAGs and selects the one with the optimal score, whereas a constraint-based algorithm uses conditional independence tests on the data to decide which edges to keep. The most popular of these algorithms include the Exhaustive search, HillClimb search, PC algorithm (Tsagris 2019) and MMHC algorithm (Tsamardinos, Brown, and Aliferis 2006). Once a DAG has been found, the next task is to estimate the conditional probability distributions (CPDs), defined as the set of parameter values \(\theta_\mathcal{G}\) for the variables \(V_i \in \mathcal{V}\). For each variable \(V_i\), there is a separate multinomial probability distribution over its states \(v^k_i\) for every possible configuration of its parents’ states. Thus, each variable’s CPD is represented as a full table.

The natural way to estimate the CPD of each variable is to use the relative frequencies with which the variable states occur; this approach is called maximum likelihood estimation (MLE). The MLE tends to overfit the data when there are not enough observations. In contrast, the Bayesian parameter estimator starts with some pseudo state counts before observing the data; this reduces the tendency toward overfitting or biased estimation.
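The difference between the two estimators can be illustrated with a minimal sketch. The function name `estimate_cpd`, the uniform pseudo-count prior, and the toy observations are assumptions made for this illustration; they are not the estimator configuration used in the experiments.

```python
from collections import Counter


def estimate_cpd(child_values, parent_values, child_states, pseudo_count=0.0):
    """Estimate P(child | parent) from paired observations.

    pseudo_count = 0 gives the MLE (relative frequencies); a positive value
    adds that many imaginary observations to every child state, which is a
    simple Bayesian estimate under a uniform prior (an assumption here).
    """
    counts = Counter(zip(parent_values, child_values))
    cpd = {}
    for parent in sorted(set(parent_values)):
        total = (sum(counts[(parent, s)] for s in child_states)
                 + pseudo_count * len(child_states))
        cpd[parent] = {s: (counts[(parent, s)] + pseudo_count) / total
                       for s in child_states}
    return cpd


# Made-up observations: only two samples have parent state 1.
parent = [0, 0, 0, 0, 1, 1]
child = [0, 0, 0, 1, 1, 1]

mle = estimate_cpd(child, parent, child_states=(0, 1))
bayes = estimate_cpd(child, parent, child_states=(0, 1), pseudo_count=1.0)
```

With only two observations for parent state 1, the MLE assigns probability zero to child state 0, while the smoothed estimate keeps a non-zero probability for it. This is the overfitting behavior that the pseudo counts are meant to reduce.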

Data Clustering

Clustering is important for uncovering the hidden structure of data, because it is often difficult to identify similarities between data points by inspection when the data is high-dimensional and has a large variance. The goal of clustering is to identify subgroups in the data such that data points with similar properties or patterns belong to the same cluster, while data points that are not similar belong to different clusters. Clustering algorithms can find such similarities by using the Euclidean distance function to assign data points to appropriate clusters. Unlike supervised learning, clustering is an unsupervised learning task, since no ground truth is available against which the clustering outcomes can be compared and evaluated (Dabbura 2022).

The most popular and widely used clustering algorithm is the Kmeans algorithm, due to its simplicity and robustness in fitting data points. As the name suggests, the Kmeans algorithm partitions the dataset into \(K\) pre-defined clusters where each data point belongs to only one cluster. The clustering algorithm tries to minimize an objective function, which is the sum of squared errors (SSE) between the data points and the cluster centers, as follows:

\[\begin{equation} \underset{\mu_1,\cdots,\mu_K \in Z}{\text{minimize}} \quad \sum_{k=1}^K \sum_{z \in Z_k} \lVert z - \mu_k \rVert^2 \end{equation}\] where \(Z_k\) is the set of data points that are closer to the center point \(\mu_k\) than to any other center point.

Specifically, the algorithm tries to minimize the intra-cluster distance and maximize the inter-cluster distance, while keeping the SSE as low as possible. The Kmeans algorithm works by initializing \(K\) centroids (i.e., cluster centers) randomly, assigning each data point to its nearest centroid, and recomputing each centroid as the mean of the points assigned to it; these two steps are repeated until there is no change in the assignment of data points to centroids, at which point the algorithm has converged. A drawback of this algorithm is that, since the initialization of centroids is random, the algorithm may get stuck in a local minimum and not reach the global minimum, as different initializations usually lead to different clusters. This problem is mitigated by the Kmeans++ algorithm (Arthur and Vassilvitskii 2006), which works like the Kmeans algorithm but differs in the centroid initialization process: the initial centroids are chosen to be spread out across the data, which improves the quality of clustering.
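The procedure can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, the toy points are made up, and the seeding follows the usual Kmeans++ rule in which each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng):
    """Kmeans++ seeding: the first centroid is uniform at random, the others
    are sampled with probability proportional to the squared distance to the
    nearest centroid chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, seed=0, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # converged: no data point changed cluster
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse


# Made-up two-dimensional points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels, sse = kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. The returned `sse` is the value of the objective function defined above.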

Time-series Forecasting

Learning from data supports decision-making, and PGMs can be used to answer many questions of uncertainty from data. However, we also need to be able to make predictions about the future from historical data. Machine learning is a powerful tool for such tasks, as it helps in forecasting, predicting, and extracting meaningful information from historical data. In several data science fields, time-series forecasting is considered a pertinent regression task, as it deals with sequences of observations at regular time intervals, be it hourly, daily, weekly, monthly, or yearly. The regression task in time-series forecasting is to predict future values based on current and previous observations. While this may seem to be an interesting, and perhaps simple task, it is by no means trivial, as it requires an understanding of the components of the time-series data such as trend, seasonality, irregularity, and cyclicity (“Using Machine Learning for Time Series Forecasting Project,” n.d.). These components relate to the pattern repetition of cycles (seasonality), the increase or decrease in the behavior of the observations (trend) and repetitive changes in the time-series (cyclicity).

Several traditional approaches such as exponential smoothing or ARIMA have been applied to forecasting, but ML approaches have been reported to achieve higher accuracy for the prediction task. With ML, the problem is formulated in a supervised learning manner, where the input can be designed using time-delay embedding techniques, i.e., for an instance \(i\), the feature vector is \(\mathbf{x}^{(i)}=(y^{(i-1)}, \ldots, y^{(i-N)})\) where \(y^{(i-N)}\) is the output observed \(N\) steps before the \(i\)th instance. The goal of learning is to find the model that best fits the data such that \(\hat{y}^{(i)} \approx y^{(i)}\), where \(\hat{y}^{(i)}\) and \(y^{(i)}\) are the predicted value and target value respectively.
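As an illustration, the time-delay embedding can be written as a short function that turns a series into feature vectors and targets. The function name `time_delay_embedding` and the toy series are hypothetical and serve only to show the construction.

```python
def time_delay_embedding(series, n_lags):
    """Build supervised-learning pairs from a univariate series.

    For each time step t >= n_lags, the feature vector is
    (series[t-1], ..., series[t-n_lags]) and the target is series[t].
    """
    features, targets = [], []
    for t in range(n_lags, len(series)):
        features.append(tuple(series[t - lag] for lag in range(1, n_lags + 1)))
        targets.append(series[t])
    return features, targets


# Made-up daily counts, for illustration only.
daily_cases = [5, 8, 13, 21, 34, 55]
X, y = time_delay_embedding(daily_cases, n_lags=3)
# X[0] == (13, 8, 5) is used to predict y[0] == 21
```

Any regression model can then be trained on the pairs `(X, y)`; the first `n_lags` observations have no complete history and therefore produce no training instance.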

The loss function for the regression model is defined as the root mean square error (RMSE), which equals the standard deviation of the prediction errors when their mean is zero. The lower the RMSE, the better the model fits the data:

\[\begin{equation} RMSE = \sqrt{\frac{1}{M} \sum_{i=1}^{M}(\hat{y}^{(i)}-y^{(i)})^2} \end{equation}\] where \(M\) is the sample size of the dataset.
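The RMSE definition translates directly into code. The sketch below uses made-up predictions and targets, and the function name `rmse` is chosen for this illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Errors are 1, 0 and -2, so the mean squared error is 5/3.
score = rmse([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])
```

Because the errors are squared before averaging, a single large error raises the RMSE more than several small ones of the same total size.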

Among the popular regression models with strong capability for time-series forecasting are the random forest regression algorithm and linear regression model. The performance comparison for both algorithms on the COVID-19 time-series dataset is presented in the results and discussion section.

Experimental Settings

In this section, the steps employed in the design and implementation of the proposed methods are presented. For each subsection, the dataset used for the experiments is specified without ambiguity.

Probabilistic Graph Structure Learning on CDC Dataset

The scope of this work is limited to COVID-19 cases in Chicago, the largest city in Cook County, Illinois. Two separate datasets are used: one for the time-series forecasting, and one for the BN and data clustering. For the forecasting, a publicly available COVID-19 dataset in the Google Health COVID-19 Open Data Repository (“Covid-19 Open Data - Google Health,” n.d.) is used; the dataset records the number of daily confirmed cases and death cases from January 25, 2020, to May 13, 2022. For the BN and data clustering, the CDC COVID-19 Case Surveillance Data is used (“Covid-19 Case Surveillance Restricted Access Detailed Data” 2021). Note that the CDC does not take responsibility for the scientific validity or accuracy of methodology, results, or statistical analysis (“Covid-19 Case Surveillance Restricted Access Detailed Data” 2021).

The CDC dataset consists of features that are sub-classified into symptoms, demographic information, and severity, for patients who are infected by COVID-19. There are a total of 24 features in the dataset, and 7,517 samples with these features are used for the Chicago case study. Table 2 shows the alphabetic coding of each variable and its cardinality in the dataset (where cardinality refers to the number of variable states). The final six variables without codes in Table 2 are removed from the processed dataset because their values are either ‘missing’ or ‘unknown’ for most of the cases. Also, since there is a strong correlation between whether a patient is hospitalized and whether the patient is taken to the intensive care unit (ICU), the variable ‘hospitalized’ is removed to avoid redundant information (this helps the BN to find a good structure). This is because a patient who is in the ICU is also hospitalized, and for many instances in the dataset, patients who are hospitalized are taken to the ICU. Hence, a total of 17 variables are used in constructing the BN. All of the variables except age, sex, and race have binary states: yes(1) and no(0). These binary values (alongside the variable codes) will be used to interpret the CPD of the variables in the results and discussion section.

Table 2. Dataset features description

| Code | Variable | Cardinality | Type |
|---|---|---|---|
| A | abdominal pain | 2 | symptom |
| B | abnormal chest X-ray | 2 | symptom |
| C | age group | 9 | demographic |
| D | chills | 2 | symptom |
| E | cough | 2 | symptom |
| F | death | 2 | severity |
| G | diarrhea | 2 | symptom |
| H | fever | 2 | symptom |
| I | headache | 2 | symptom |
| K | intensive care unit | 2 | severity |
| L | mechanical ventilation | 2 | severity |
| M | nausea or vomiting | 2 | symptom |
| N | pneumonia | 2 | symptom |
| O | race ethnicity combined | 6 | demographic |
| P | sex | 2 | demographic |
| Q | shortness of breath | 2 | symptom |
| R | sore throat | 2 | symptom |
| | hospitalized | 2 | severity |
| | acute respiratory distress syndrome | 2 | symptom |
| | subjective fever | 2 | symptom |
| | health care worker | 2 | demographic |
| | medical condition | 2 | severity |
| | runny nose | 2 | symptom |
| | muscle aches | 2 | symptom |

Learning the BN from the dataset requires the selection of a search strategy, i.e., an algorithm that finds the best structure that encodes the conditional independencies in the data. The search space of DAGs is super-exponential in the number of variables in the dataset. The exhaustive search algorithm identifies the ideal structure, but it is only practical if the dataset consists of no more than about five variables (i.e., nodes). Since this is not realistic in practice, as we need to handle data with a high-dimensional variable space, a heuristic search algorithm is used instead: it implements a greedy local search, terminates when a local maximum of the score is reached, and returns a good (though not necessarily optimal) structure. Figure 2 shows the constructed BN for the CDC COVID-19 dataset using the heuristic Hill-Climb search algorithm. The algorithm returns a DAG with 17 nodes and 47 edges.
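To give a sense of how quickly the search space grows, the number of labelled DAGs on \(n\) nodes can be computed with Robinson's recurrence. This recurrence is a standard combinatorial result that is brought in here for illustration; it is not part of the structure learning procedure itself.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled directed acyclic graphs on n nodes
    (Robinson's recurrence, by inclusion-exclusion over nodes without parents)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


small = count_dags(5)    # 29281 candidate structures for five nodes
large = count_dags(17)   # candidate structures for the 17 variables used here
```

Five nodes already admit 29,281 structures, and for 17 nodes the count exceeds \(2^{136}\), which is why an exhaustive search is not feasible for this dataset and a greedy heuristic is used instead.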

Fig. 2 Directed acyclic graph using Hill-Climb search algorithm

The method used for learning the parameters, i.e., estimating the CPDs for each variable in the dataset, is the Bayesian parameter estimation method. The MLE, which uses the relative frequencies, tends to overfit the data and is therefore not used for estimating the CPDs. The following list shows, for each variable, the factor learned by the BN (the variable conditioned on its parents), followed by an arrow pointing to the local independencies associated with that variable, if any. The number after each variable is its number of states. For example, A:2 means the variable \(A\) has two possible states.

P(A:2)

P(G:2 | A:2)

P(M:2 | A:2, E:2, G:2, K:2) \(\rightarrow\) (M ⟂ Q, N, B, H, L | A, G, K, E)

P(B:2 | A:2, G:2, L:2)

P(N:2 | B:2, H:2, K:2, L:2, Q:2) \(\rightarrow\) (N ⟂ A, R, G, D, E, M | K, Q, B, H, L)

P(Q:2 | B:2, G:2, K:2) \(\rightarrow\) (Q ⟂ A, L | K, G, B)

P(H:2 | B:2, E:2, K:2, L:2) \(\rightarrow\) (H ⟂ A, R, G, Q, M | K, B, E, L)

P(E:2 | B:2, K:2, Q:2) \(\rightarrow\) (E ⟂ A, G, L | K, B, Q)

P(K:2 | B:2, G:2, L:2) \(\rightarrow\) (K ⟂ A | G, B, L)

P(C:9 | I:2, K:2, Q:2) \(\rightarrow\) (C ⟂ R, G, B, E, L, M, A, D, F, N, H | I, K, Q)

P(O:6 | C:9) \(\rightarrow\) (O ⟂ R, G, Q, B, E, L, M, A, D, K, F, N, I, H | C)

P(P:2 | C:9, O:6) \(\rightarrow\) (P ⟂ R, G, Q, B, E, L, M, A, D, K, F, N, I, H | C, O)

P(D:2 | G:2, H:2, M:2, R:2) \(\rightarrow\) (D ⟂ A, K, Q, N, B, E, L | R, G, H, M)

P(I:2 | D:2, G:2, N:2, R:2) \(\rightarrow\) (I ⟂ Q, B, E, L, M, A, K, H | R, G, N, D)

P(R:2 | E:2, M:2, Q:2) \(\rightarrow\) (R ⟂ A, G, K, N, B, H, L | E, Q, M)

P(L:2 | G:2) \(\rightarrow\) (L ⟂ A | G)

P(F:2 | I:2, K:2, N:2) \(\rightarrow\) (F ⟂ R, G, Q, B, E, C, L, M, A, D, P, H, O | I, K, N)

The causal relationships between variables can guide us in knowing how to run inference using the query (child) variable and evidence variables (parents). For example, from the list, we can observe that abdominal pain (A), diarrhea (G), and abnormal chest X-ray (B) have no local independencies, and that abdominal pain (A) is not influenced by any variable, but it is a direct cause of diarrhea (G), nausea or vomiting (M) and abnormal chest X-ray (B).
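The listed factors can be encoded as a parent map and checked programmatically. The sketch below copies the parent sets and cardinalities from the list above; the helper names `topological_order` and `cpd_table_size` are hypothetical, and the code only verifies the reported structure rather than reproducing the structure learning.

```python
# Parent sets copied from the factors listed above.
parents = {
    "A": [], "G": ["A"], "M": ["A", "E", "G", "K"], "B": ["A", "G", "L"],
    "N": ["B", "H", "K", "L", "Q"], "Q": ["B", "G", "K"],
    "H": ["B", "E", "K", "L"], "E": ["B", "K", "Q"], "K": ["B", "G", "L"],
    "C": ["I", "K", "Q"], "O": ["C"], "P": ["C", "O"],
    "D": ["G", "H", "M", "R"], "I": ["D", "G", "N", "R"],
    "R": ["E", "M", "Q"], "L": ["G"], "F": ["I", "K", "N"],
}
cardinality = {v: 2 for v in parents}
cardinality.update({"C": 9, "O": 6})

n_edges = sum(len(p) for p in parents.values())


def topological_order(parent_map):
    """Order the nodes so that parents come before children.
    Raises ValueError if the graph contains a directed cycle."""
    remaining = {v: set(p) for v, p in parent_map.items()}
    order = []
    while remaining:
        ready = sorted(v for v, p in remaining.items() if not p)
        if not ready:
            raise ValueError("the graph contains a directed cycle")
        for v in ready:
            order.append(v)
            del remaining[v]
        for p in remaining.values():
            p.difference_update(ready)
    return order


def cpd_table_size(variable):
    """Number of entries in the full CPD table of a variable."""
    size = cardinality[variable]
    for p in parents[variable]:
        size *= cardinality[p]
    return size


order = topological_order(parents)
```

The parent map gives 47 edges over 17 nodes and admits a topological order starting at abdominal pain (A), which is consistent with the reported DAG. The table sizes also show why CPDs grow quickly: the age group (C) table has \(9 \times 2 \times 2 \times 2 = 72\) entries.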

Kmeans++ Clustering on CDC Dataset

Beyond the probabilistic graph structure learning, there is a need to extract meaningful knowledge, such as symptom patterns, from the dataset. Running the Kmeans++ algorithm requires that the number of clusters \(K\) be specified. However, the major challenge is knowing which \(K\) value will give the best clustering result, as choosing the wrong \(K\) value can cause data points to be assigned to the wrong clusters. In designing the Kmeans++ algorithm, the elbow method (Kodinariya and Makwana 2013) is used in the experiments as a guide for selecting a suitable \(K\) value for the clustering.

Considering the context of the clustering, which is to find symptoms patterns in the dataset, the \(K\) value suggested by the elbow method is used for the initial clustering, and then an optimal \(\hat{K}\) value is computed as:

\[\begin{equation} \hat{K} = \sum_{i=1}^{K} n_i(P)n_i(O)n_i(C) \end{equation}\]

where \(n_i(P)\), \(n_i(O)\), and \(n_i(C)\) are the cardinality of the demographic variables - sex, race, and age respectively, in class \(i\).
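The computation of \(\hat{K}\) can be sketched as follows, under the assumption that \(n_i(\cdot)\) counts the distinct states of a demographic variable that are observed among the data points of class \(i\). The function name `refined_cluster_count` and the toy records are hypothetical.

```python
def refined_cluster_count(records):
    """Compute K-hat from (cluster_label, sex, race, age_group) records.

    For each initial cluster, the numbers of distinct sex, race and age-group
    states observed in that cluster are multiplied, and the products are
    summed over all clusters.
    """
    per_cluster = {}
    for label, sex, race, age in records:
        sexes, races, ages = per_cluster.setdefault(label, (set(), set(), set()))
        sexes.add(sex)
        races.add(race)
        ages.add(age)
    return sum(len(s) * len(r) * len(a) for s, r, a in per_cluster.values())


# Made-up records: (cluster label, sex code, race code, age-group code).
records = [
    (0, 0, 5, 2), (0, 1, 5, 2), (0, 1, 3, 4),   # cluster 0: 2 * 2 * 2 = 8
    (1, 0, 2, 7), (1, 0, 2, 8),                 # cluster 1: 1 * 1 * 2 = 2
]
k_hat = refined_cluster_count(records)
```

For these toy records the result is 8 + 2 = 10, so the two initial clusters would be refined into ten demographic-aware classes.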

The clustering algorithm outputs the data points and their associated class labels. As part of the symptom pattern learning process, the labeled data is used to train a deep neural network model in a supervised learning manner. The deep neural network consists of an input layer, four hidden layers with ReLU activation, and an output layer with SoftMax activation. The configurations of the hyperparameters for training the deep neural network model are listed in Table 3. The dimension \(m \times n\) of each weight matrix \(\mathbf{W}_i\) corresponds to the number of neurons by the number of input features of that layer.

Table 3. Configuration of the deep neural network

| Name | Value |
|---|---|
| Epochs | 500 |
| Batch size | 50 |
| Layers | 5 |
| Optimizer | Adam |
| Regularization | Early stopping |
| W1 | 64x11 |
| W2, W3 | 64x64, 128x64 |
| W4, W5 | 128x128, 124x128 |
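The layer dimensions in Table 3 can be illustrated with a forward pass through a network of the same shape. This is only a sketch: the weights are random rather than trained, bias terms are omitted, and the 11 inputs are assumed to be binary symptom indicators, none of which is specified by the configuration above.

```python
import math
import random

# Weight shapes (neurons x inputs) taken from Table 3.
weight_shapes = [(64, 11), (64, 64), (128, 64), (128, 128), (124, 128)]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Numerically stable softmax; the outputs are positive and sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply an m x n matrix (list of rows) by a length-n vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_weights(shapes, seed=0):
    """Random placeholder weights; a trained model would supply these."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(m)]
            for m, n in shapes]


def forward(weights, features):
    """Four ReLU hidden layers followed by a SoftMax output layer."""
    activation = features
    for matrix in weights[:-1]:
        activation = relu(matvec(matrix, activation))
    return softmax(matvec(weights[-1], activation))


weights = init_weights(weight_shapes)
patient = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]   # made-up symptom indicators
class_probabilities = forward(weights, patient)
predicted_class = class_probabilities.index(max(class_probabilities))
```

The output is a probability vector over 124 classes, matching the number of rows of the last weight matrix, and the predicted class is the index with the highest probability.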

Regression Models for COVID-19 Time-series Dataset

The first step in designing a regression model for time-series forecasting is to know the properties of the data, e.g., whether it is stationary or not. This understanding provides proper guidance on the feature preprocessing steps and the parameter settings for a regression model. The COVID-19 time-series dataset, which captures the trend of daily confirmed cases and death cases, is shown in Figure 3. There is a seasonal pattern in the daily death cases, as more death cases are recorded during the spring and winter seasons than in the summer season; however, there is no such pattern in the daily confirmed cases.

Fig. 3 Left-Right: Trend of daily confirmed cases and daily death cases

Given the trend in the time-series data in Figure 3, the random forest and linear regression models are designed using the time-delay embedding technique. Each input instance in the dataset is a 7-dimensional feature vector which corresponds to the number of cases recorded in each of the past seven days. This means that the prediction for the next day depends on the cases recorded over the preceding seven days. Here, the random forest and linear regression models are trained separately using two dataset configurations, i.e., date+confirmed cases and date+deceased cases.

Results and Discussions

This section begins with some of the CPDs learned by the BN, and then proceeds to discuss the performance of the deep neural network model based on the data clustering experience. Finally, the time-series forecasting results are presented.

Conditional Probability Distributions

The output of learning the parameters of the BN is the CPDs for each variable. The Markov condition, “a variable is independent of non-descendants given parents”, is used to derive the CPDs. Figures 4 to 13 show the CPDs of selected variables, where the rows correspond to the target variable states {yes(1), no(0)} and the columns correspond to the different parents’ state {yes(1), no(0)} configurations. For example, [row1, column1] in Figure 4 is interpreted as P(G=0 | A=0) = P(diarrhea=‘No’ | abdominal pain=‘No’) = 0.90. Variables with more than two states, or whose states are not yes(1) and no(0), are defined explicitly. The variable ‘age group’ has nine states, defined as {(0-9yrs), (10-19yrs), (20-29yrs), (30-39yrs), (40-49yrs), (50-59yrs), (60-69yrs), (70-79yrs), (80+yrs)} = {0, 1, 2, 3, 4, 5, 6, 7, 8}; the variable ‘sex’ has two states, defined as {‘Female’, ‘Male’} = {0, 1}; and the variable ‘race and ethnicity’ has six states, defined as:

{(American Indian/Alaska Native, Non-Hispanic), (Asian Non-Hispanic), (Black Non-Hispanic), (Hispanic/Latino), (Native Hawaiian/Other Pacific Islander), (White Non-Hispanic)} = {0, 1, 2, 3, 4, 5}

Fig. 4 CPD table for diarrhea (G)

The CPD table in Figure 4 shows that a patient is less likely to have diarrhea (G) if there is no diagnosis of abdominal pain (A) in the patient and vice-versa.

Fig. 5 CPD table for abnormal chest X-ray (B)

Whether a patient has an abnormal chest X-ray (B) or not depends on abdominal pain (A), diarrhea (G) and use of mechanical ventilation (L), as shown in Figure 5. There is a high probability (up to 0.94) that a patient has an abnormal chest X-ray if the patient is assisted with mechanical ventilation, with or without the presence of diarrhea and abdominal pain.

Fig. 6 CPD table for age group (C)

Figure 6 shows the CPD of the age group (C). For example, patients who are diagnosed with COVID-19 and fall within the age ranges (0-9 years), (10-19 years), (20-29 years) and (30-39 years) are less affected by headache (I), use of ICU (K) and shortness of breath (Q), while the older population has a high probability of suffering from COVID-19 given that they are diagnosed with headache and/or are taken to the ICU for treatment.

Fig. 7 CPD table for cough (E)

A patient is most likely to have cough (E) if diagnosed with abnormal chest X-ray (B) and shortness of breath (Q), and already taken to the ICU (K). It is possible that if the patient is treated for these symptoms, the cough symptom will either be reduced or prevented. For example, as shown in Figure 7, the probability of having cough is as high as 0.86 when these parent symptoms are present in a patient, and in their absence the cough probability can be as low as 0.34.

Fig. 8 CPD table for death (F)

As shown in Figure 8, whether a patient will die of COVID-19 or not depends on headache (I), ICU (K) and pneumonia (N). The diagnosis of headache, the use of ICU, or both in a patient can result in a death probability of up to 0.61, 0.98 and 0.75 respectively. This means that a patient taken to the ICU is likely to die with a probability of 0.98, while a COVID-19 patient who has a headache has a death probability of 0.61. The extreme case is when a patient has pneumonia, where the death probability is as high as 0.96 and 0.99. This reflects the severity of these symptoms in causing COVID-19 death and can help health facilities such as clinics, medical laboratories and hospitals to administer emergency treatment to a patient with headache and/or pneumonia during the COVID-19 pandemic. Individuals themselves can use these symptoms as an indicator to seek immediate medical help.

Fig. 9 CPD table for intensive care unit (K)

The ICU (K) variable depends on abnormal chest X-ray (B), diarrhea (G) and mechanical ventilation (L). The CPD table in Figure 9 reveals that there is a high probability (up to 0.9) for the use of an ICU if a patient cannot breathe properly and needs mechanical ventilation to breathe. The probability increases to 0.99 if the patient is also diagnosed with an abnormal chest X-ray and diarrhea.

Fig. 10 CPD table for mechanical ventilation (L)

Figure 10 supports the CPD table in Figure 9. As seen, the use of mechanical ventilation (L) does not strongly depend on the diagnosis of diarrhea (G) in a patient.

Fig. 11 CPD table for race and ethnicity (O)

The CPD table in Figure 11, obtained from the learned BN structure, does not reveal a strong relationship between a patient’s race and ethnicity (O) and the age group (C). This is expected because a person’s race cannot be determined from their age. We expect these variables to be independent of each other, but the BN has learned the parameters from the ancestors of the parent variables.
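One simple way to probe whether two discrete variables such as age group and race/ethnicity behave as independent is Pearson's chi-square statistic on their contingency table. The sketch below is only an illustration of that check and is not part of the pipeline described in this paper; the function name and both count tables are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in cell (i, j) if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts (assumption: NOT from the CDC dataset).
# Rows are two age groups, columns are two race/ethnicity groups.
independent_table = [[20, 30], [40, 60]]   # same column proportions in each row
dependent_table = [[45, 5], [10, 90]]      # column proportions differ by row

print(chi_square_statistic(independent_table))
print(chi_square_statistic(dependent_table))
```

A statistic close to zero is consistent with independence, while a large value indicates that the distribution of one variable changes with the other.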

Fig. 12 CPD table for shortness of breath (Q)

Whether a patient has shortness of breath (Q) depends on whether the patient has an abnormal chest X-ray (B), has diarrhea (G) and is taken to the ICU (K). From Figure 12, the presence of an abnormal chest X-ray strongly influences shortness of breath in a patient. When it is present, the probability is always high compared to when it is absent. In combination with the remaining two parent variables, the probability that a patient has shortness of breath is as high as 0.94.

Fig. 13 CPD table for sore throat (R)

The CPD table in Figure 13 shows the effect of cough (E), nausea or vomiting (M) and shortness of breath (Q) on sore throat (R). There is a low probability that a patient has sore throat given the parent symptoms. Note that this is a result of the samples in the dataset, which do not capture the actual distribution over a large population.

Efficiency of Clustering

The learning of the symptom patterns in the CDC COVID-19 dataset is efficient, as we can observe from Figure 14, where the top-3 principal components of the data points are projected in 3D for six classes for ease of visualization. Recall that the data points correspond to patients’ symptom vectors; we can see that, based on clustering, the data points are separated into distinguishable classes without overlapping. This means that there is a hidden structure in the data which can be discovered through data clustering.
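The idea of separating symptom vectors into classes can be sketched with a few plain Lloyd's (k-means style) iterations. This is a minimal illustration under stated assumptions: the function names, the binary toy vectors and the fixed initial centroids are hypothetical, and the clustering configuration actually used for Figure 14 (including initialization and the number of classes) may differ.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iterations=10):
    """Run plain Lloyd's iterations starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical binary symptom vectors (assumption: NOT real patient data):
# (abnormal chest X-ray, cough, pneumonia, shortness of breath)
patients = [
    (1, 0, 1, 1), (1, 0, 1, 1), (1, 1, 1, 1),
    (0, 1, 0, 0), (0, 1, 0, 0), (0, 0, 0, 0),
]
labels, centroids = kmeans(patients, centroids=[patients[0], patients[3]])
print(labels)
```

The resulting labels play the role of the symptom classes: patients with similar symptom vectors receive the same class index.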

Fig. 14 Illustration of the clustering of patients’ samples

To show that the clustering result is efficient, a deep neural network model is trained on the labeled data obtained from clustering. For this task, the CDC dataset is split into training (80%), validation (15%), and testing (5%) sets. The model’s accuracy on the testing set is 99.47%. This means that when given a patient’s symptoms, the deep neural network model can predict the symptoms class with high accuracy. Figure 15 shows the accuracy (left) and loss (right) plots of the deep neural network model over the training epochs.
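The 80%/15%/5% partition described above can be sketched as follows. This is an illustrative helper only: the function name `split_dataset`, the shuffling and the fixed seed are assumptions, since the exact splitting procedure is not specified here.

```python
import random

def split_dataset(samples, train=0.80, validation=0.15, seed=0):
    """Shuffle samples and split them into training, validation and testing lists.

    Whatever remains after the training and validation portions
    (5% with the default fractions) becomes the testing set.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Stand-in for 1000 labeled patient records (hypothetical).
train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 800 150 50
```

Keeping the testing portion disjoint from the training and validation portions is what makes the reported testing accuracy a measure of generalization to unseen symptom vectors.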

The symptom structure of the top-5 classes from the clustering is presented in Table 4, which shows the most observed combination of patients’ symptoms in each class. For example, the patients in class 0 are Black Non-Hispanic, aged 80+ years, and are mostly diagnosed with an abnormal chest X-ray, pneumonia and shortness of breath, while the patients in class 3 are White Non-Hispanic, aged 70 to 79 years, and are mostly diagnosed with an abnormal chest X-ray, cough, fever, pneumonia and shortness of breath.

Fig. 15 Deep neural network model performance on clustered data

Table 4. Demographic clustering of patients’ symptoms
Class C O P A B D E G H I M N Q R
0 8 2 0 0 1 0 0 0 0 0 0 1 1 0
1 7 3 1 0 1 1 1 0 1 0 0 1 1 0
2 3 3 1 0 1 0 1 1 0 0 0 1 1 0
3 7 5 0 0 1 0 1 0 1 0 0 1 1 0
4 6 3 1 0 0 0 1 0 0 0 0 0 1 0
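A per-class summary like Table 4 can be derived by taking, for each class, the symptom vector that occurs most often among its members. The sketch below assumes this "most frequent vector" interpretation; the function name and the toy vectors and labels are hypothetical.

```python
from collections import Counter, defaultdict

def most_common_vector_per_class(vectors, labels):
    """Return, for each class label, the most frequent vector in that class."""
    by_class = defaultdict(Counter)
    for vector, label in zip(vectors, labels):
        by_class[label][tuple(vector)] += 1
    return {label: counts.most_common(1)[0][0]
            for label, counts in by_class.items()}

# Hypothetical symptom vectors and class labels (assumption: NOT real data).
toy_vectors = [
    (1, 0, 1, 1), (1, 0, 1, 1), (1, 1, 1, 1),
    (0, 1, 0, 0), (0, 1, 0, 0), (0, 0, 0, 0),
]
toy_labels = [0, 0, 0, 1, 1, 1]

summary = most_common_vector_per_class(toy_vectors, toy_labels)
for label in sorted(summary):
    print(label, summary[label])
```

Each printed row corresponds to one row of a table in the style of Table 4, with one representative combination per class.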

Performance Comparison of Time-series Forecasting Models

For the COVID-19 time-series dataset, the random forest and linear regression models are compared with the traditional ARIMA model in predicting the number of future confirmed cases and deceased cases. The ARIMA model is parameterized by p (the autoregressive order), d (the degree of differencing), and q (the moving-average order), and from observation in the experiment, it is best configured as (p,d,q) = (10,1,5) and (p,d,q) = (10,1,3) for the date+confirmed cases and date+deceased cases datasets respectively. The datasets are split 0.9/0.1 into training and testing sets respectively.
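Two of the ingredients above can be illustrated directly: the 0.9/0.1 split and the differencing implied by d = 1. The sketch assumes the split is chronological (earlier observations for training, later ones for testing), which is the usual choice for time series but is not stated explicitly; the function names and the case counts are hypothetical.

```python
def chronological_split(series, train_fraction=0.9):
    """Split a time-ordered series without shuffling."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

def difference(series, d=1):
    """Apply first-order differencing d times (the 'd' in ARIMA(p, d, q))."""
    for _ in range(d):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series

# Hypothetical cumulative confirmed cases (assumption: NOT real data).
cases = [10, 12, 15, 19, 24, 30, 37, 45, 54, 64]

train, test = chronological_split(cases)
print(len(train), len(test))  # 9 1
print(difference(train))      # [2, 3, 4, 5, 6, 7, 8, 9]
```

Differencing turns the cumulative counts into day-to-day changes, which removes the upward trend before the autoregressive and moving-average terms are fitted.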

To evaluate the performance of the three models, the logarithm of the RMSE is used (the lower the value, the better). As shown in Figure 16, the \(\log(\mathrm{RMSE})\) value for the confirmed cases is high for all the models, while for the deceased cases, the \(\log(\mathrm{RMSE})\) value is low for all the models because of the repeating trend in the dataset, which is evident in Figure 3 (right). On the date+confirmed cases dataset, the linear regression and random forest models significantly outperform the ARIMA model, which supports the capability of ML over statistical or mathematical models in fitting complex time-series data.
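The evaluation metric can be written out as a short function. This sketch assumes the natural logarithm, since the base is not stated; the function name and the example values are hypothetical.

```python
import math

def log_rmse(actual, predicted):
    """Natural logarithm of the root mean squared error."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.log(math.sqrt(mse))

# Hypothetical observed and forecast values (assumption: NOT real results).
observed = [100, 200, 300]
forecast = [110, 190, 310]

# Every error is 10 in magnitude, so RMSE = 10 and the result is log(10).
print(round(log_rmse(observed, forecast), 4))
```

Because the logarithm is monotonic, ranking models by \(\log(\mathrm{RMSE})\) gives the same order as ranking them by RMSE; the logarithm only compresses the scale so that large and small errors can be shown on one plot.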

Fig. 16 Comparing model performance on time-series data

The random forest and linear regression models can be used to forecast the future impact of COVID-19 in any municipality and can also help to predict when the spread of the virus will stop, provided that sufficient and accurate historical data is available.

Research Impact to SoReMo Mission

With the proposed methodology of ML with BN structure and parameter learning (i.e., CPDs) from the COVID-19 dataset, there is much information to extract that can help society, government and health agencies to be more proactive in responding to the pandemic. First, as shown in this paper, society can now be aware of the relationships between symptoms associated with the COVID-19 disease and their likelihood. This means that when a person experiences certain symptoms, they can estimate the probability of the disease or of other related symptoms. Second, regarding the death variable in the dataset, when a patient suffers from pneumonia there is a need to immediately take the patient to an ICU with mechanical ventilation to save the patient. Third, policy makers can provide demographic health-related guidelines to patients based on the clustering result in this paper, which shows that patients within the same age group and race have similar symptoms associated with COVID-19. This can also help in administering vaccines to diverse groups within the population, especially in a city like Chicago. Lastly, all the results from the BN learning, data clustering and time-series forecasting can help medical professionals to effectively study other diseases that cause respiratory disorder or death.

In the context of the SoReMo mission, this paper has attempted to improve the health and safety awareness of the people of Chicago by employing detailed modeling and algorithm design for COVID-19. This is continuing work, as data on other diseases related to COVID-19 will be analyzed in future work and explained in a similar fashion.

Conclusion

In this paper, a data-driven approach for discovering the hidden knowledge about the COVID-19 pandemic in Chicago has been proposed. The social issues identified during the SoReMo research project have been addressed in this paper. Probabilistic graphical models and machine learning can provide the major stakeholders, i.e., society, government and health agencies, with useful information and insights on the direction to consider in stopping the spread of the virus and promoting the health and wellness of society. The Python code used for this research is made publicly available online.

License

The author of this technical report, which was written as a deliverable for a SoReMo project, retains the copyright of the written material herein upon publication of this document in SoReMo Reports.

References

Abdul Salam, Mustafa, Sanaa Taha, and Mohamed Ramadan. 2021. “COVID-19 Detection Using Federated Machine Learning.” PLoS One 16 (6): e0252573.
Adetiba, Emmanuel, Joshua A Abolarinwa, Anthony A Adegoke, Tunmike B Taiwo, Oluwaseun T Ajayi, Abdultaofeek Abayomi, Joy N Adetiba, and Joke A Badejo. 2022. “DeepCOVID-19: A Model for Identification of COVID-19 Virus Sequences with Genomic Signal Processing and Deep Learning.” Cogent Engineering 9 (1): 2017580.
Ahmar, Ansari Saleh, and Eva Boj Del Val. 2020. “SutteARIMA: Short-Term Forecasting Method, a Case: Covid-19 and Stock Market in Spain.” Science of the Total Environment 729: 138883.
Al-Qaness, Mohammed AA, Ahmed A Ewees, Hong Fan, and Mohamed Abd El Aziz. 2020. “Optimization Method for Forecasting Confirmed Cases of COVID-19 in China.” Journal of Clinical Medicine 9 (3): 674.
Alsuwat, Emad, Sabah Alzahrani, and Hatim Alsuwat. 2021. “Detecting COVID-19 Utilizing Probabilistic Graphical Models.” International Journal of Advanced Computer Science and Applications 12 (6).
Amon, Joseph J. 2020. “COVID-19 and Detention: Respecting Human Rights.” Health and Human Rights 22 (1): 367.
Arthur, David, and Sergei Vassilvitskii. 2006. “K-Means++: The Advantages of Careful Seeding.” Stanford.
“CDC Covid Data Tracker.” n.d. Centers for Disease Control and Prevention. https://covid.cdc.gov/covid-data-tracker/#datatracker-home.
Chimmula, Vinay Kumar Reddy, and Lei Zhang. 2020. “Time Series Forecasting of COVID-19 Transmission in Canada Using LSTM Networks.” Chaos, Solitons & Fractals 135: 109864.
Chintalapudi, Nalini, Gopi Battineni, and Francesco Amenta. 2020. “COVID-19 Virus Outbreak Forecasting of Registered and Recovered Cases After Sixty Day Lockdown in Italy: A Data Driven Model Approach.” Journal of Microbiology, Immunology and Infection 53 (3): 396–403.
“Covid-19 Case Surveillance Restricted Access Detailed Data.” 2021. Catalog. Publisher CDC. https://catalog.data.gov/dataset/covid-19-case-surveillance-restricted-access-detailed-data.
“Covid-19 Open Data - Google Health.” n.d. Google. https://health.google.com/covid-19/open-data/.
“Covid-19 Vaccines.” n.d. World Health Organization. https://www.who.int/emergencies/diseases/novel-coronavirus-2019/covid-19-vaccines.
Dabbura, Imad. 2022. “K-Means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks.” Medium. Towards Data Science. https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a.
Fong, Simon James, Gloria Li, Nilanjan Dey, Rubén González Crespo, and Enrique Herrera-Viedma. 2020. “Finding an Accurate Early Forecasting Model from Small Dataset: A Case of 2019-nCoV Novel Coronavirus Outbreak.” arXiv Preprint arXiv:2003.10776.
Irawati, Mesayu Elida, and Hasballah Zakaria. 2021. “Classification Model for Covid-19 Detection Through Recording of Cough Using XGBoost Classifier Algorithm.” In 2021 International Symposium on Electronics and Smart Devices (ISESD), 1–5. IEEE.
Kodinariya, Trupti M, and Prashant R Makwana. 2013. “Review on Determining Number of Cluster in k-Means Clustering.” International Journal 1 (6): 90–95.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2017. “ImageNet Classification with Deep Convolutional Neural Networks.” Communications of the ACM 60 (6): 84–90.
Painuli, Deepak, Divya Mishra, Suyash Bhardwaj, and Mayank Aggarwal. 2021. “Forecast and Prediction of COVID-19 Using Machine Learning.” In Data Science for COVID-19, 381–97. Elsevier.
Pearl, Judea. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Saadat, Saeida, Deepak Rawtani, and Chaudhery Mustansar Hussain. 2020. “Environmental Perspective of COVID-19.” Science of the Total Environment 728: 138870.
Tsagris, Michail. 2019. “Bayesian Network Learning with the PC Algorithm: An Improved and Correct Variation.” Applied Artificial Intelligence 33 (2): 101–23.
Tsamardinos, Ioannis, Laura E Brown, and Constantin F Aliferis. 2006. “The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm.” Machine Learning 65 (1): 31–78.
“Using Machine Learning for Time Series Forecasting Project.” n.d. CodeIT. https://codeit.us/blog/machine-learning-time-series-forecasting.
Wang, Leo, Haiying Shen, Kyle Enfield, and Karen Rheuban. 2021. “Covid-19 Infection Detection Using Machine Learning.” In 2021 IEEE International Conference on Big Data (Big Data), 4780–89. IEEE.
Yan, Bingjie, Jun Wang, Jieren Cheng, Yize Zhou, Yixian Zhang, Yifan Yang, Li Liu, Haojiang Zhao, Chunjuan Wang, and Boyi Liu. 2021. “Experiments of Federated Learning for COVID-19 Chest X-Ray Images.” In International Conference on Artificial Intelligence and Security, 41–53. Springer.