During the COVID pandemic, it was speculated that patients with the virus who smoked or had a history of smoking might have a lower likelihood of disease exacerbation or death. To assess whether there is an association between smoking and risk of in-hospital mortality, SAVANA's big data and Natural Language Processing (NLP) technology was used.
Method
A retrospective, observational, non-interventional cohort study was conducted based on real-life data extracted from medical records throughout Castilla La Mancha using Natural Language Processing and Artificial Intelligence techniques developed by SAVANA. The study covered the entire population of this region with Electronic Medical Records in SESCAM presenting with a diagnosis of COVID from March 1, 2020 to February 28, 2021.
Results
Smokers had a significantly higher percentage of cardiovascular risk factors (hypertension, dyslipidemia and diabetes), COPD, asthma, ILD, IHD, CVD, PE, cancer in general and lung cancer in particular, bronchiectasis, heart failure and a history of pneumonia (p < 0.0001). Former smokers, current smokers and non-smokers differed significantly in age. In-hospital deaths were most frequent among ex-smokers, followed by smokers and then non-smokers (p < 0.0001).
Conclusion
There is an increased risk of dying in hospital in SARS-COV2-infected patients who are active smokers or have smoked in the past.
The SARS-COV2 virus is a coronavirus that began to cause pneumonia in Wuhan (China) in December 2019.1 It is diagnosed mainly by clinical evaluation and detection of the virus by polymerase chain reaction (PCR) in different secretions.2 On January 30, 2020, the World Health Organization (WHO) declared COVID (SARS-COV2 disease) a public health emergency of international concern, declaring it an epidemic.3 Within a few weeks it became a pandemic,4 in which Italy was overwhelmed by the virus,5 the closest example of what would happen in Spain. Subsequently, the virus spread rapidly throughout Spanish territory.6 It affected all autonomous communities, although the infection rate and mortality varied by region. Castilla-La Mancha, in particular, was one of the regions with the highest mortality rate and highest incidence of cases per 100,000 inhabitants,7 which meant a significant workload for health professionals in this autonomous community (comunidad autónoma), who had to dedicate significant resources and effort solely to the care of coronavirus patients. Furthermore, the scarcity of protective measures8 and the changes in protocols, even though scientific societies spoke out early on the issue,9,10 contributed to the high rate of infections in Spain in general and in Castilla-La Mancha in particular.11
During this situation, different risk factors were identified: age, male sex and high blood pressure among them.13 The relationship between the disease and smoking was uncertain, and various hypotheses were raised about the possible therapeutic efficacy of nicotine or its potential defensive role against COVID-19.12 Different meta-analyses did not completely clarify whether tobacco acts as an aggravating or protective factor against this disease.13
In this study, SAVANA Manager® v3.0 is used. It is a clinical platform developed by the medical company SAVANA that allows the analysis of the information included in the free text of the Electronic Health Records (EHRs).
Using this technology, already applied in SESCAM to investigate how COVID-19 affects patients with asthma,14 this study aims to clarify how smoking affects patients with SARS-CoV-2 infection in terms of ICU admission and in-hospital mortality, determining whether it can be considered a protective or aggravating factor.
Methods
Study design
A retrospective, observational and non-interventional study was carried out using SAVANA Manager® v3.0 to capture the information from the free text contained in the EHRs. The EHR data of all patients over 18 years of age diagnosed with COVID between March 1, 2020 and February 28, 2021 from Castilla-La Mancha registered in SESCAM were included in this study.
Ethical considerations
The study was carried out in accordance with legal and regulatory requirements, as well as with scientific purpose, value and rigor. It also followed generally accepted research practices described in the Good Clinical Practice Guideline, the Declaration of Helsinki in its latest edition and Good Practices of Pharmacoepidemiology.
The study was approved by the Research Ethics Committee of the Albacete University Hospital Complex.
Analysis of data
Data were collected from the discharge reports of outpatient consultations, hospitalization and emergencies, from pharmacy reports from the different SESCAM hospitals, and from the primary care EHRs.
SAVANA allows the extraction, integration and exploitation of unstructured clinical data that exists within the EHR.15 To this end, it has developed EHRead®, a technology that applies Natural Language Processing (NLP), machine learning and deep learning to the free text of EHRs, extracting the clinical variables of interest.
Variables
The computation of concepts or terminology considered by SAVANA is based on SNOMED CT that contains codes, concepts, synonyms and definitions16 and is expanded with terms generated by SAVANA.
To meet the objectives of the study, the following structured variables were extracted: gender, age, ICU admission and days of hospitalization. The following free-text variables included in the terminology were also extracted: COVID infection, smoking status (non-smoker, smoker and ex-smoker), hospital death, high blood pressure (HTN), dyslipidemia, diabetes, chronic obstructive pulmonary disease (COPD), asthma, diffuse interstitial lung disease (ILD), ischemic heart disease (IHD), cerebrovascular disease (CVD), pulmonary embolism (PE), cancer, lung cancer, pneumonia, bronchiectasis, heart failure and the use of corticosteroids (Triamcinolone, Dexamethasone, Prednisone, Prednisolone, Hydrocortisone, Paramethasone acetate, Methylprednisolone, Betamethasone, Fludrocortisone, Deflazacort).
Data management
The entire process begins with the data acquisition phase. IT services download and pseudonymize the data to send it to Savana through a secure file transfer protocol (SFTP).
Once the extraction is carried out, it is processed with EHRead® technology to identify and extract the clinical variables of interest from the free text, thus generating a structured database with all the information added per patient and per episode. If the condition is met in one of the documents for each patient, that condition is taken as affirmed. The diagnoses and personal history have also been grouped in such a way that the patient is considered to meet the condition if the entity is detected as a diagnosis or as a personal history. Those who were in the database as current and former smokers have been identified as smokers.
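The aggregation rule described above (a condition counts as affirmed for a patient if at least one of that patient's documents affirms it) can be illustrated with a minimal Python sketch. The record layout, the function name `aggregate_by_patient` and the sample data are hypothetical and chosen only for illustration; they are not taken from EHRead® or Savana Manager.

```python
from collections import defaultdict

def aggregate_by_patient(detections):
    """Collapse per-document detections into one set of affirmed conditions per patient.

    `detections` is an iterable of (patient_id, document_id, condition, detected)
    tuples. This layout is an assumption made for the example.
    """
    affirmed = defaultdict(set)
    for patient_id, _document_id, condition, detected in detections:
        conditions = affirmed[patient_id]  # ensures every patient appears in the result
        if detected:
            # One affirming document is enough to affirm the condition.
            conditions.add(condition)
    return dict(affirmed)

# Hypothetical detections: COPD is affirmed in only one of p1's two documents.
detections = [
    ("p1", "d1", "COPD", False),
    ("p1", "d2", "COPD", True),
    ("p2", "d3", "asthma", False),
]
patient_conditions = aggregate_by_patient(detections)
```

In this sketch, patient `p1` ends up with COPD affirmed even though one document did not mention it, while `p2` has no affirmed conditions.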
Savana Manager does not use individual patient EHRs, but rather aggregated clinical information. This technology allows a complete dissociation between the data obtained for the present study and the patient's personal data, due to the pseudonymization of the data and subsequent obtaining of the information in an aggregated form.
The performance of the EHRead technology in identifying the study variables is expressed by the precision (P), recall (R) and F-score. For this study, the metrics obtained for the same models applied to the same data in the context of other studies were reused. These studies come from the same set of hospitals and focus on patients with COPD,17 asthma18 and COVID.14
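For readers unfamiliar with these metrics, the sketch below shows how precision, recall and an F-score are usually computed from counts of true positives, false positives and false negatives. It assumes the balanced F1 variant (the harmonic mean of P and R); the text does not state which F-score was used. The counts are invented for illustration and are not study data.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from entity-detection counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # F1 is the harmonic mean of precision and recall (assumed variant).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical evaluation: 90 correct detections, 10 spurious, 30 missed.
p, r, f1 = precision_recall_f1(true_pos=90, false_pos=10, false_neg=30)
```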
Statistical analysis
A descriptive analysis of the variables was carried out. For qualitative variables, relative and absolute frequencies were used, while for quantitative variables, statistical measures such as the mean and standard deviation were used. In order to examine the possible relationship between the variables, the Chi-square test was used for qualitative variables and Student's t-test for quantitative variables. A significance level of 0.05 was established to determine if the results were statistically significant.
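As an illustration of the Chi-square test on a qualitative variable, the following sketch computes Pearson's statistic for a 2×2 table using only the Python standard library. It is not the authors' analysis code, and it assumes no continuity correction. The example counts are the in-hospital deaths and group sizes for smokers and non-smokers reported in Table 1; the function name is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom, the upper-tail
    probability of the chi-square distribution equals erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Deaths vs survivors, smokers (589 of 17,646) and non-smokers (5243 of 252,388).
smoker_deaths, smoker_total = 589, 17646
nonsmoker_deaths, nonsmoker_total = 5243, 252388
statistic, p_value = chi_square_2x2(
    smoker_deaths, smoker_total - smoker_deaths,
    nonsmoker_deaths, nonsmoker_total - nonsmoker_deaths,
)
significant = p_value < 0.05  # significance level stated in the text
```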
Results
A total of 293,126 patients with SARS-CoV-2 virus infection were identified among all patients in the population of Castilla-La Mancha with an EHR available during the study period (March 1, 2020 to February 28, 2021). In other studies that used the same entity detection models, the linguistic evaluation of the term “SARS-CoV-2 virus infection” obtained a precision, recall and F-score of 0.99, 0.75 and 0.93 respectively, indicating the correct identification of these patients.14 Patients were identified exclusively through confirmed diagnoses recorded in the clinical history, without resorting to medical inference to detect cases of infection. This implies that a patient who presented symptoms of the disease or a positive PCR result for SARS-CoV-2, but had no confirmed diagnosis recorded, was not included in our population. This limitation is accepted, with the possibility that some cases of the disease were missed.
Table 1 presents the main demographic and clinical characteristics of these 293,126 identified patients. The mean age of the study population was 56.8 years (SD 26.7): 55.2 years (SD 26.4) for non-smokers, 56.8 (SD 15.3) for smokers and 65.5 (SD 15.3), the highest, for ex-smokers. All of these age differences were significant. Of the patients with SARS-CoV-2 in SESCAM, 47% were men.
Table 1. Clinical and demographic characteristics of patients diagnosed with COVID.
Variable | Total | Non-smoker | Smoker | Former smoker | p (non-smoker vs smoker / smoker vs former smoker) |
---|---|---|---|---|---|
Patients (n) | 293,126 | 252,388 | 17,646 | 23,092 | |
Age (years) | 56.8 ± 26.7 | 55.2 ± 26.4 | 56.8 ± 15.3 | 65.5 ± 15.3 | p < 0.0001/p < 0.0001 |
Men (n) | 136,573 | 643,034 | 16,567 | 48,832 | |
Hospital indicators | |||||
Deaths in hospital (n) | 7278 | 5243 | 589 | 1446 | p < 0.00001/ p < 0.00001 |
Patients admitted to the ICU (n) | 148 | 99 | 13 | 36 | p = 0.029717/p = 0.017661 |
Average hospitalization (days) | 4.29 | 4.34 | 4.02 | 4.32 | p < 0.0001/p < 0.0001 |
Comorbidities | |||||
HTN (n) | 75,444 | 51,601 | 8003 | 15,840 | p < 0.00001/ p < 0.00001 |
Dyslipidemia (n) | 61,846 | 41,093 | 7207 | 13,546 | p < 0.00001/ p < 0.00001 |
Diabetes (n) | 53,113 | 34,687 | 6494 | 11,932 | p < 0.00001/ p < 0.00001 |
COPD (n) | 8539 | 2285 | 1881 | 4373 | p < 0.00001/ p < 0.00001 |
ASTHMA (n) | 17,103 | 13,290 | 1829 | 1984 | p < 0.00001/ p < 0.00001 |
ILD (n) | 542 | 259 | 82 | 201 | p < 0.00001/ p < 0.00001 |
Ischemic heart disease (n) | 10,410 | 5541 | 1010 | 3859 | p < 0.00001/ p < 0.00001 |
CVD (n) | 7148 | 4765 | 668 | 1715 | p < 0.00001/ p < 0.00001 |
PE (n) | 4121 | 2648 | 460 | 1013 | p < 0.00001/ p < 0.00001 |
Cancer (n) | 5958 | 3329 | 946 | 1683 | p < 0.00001/ p < 0.00001 |
Lung cancer (n) | 904 | 299 | 186 | 419 | p < 0.00001/ p < 0.00001 |
Pneumonia (n) | 29,074 | 19,956 | 2543 | 6575 | p < 0.00001/ p < 0.00001 |
Bronchiectasis (n) | 1814 | 934 | 253 | 627 | p < 0.00001/ p < 0.00001 |
Heart failure (n) | 12,650 | 8408 | 1180 | 3062 | p < 0.00001/ p < 0.00001 |
Corticosteroids (n) | 31,355 | 24,366 | 3081 | 3908 | p < 0.00001/ p = 0.154707 |
The tests used were Student's t in the case of quantitative variables and Chi-square in qualitative variables.
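The age comparison can be approximated from the summary statistics in Table 1 alone. The sketch below computes a pooled-variance Student's t statistic from means, standard deviations and group sizes, and approximates the two-sided p-value with the normal distribution, which is a reasonable assumption only because the degrees of freedom here are very large. It is an illustration, not the authors' code. Given the very different standard deviations between groups, Welch's unequal-variance test would be a common alternative.

```python
import math

def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic (pooled variance) from summary statistics.

    Returns (t, degrees_of_freedom, approximate_two_sided_p). The p-value
    uses a normal approximation, assumed adequate for very large samples.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / std_error
    p_approx = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p_approx

# Age of smokers vs non-smokers, using the values reported in Table 1.
t, df, p_approx = pooled_t_from_summary(56.8, 15.3, 17646, 55.2, 26.4, 252388)
```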
Smokers had a significantly higher percentage (p < 0.0001) of cardiovascular risk factors (high blood pressure, dyslipidemia and diabetes), COPD, asthma, ILD, IHD, CVD, PE, cancer in general and lung cancer in particular, bronchiectasis, heart failure and history of pneumonia (Tables 1 and 2).
Table 2. Main outcomes and comorbidities, as absolute counts and percentages, overall and by gender in patients diagnosed with COVID.
Variable | All patients (n = 293,126): total | All patients: non-smoker | All patients: smoker | All patients: former smoker | Women (53.4%): total | Women: non-smoker | Women: smoker | Women: former smoker | Men (46.6%): total | Men: non-smoker | Men: smoker | Men: former smoker |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Death | 7278 (2.48%) | 5243 (2.08%) | 589 (3.34%) | 1446 (6.26%) | 3575 (2.28%) | 3058 (2.22%) | 337 (2.68%) | 180 (3.05%) | 1718 (2.74%) | 2207 (1.93%) | 256 (5.02%) | 1275 (7.39%) |
ICU | 148 (0.05%) | 99 (0.04%) | 13 (0.07%) | 36 (0.02%) | 50 (0.03%) | 42 (0.03%) | 4 (0.03%) | 4 (0.07%) | 89 (0.07%) | 51 (0.04%) | 7 (0.14%) | 31 (0.18%) |
HBP | 75444 (25.74%) | 51601 (20.45%) | 8003 (45.35%) | 15840 (68.60%) | 38613 (24.67%) | 30664 (22.22%) | 5005 (39.77%) | 2944 (49.82%) | 36823 (26.96%) | 20929 (18.32%) | 2998 (58.74%) | 12896 (74.77%) |
Dyslipidemia | 61846 (21.10%) | 41093 (16.28%) | 7207 (40.84%) | 13546 (58.66%) | 31719 (20.27%) | 26090 (18.91%) | 4542 (36.09%) | 1087 (18.40%) | 30125 (22.06%) | 24688 (21.61%) | 2665 (52.21%) | 2772 (16.07%) |
Diabetes | 53113 (18.12%) | 34687 (13.74%) | 6494 (36.80%) | 11932 (51.67%) | 26175 (16.73%) | 19828 (14.37%) | 4019 (31.93%) | 2328 (39.40%) | 27098 (19.84%) | 14970 (13.11%) | 2491 (48.80%) | 9637 (55.88%) |
COPD | 8539 (2.91%) | 2285 (0.91%) | 1881 (10.66%) | 4373 (18.94%) | 1647 (1.05%) | 512 (0.37%) | 706 (5.61%) | 429 (7.26%) | 6917 (5.06%) | 1780 (1.56%) | 1180 (23.12%) | 3857 (22.94%) |
ASTHMA | 17103 (5.83%) | 13290 (5.27%) | 1829 (10.36%) | 1984 (8.59%) | 10541 (6.74%) | 8070 (5.85%) | 1534 (12.19%) | 937 (15.86%) | 6613 (4.84%) | 5263 (4.61%) | 298 (5.84%) | 1052 (6.10%) |
ILD | 542 (0.18%) | 259 (0.10%) | 82 (0.46%) | 201 (0.87%) | 241 (0.15%) | 163 (0.12%) | 42 (0.33%) | 36 (0.61%) | 303 (0.22%) | 96 (0.08%) | 40 (0.78%) | 167 (0.97%) |
Ischemic HD | 10410 (3.55%) | 5541 (2.20%) | 1010 (5.72%) | 3859 (16.71%) | 3498 (2.24%) | 2682 (1.94%) | 425 (3.38%) | 391 (6.62%) | 6955 (5.09%) | 2880 (2.52%) | 588 (11.52%) | 3487 (20.22%) |
CVD | 7148 (2.44%) | 4765 (1.89%) | 668 (3.79%) | 1715 (7.43%) | 3466 (2.21%) | 2873 (2.08%) | 358 (2.84%) | 235 (3.98%) | 3706 (2.71%) | 1905 (1.67%) | 316 (6.19%) | 1485 (8.61%) |
PE | 4121 (1.41%) | 2648 (1.05%) | 460 (2.61%) | 1013 (4.39%) | 2114 (1.35%) | 1659 (1.20%) | 272 (2.16%) | 183 (3.10%) | 2018 (1.48%) | 994 (0.87%) | 190 (3.72%) | 834 (4.84%) |
Cancer | 5958 (2.03%) | 3329 (1.32%) | 946 (5.36%) | 1683 (7.23%) | 2879 (1.84%) | 1987 (1.44%) | 557 (4.43%) | 335 (5.67%) | 3089 (2.26%) | 1344 (1.18%) | 393 (7.70%) | 1352 (7.84%) |
Lung cancer | 904 (0.31%) | 299 (0.12%) | 186 (1.05%) | 419 (1.81%) | 260 (0.17%) | 145 (0.11%) | 69 (0.55%) | 46 (0.78%) | 646 (0.47%) | 155 (0.14%) | 117 (2.29%) | 374 (2.17%) |
Pneumonia | 29074 (9.92%) | 19956 (7.91%) | 2543 (14.41%) | 6575 (28.47%) | 13005 (8.31%) | 10451 (7.57%) | 1434 (11.39%) | 1120 (18.95%) | 16181 (11.85%) | 9585 (8.39%) | 1119 (21.92%) | 5477 (31.76%) |
Bronchiectasis | 1814 (0.62%) | 934 (0.37%) | 253 (1.43%) | 627 (2.72%) | 813 (0.52%) | 586 (0.42%) | 123 (0.98%) | 104 (1.76%) | 1007 (0.74%) | 352 (0.31%) | 130 (2.55%) | 525 (3.04%) |
Heart failure | 12650 (4.32%) | 8408 (3.33%) | 1180 (6.69%) | 3062 (13.26%) | 6629 (4.24%) | 5568 (4.03%) | 667 (5.30%) | 394 (6.67%) | 6065 (4.44%) | 2871 (2.51%) | 517 (10.13%) | 2677 (15.52%) |
Corticosteroids (n) | 31355 (10.70%) | 24366 (9.65%) | 3081 (17.46%) | 3908 (16.92%) | 19573 (12.51%) | 15399 (11.16%) | 2568 (20.40%) | 1606 (27.18%) | 11782 (8.63%) | 8967 (7.85%) | 513 (10.05%) | 2302 (13.35%) |
HBP: high blood pressure; PE: pulmonary embolism.
The percentage of patients using corticosteroids was higher in smokers and ex-smokers than in non-smokers (p < 0.0001), with no significant differences between smokers and ex-smokers (p = 0.15).
In all subgroups, the average hospital stay was 4 days. Regarding ICU admissions, the percentage of smokers admitted was higher than non-smokers (p < 0.03), being lower in ex-smokers compared to smokers (p < 0.02).
Hospital death (Fig. 1) was more frequent in smokers than in non-smokers, with ex-smokers having the highest percentage (p < 0.0001), although these results may be due to the older age of this last subgroup.
In the case of men (Table 2) the average age was 57 years, that of non-smokers 52, smokers 60 and ex-smokers 67 (these differences were significant). As in the overall study population, a higher percentage of all comorbidities tested was observed in smokers compared to non-smokers, as well as in ICU admissions and in-hospital deaths. Among smokers and ex-smokers, no significant change was observed in the percentage of patients with COPD, asthma, ILD, cancer and bronchiectasis, or in the percentage of patients admitted to the ICU, although there was a significant change in terms of in-hospital mortality and the rest of comorbidities not mentioned.
In the female cohort (Table 2), the mean age was 56 years: 57 years for non-smokers, 51 for smokers and 55 for ex-smokers (the age differences were significant). Comorbidities and deaths were more frequent in those with current or past tobacco use than in those without. All comorbidities except lung cancer also differed significantly between those who quit smoking and those who continued. In-hospital deaths and ICU admissions did not differ significantly between former and current smokers.
Discussion
This study is of particular importance because the entire population of Castilla-La Mancha has been used, instead of a sample of it. Thus, it has been possible to verify the different percentages in the populations of non-smokers, smokers and ex-smokers among patients diagnosed with COVID in the SESCAM. The study shows that smokers have a higher proportion of comorbidities than non-smokers and that ex-smokers have an even higher proportion.
Regarding in-hospital deaths, they were higher in smokers than in non-smokers (p < 0.0001). These deaths, in addition to tobacco itself, could be mediated by the increase in comorbidities. Age is not considered as an influential factor, since this was practically the same between smokers and non-smokers (56.83 ± 15.3 vs 55.2 ± 26.4). In the case of increased mortality in former smokers, increasing age (65.52 ± 15) could have more weight as a risk factor for most comorbidities and mortality. This effect was already seen in a meta-analysis published in 2022, where an increase in mortality was observed in smokers and ex-smokers.19 The relative increase in mortality in ex-smokers compared to others can be explained by this increase in age.
Regarding age, taking into account the study on women, it can be seen that smokers and ex-smokers have more mortality and comorbidities than non-smokers, despite smokers being 6 years younger and ex-smokers 2 years younger. It reinforces the idea of the relationship between smoking and comorbidities as has already been observed in different studies.20–22
This study confirms the aspects observed in previous reviews, where current and past smoking was related to the increase in ICU admissions and deaths.23,24 It is worth highlighting in Spain the SEMI-COVID study,25 where more than 14,000 patients who were hospitalized during the first months of the pandemic were recorded, and it was seen, as in this study, that hospital mortality is associated with the fact of smoking or having smoked. Furthermore, smoking or having smoked was observed to be an independent factor for a worse prognosis in these patients.
This work is consistent with different meta-analyses, as shown above, but also with case-control studies such as the one from Iran26 and with English cohorts.27 It is the study that has taken into account the largest sample of patients infected with the SARS-COV2 virus, with 293,126 patients.
The results obtained do not support the controversial hypothesis that began at the beginning of the pandemic, in which it was speculated that tobacco could be a protective factor against COVID infection.28,29
Other important consistencies are observed with current knowledge of medicine and with previous literature. Since smoking is the main risk factor for developing COPD,30 it would be expected that the percentage of non-smokers who develop this disease would be minimal. The data shows that it is less than 1% of the studied population (0.91%), while in smokers and ex-smokers it exceeds 10% (10.66% and 18.94% respectively). Something similar happens with lung cancer, a pathology in which smoking is a very important risk factor,31 with the proportion being 0.12% in non-smokers.
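The percentages quoted above follow directly from the counts in Table 1. The short sketch below recomputes them as cases divided by group size. The dictionary layout and function name are hypothetical; the numbers are those reported in the tables.

```python
# Group sizes and case counts as reported in Table 1.
group_sizes = {"non-smoker": 252388, "smoker": 17646, "former smoker": 23092}
copd_cases = {"non-smoker": 2285, "smoker": 1881, "former smoker": 4373}
lung_cancer_cases = {"non-smoker": 299, "smoker": 186, "former smoker": 419}

def prevalence_percent(cases, sizes):
    """Return the percentage of each group with the condition, rounded to 2 decimals."""
    return {group: round(100 * cases[group] / sizes[group], 2) for group in sizes}

copd_percent = prevalence_percent(copd_cases, group_sizes)
lung_cancer_percent = prevalence_percent(lung_cancer_cases, group_sizes)
```

The resulting values match the percentages given in the text and in Table 2 for COPD and lung cancer.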
It must be considered that this is a retrospective and observational study. However, the main strength of this data-based methodology is that there is no bias in the selection of the population at the hospital level, unlike traditional observational studies. The application of NLP to medical records allows the analysis of the complete clinical course of the patients at the participating center, so the number of patients is the total number of existing patients, including those who have died or are not followed up. The nature of the information is not influenced by the study or the observer, and the number of variables is as large as the total amount of clinical information collected throughout previous contacts with the center.
Despite being a novel tool, its practical usefulness has already been seen in this type of study, such as in the analysis that was carried out on patients with COPD in Castilla La Mancha and which concluded that there were important deficiencies in both diagnosis and in the treatment of this disease.32 Therefore, this technology has important applications for diagnosis and prognosis.33,34
Regarding limitations, since this is a study based on Big Data, the potential number of variables to include is limited exclusively to the information contained in the EHR. Additionally, the lack of standardization in EHRs regarding the type of data collected, the use of standard medical terminology versus one's own, and the omission of information or misuse of sections in EHRs are potential limitations. These limitations are compensated by the large number of patients included and the enormous amount of data handled by this technology.
In conclusion, there is a higher risk of in-hospital mortality or ICU admission in patients infected with SARS-COV2 who are active smokers or have smoked in the past. Furthermore, these patients have an increase in important comorbidities compared to non-smokers. These relationships probably cannot be fully explained by the age difference.
Funding
This research did not receive any funding from public, commercial or non-profit sector entities.
Conflict of interests
The authors declare that they have no competing interests.
We’d like to thank Savana for their collaboration in the preparation of this article.