0

I want to extract the paragraphs that follow each "AB - " marker, which can appear around 9000 times in a text file.

Minified example :

AB - This is the part I want to match !
CD - This part is useless
AB - I can also match
texts on multiple 
lines !
EF - Did you get my problem ?
GH - Ok, i think that's
enough.

Expected output:

This is the part I want to match !

I can also match
texts on multiple 
lines !

Here is a screenshot of the real file, if you want to see what it really looks like.

Please help me delete the extra information, or guide me on how to extract only the abstracts without any other information.

totok
  • Hi, welcome to SO. Can you try to write a [minimal, reproducible example of your problem](https://stackoverflow.com/help/minimal-reproducible-example)? Refer to [this](https://stackoverflow.com/help/how-to-ask) guide on writing good questions, so people will answer you faster. – totok Jul 01 '20 at 10:01
  • What is your question exactly? What is the abstract in your file? – deadshot Jul 01 '20 at 10:06
  • @komatiraju032 I am talking about abstracts of journal articles. The dataset is a text file that I downloaded from PubMed. In short, my question is how I can delete a specific portion of text between two markers to simplify my dataset, like CN - Gemelli Against COVID-19 Post-Acute Care Study Group LA - eng PT - Journal Article DEP - 20200611 TA - Aging Clin Exp Res – Anam Shahid Jul 01 '20 at 10:59
  • @totok Thanks for your response. I will try to edit my question. – Anam Shahid Jul 01 '20 at 11:05
  • Can you post the expected output? – deadshot Jul 01 '20 at 11:07
  • @komatiraju032 Please read my problem again. I edited my question and post. Thank you. I only want the abstract (AB) section. – Anam Shahid Jul 01 '20 at 11:17
  • Can you post the original file? – deadshot Jul 01 '20 at 11:26
  • @komatiraju032 I can't post the original file because it contains more than 9000 entries. I posted one abstract together with the extra information that I want to delete from the file. In short, I only want to extract the abstracts from the whole file. Please check my post again. Thank you. – Anam Shahid Jul 01 '20 at 11:31
  • Just share the data for 3 or 4 articles so it will be easy to work with. You posted half the data; if I give a solution based on this, you would need to process each article twice. – deadshot Jul 01 '20 at 11:33
  • @komatiraju032 OK, let me edit my post again. – Anam Shahid Jul 01 '20 at 11:49
  • @komatiraju032 Please check my post again. I posted data for 4 articles. I only want to extract the abstracts (AB), which are in paragraph form, from the whole file. – Anam Shahid Jul 01 '20 at 11:58

2 Answers

0

If your file isn't too big to be read all at once, you can use this regex to match what you need, by taking the first capturing group of every match:

AB\s+-\s((.*\s*)*?)\K([A-Z]{2}\s+-\s)

Test it here

Read more about regex in python here.

Learn regex here.

EDIT: I managed to remove the extra text at the end of each match, but I don't think I did it the right way:

AB\s+-\s+((.*\s*)*?)(?:[A-Z]{2}\s+-\s)\K

If someone can improve this in the comments, that would be cool!

Test it here.
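
Note that `\K` is not supported by Python's built-in `re` module, so the patterns above will not compile there. A minimal sketch of a Python-compatible variant, assuming every field starts at the beginning of a line with a 2 to 4 letter uppercase tag followed by a dash (the names `ABSTRACT_RE` and `extract_abstracts` are made up for this illustration), uses a lookahead instead, so the next tag is checked but not consumed:

```python
import re

# Assumption: every field starts a line with a 2-4 letter uppercase tag,
# optional padding, then "- " (for example "AB  - " or "CD - ").
ABSTRACT_RE = re.compile(
    r"^AB\s+-\s+(.*?)(?=^[A-Z]{2,4}\s*-\s|\Z)",
    re.MULTILINE | re.DOTALL,
)


def extract_abstracts(text):
    """Return the text of every AB field, without the tag itself."""
    return [match.group(1).strip() for match in ABSTRACT_RE.finditer(text)]
```

The lazy `(.*?)` stops at the first line that looks like a new tag, or at the end of the text if the abstract is the last field. Calling `extract_abstracts(open("pub_file.txt", encoding="utf-8").read())` would return a list with one string per abstract; the file name and encoding here are assumptions.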

totok
  • Thank you so much for your response. I just tried your regex pattern, but I got this error: "error: bad escape \K" – Anam Shahid Jul 01 '20 at 19:01
0

Assuming that the layout of pub_file.txt is predictable and a CI line always follows each AB block:

# Read the whole file once; utf-8 is assumed because the abstracts
# contain characters such as "±".
with open("pub_file.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

# Collect (start, end) line indices: each abstract starts on its "AB"
# line and ends just before the following "CI" line.
spans = []
start = None
for line_number, text in enumerate(lines):
    if text.startswith("AB"):
        start = line_number
    elif text.startswith("CI") and start is not None:
        spans.append((start, line_number))
        print("ab-->", start + 1, "ci-->", line_number + 1)
        start = None

# Write every abstract (AB line included, CI line excluded) to the output file.
with open("OUTPUT.TXT", "w", encoding="utf-8") as out:
    for first, second in spans:
        out.writelines(lines[first:second])

OUTPUT.TXT file:

AB  - OBJECTIVE: The clinical manifestations of COVID-19 run from asymptomatic disease to 
      severe acute respiratory syndrome. Older age and comorbidities are associated to 
      more severe disease. A role of obesity is suspected. METHODS: We enrolled patients 
      hospitalized in the medical COVID-19 ward with SARS-CoV-2 related pneumonia. Primary 
      outcome of the study was to assess the relationship between the severity of COVID-19 
      and obesity classes according to BMI. RESULTS: 92 patients (61.9% males; age 
      70.5±13.3 years) were enrolled. Patients with overweight and obesity were younger 
      than normal-weight patients (68.0±12.6 and 67.0±12.6 years vs. 76.1±13.0 years, 
      p<0.01). A higher need for assisted ventilation beyond pure oxygen support (Invasive 
      Mechanical Ventilation or Non-Invasive Ventilation) and a higher admission to 
      intensive or semi-intensive care units was observed in patients with overweight and 
      obesity (p<0.01 and p < 0.05, respectively) even after adjusting for sex, age and 
      comorbidities (p<0.05 and p<0.001, respectively), or when patients with dementia or 
      advanced cancer were removed from the analysis (p<0.05). CONCLUSION: Patients with 
      overweight and obesity admitted in a medical ward for SARS-CoV-2 related pneumonia, 
      despite their younger age, required more frequently assisted ventilation and access 
      to intensive or semi-intensive care units than normal weight patients.
AB  - The Coronavirus Disease 2019 (COVID-19) pandemic of severe acute respiratory 
      syndrome coronavirus 2 (SARS-CoV-2) infection is causing considerable morbidity and 
      mortality worldwide. Multiple reports have suggested that patients with heart 
      failure (HF) are at a higher risk of severe disease and mortality with COVID-19. 
      Moreover, evaluating and treating HF patients with comorbid COVID-19 represents a 
      formidable clinical challenge as symptoms of both conditions may overlap and they 
      may potentiate each other. Limited data exist regarding comprehensive management of 
      HF patients with concomitant COVID-19. Since these issues pose serious new 
      challenges for clinicians worldwide, HF specialists must develop a structured 
      approach to the care of patients with COVID-19 and be included early in the care of 
      these patients. Therefore, the Heart Failure Association of the European Society of 
      Cardiology and Chinese Heart Failure Association & National Heart Failure Committee 
      conducted web-based meetings to discuss these unique clinical challenges and reach a 
      consensus opinion to help providers worldwide deliver better patient care. The main 
      objective of this position paper is to outline the management of HF patients with 
      concomitant COVID-19 based on the available data and personal experiences of 
      physicians from Asia, Europe and United States. This article is protected by 
      copyright. All rights reserved.
AB  - The coronavirus 2019 (COVID-19) pandemic has led to laws and policies including 
      national school closures, lockdown or shelter in place laws, and social distancing 
      recommendations that may translate to higher overall screen time among children and 
      adolescents for the duration of these laws and policies. These policies may need to 
      be periodically re-instated to control future COVID-19 recurrences or other national 
      emergencies. Excessive screen time is associated with cardiovascular disease risk 
      factors such as obesity, high blood pressure, and insulin resistance because it 
      increases sedentary time and is associated with snacking.
AB  - Perhaps for the first time in history, a single statistical measure is now dictating 
      the entirety of UK government policy. The 'basic reproduction number', R0 value for 
      Covid-19 is more directly determining economic and social policy than has ever the 
      inflation rate, interest rate, or exchange rate. It is encouraging to see political 
      policy for once 'rational' but disappointing it took a pandemic to make it so. 
      However, is R0 an appropriate and significant measure? Like many 
      mathematics/statistical parameters, R0 is relatively easy to explain, more 
      complicated to understand (even graphically), and very difficult to calculate, or 
      use for modelling. Given its significance for all our lives, it is important to 
      understand a little of its background. This article seeks to explain the issues in a 
      non-technical way, relegating all equations (used sparingly) to appendices.
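
If the field that follows the abstract is not always CI, one alternative is to ignore the next tag entirely and rely on the indentation instead. The sketch below assumes that continuation lines of a field start with whitespace, as in the output above, and that any line starting with another tag ends the abstract; `AB_TAG_RE` and `iter_abstracts` are hypothetical names used only for this illustration.

```python
import re

# Assumption: an abstract starts on a line beginning with "AB", optional
# padding and a dash; its continuation lines are indented.
AB_TAG_RE = re.compile(r"AB\s*-\s?")


def iter_abstracts(lines):
    """Yield each abstract as one string, with tag and indentation removed."""
    current = None  # parts of the abstract being collected, or None
    for line in lines:
        tag = AB_TAG_RE.match(line)
        if tag:
            if current is not None:
                yield " ".join(current)
            current = [line[tag.end():].strip()]
        elif current is not None and line[:1].isspace() and line.strip():
            # Indented, non-empty line: still part of the current abstract.
            current.append(line.strip())
        elif current is not None:
            # Any other tag (CI, CN, FAU, ...) or a blank line ends it.
            yield " ".join(current)
            current = None
    if current is not None:
        yield " ".join(current)
```

Because a file object iterates line by line, `iter_abstracts(f)` can be used directly inside `with open("pub_file.txt", encoding="utf-8") as f:` without loading the whole file, and each yielded abstract can then be written to OUTPUT.TXT on its own line.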
lww
  • Thank you so much for your response. What should we do if the abstracts are followed by different tags? For example, in some places in the file AB is followed by "CN", and in others AB is followed by "FAU". What can we do in that case? – Anam Shahid Jul 02 '20 at 05:54