
I use the Books app on my phone to read. I highlight a lot of passages, and when I'm done reading, I move all the highlights to a PKM application. When I transfer a highlight, the Books app automatically attaches a citation to every note. For example:

“The NCI’s trials would be systematic: every trial would test a crucial piece of logic or hypothesis and produce yes and no answers. The trials would be sequential: the lessons of one trial would lead to the next and so forth—a relentless march of progress until leukemia had been cured. The trials would be objective, randomized if possible, with clear, unbiased criteria to assign patients and measure responses. ”

Excerpt from: The Emperor of All Maladies Siddhartha Mukherjee This material may be protected by copyright.“Like cancer cells, mycobacteria—the germs that cause tuberculosis—also became resistant to antibiotics if the drugs were used singly. Bacteria that survived a single-drug regimen divided, mutated, and acquired drug resistance, thus making that original drug useless. To thwart this resistance, doctors treating TB had used a blitzkrieg of antibiotics—two or three used together like a dense pharmaceutical blanket meant to smother all cell division and stave off bacterial resistance, thus extinguishing the infection as definitively as possible. But could two or three drugs be tested simultaneously against cancer—or would the toxicities be so forbidding that they would instantly kill patients? As Freireich, Frei, and Zubrod studied the growing list of antileukemia drugs, the notion of combining drugs emerged with growing clarity: toxicities notwithstanding, annihilating leukemia might involve using a combination of two or more drugs.”

Excerpt from: The Emperor of All Maladies Siddhartha Mukherjee This material may be protected by copyright.“The butcher shop”

And so on.

I want to remove these repeating lines from the big corpus of all the highlights using Python. Can someone help me do this?

I created a text file and tried to use the readline() method to loop through the entire file, but that didn't work. Even if it had worked, I don't know how to loop through the entire file, remove the specific repeating bits, and put everything back together with proper formatting.

fzzrxx

1 Answer


Welcome to StackOverflow,

This is an easy question. I strongly recommend searching around a bit more before posting (but don't worry, we have all been in similar situations before).

Try this quick solution; it should work for any txt file:

import os

os.chdir('/your/path/to/txt/file')

# The file that we want to modify:
file_txt = 'example.txt'

# First we read the full text
with open(file_txt, 'r') as f:
    full_text = f.readlines()
# The variable full_text is a list whose elements are the lines of the document

# Now we create a new document which will contain the desired text:
with open('changed_text.txt', 'w') as f:
    # Let's find every "bothering line" and blank it out:
    for line in full_text:
        if 'Excerpt from:' in line:
            # Replace the whole repeating line with an empty line
            # (this also drops any other text on that same line)
            line = '\n'
        # Write the line to the new document, whether or not it was modified
        f.write(line)
    # No explicit f.close() is needed: the with block closes the file

Hope this solves your specific problem!

Lino Orion
  • It worked, except for the last part, 'This material may be protected by copyright'. Matching on it deletes the entire line **including the highlight**, which I don't want. So I ran the text through another loop and replaced that phrase with an empty string: for i in range(len(lines)): string = lines[i] if 'This material may be protected by copyright.' in string: print(string.replace('This material may be protected by copyright.','')) Is there a better way I could have done it? Thanks for your response. – fzzrxx Mar 25 '23 at 07:32
  • Basically, you are printing your desired result to the console but not changing anything in your txt file. However, I don't understand your aim. What exactly do you need? To remove all lines that start with "Excerpt from: ..."? To replace the parts that contain "This material may be protected by copyright" with an empty string? Is this content always on the same lines as the highlights? – Lino Orion Mar 26 '23 at 13:41
  • Right, I actually printed it to the console just to copy the wall of text without the repetitive citations. – fzzrxx Apr 08 '23 at 08:53
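
For anyone with the same problem: in the sample above, the citation and the next highlight share a line, so dropping every line that contains "Excerpt from:" also drops a highlight. Below is a minimal sketch of one alternative that removes only the citation text with a regular expression. It assumes every citation starts with "Excerpt from:" and ends with "This material may be protected by copyright." on the same line, as in the sample. The names strip_citations and clean_file, and the file names in the usage note, are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: each citation starts with "Excerpt from:" and ends with
# "This material may be protected by copyright." on the same line.
CITATION = re.compile(
    r"Excerpt from:.*?This material may be protected by copyright\."
)


def strip_citations(text: str) -> str:
    """Remove each citation but keep any highlight sharing its line."""
    cleaned_lines = []
    for line in text.splitlines():
        # The non-greedy .*? stops at the first copyright notice, so only
        # the citation is cut and the rest of the line survives.
        cleaned_lines.append(CITATION.sub("", line).strip())
    cleaned = "\n".join(cleaned_lines)
    # Collapse runs of three or more newlines into a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip() + "\n"


def clean_file(src: str, dst: str) -> None:
    """Read src, strip the citations, and write the result to dst."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(strip_citations(text), encoding="utf-8")
```

Calling clean_file('example.txt', 'changed_text.txt') would write the cleaned highlights to a new file and leave the original untouched. The UTF-8 encoding is also an assumption; adjust it if the export uses something else.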