
Edit 4:

Simpler example of what I want to do:

I have a list like this:

sentences = ['Hello, how are','how are you','you doing?']

And I want to turn it into a string like this:

sentence = 'Hello, how are you doing?'

Any help is appreciated!

Original post:

I'm trying to get highlighted text out of a .pdf file and put it inside a .docx file with that same name.

Here's the code for it:

from typing import List, Tuple

import fitz  # install with 'pip install pymupdf'
import os
from docx import Document


def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
    points = annot.vertices
    quad_count = int(len(points) / 4)
    sentences = []
    for i in range(quad_count):
        # where the highlighted part is
        r = fitz.Quad(points[i * 4: i * 4 + 4]).rect

        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences.append(" ".join(w[4] for w in words))
    sentence = " ".join(sentences)
    return sentence


def handle_page(page):
    wordlist = page.getText("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    separator = "PÁGINA NÚMERO " + str(page.number) + ": "

    highlights = []
    annot = page.firstAnnot
    while annot:
        if annot.type[0] == 8:
            highlights.append(separator)
            highlights.append(_parse_highlight(annot, wordlist))
            document.add_paragraph(highlights)
        annot = annot.next

    return highlights


def main(filepath: str) -> List:
    doc = fitz.open(filepath)

    highlights = []
    for page in doc:
        highlights += handle_page(page)

    return highlights

dir_files = [f for f in os.listdir(".") if os.path.isfile(os.path.join(".", f))]
print(dir_files)
document = Document()
for file in dir_files:  # look at every file in the current directory
    if file.endswith('.pdf'):  # if it is a PDF, use it
        print('Working on converting: ' + file)

        main(file)
        document.save(file.replace(".pdf",".docx"))

I have to say I didn't write the part for getting the text out of the highlights. I got it here.

The problem is the list it creates gets repeated items. Here's a sample of the output to the .docx file:

PÁGINA NÚMERO 0: En los primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia electronica han revelado la presencia del citoesqueleto, una red de fibras microscopia 6ptica como en la microscopia electronica han revelado la presencia del citoesqueleto, una red de fibras que se extiende a través del citoplasma (fig. 6-20). El citoesqueleto, revelado la presencia del citoesqueleto, una red de fibras que se extiende a través del citoplasma (fig. 6-20). El citoesqueleto, que desempena un papel importante en la organizacion a través del citoplasma (fig. 6-20). que desempena un papel importante en la organizacion estructuras y las actividades de la célula, PÁGINA NÚMERO 0: En los primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia electronica han revelado la presencia del citoesqueleto, una red de fibras microscopia 6ptica como en la microscopia electronica han revelado la presencia del citoesqueleto, una red de fibras que se extiende a través del citoplasma (fig. 
6-20). El citoesqueleto, revelado la presencia del citoesqueleto, una red de

As you can see it repeats the same thing more than once.

I think it's because my pdf is split in two like this:

[Image: screenshot of the PDF page layout]

But it would be an inconvenience to split every page in half and create a longer PDF. Another problem could be that I OCR'd this PDF, which may be causing issues (this is why the text output is slightly different from the PDF, but I'm fine with that).

So I'm looking for a way to check if the "highlights" list has repeated items and delete them. Or maybe check before they get added to the list and not add them. But I'm not that experienced in programming so I'm asking for your help!

Any help is appreciated!

And sorry for any bad English!

Edit 1:

I've now tried doing this:

...
def main(filepath: str) -> List:
    doc = fitz.open(filepath)

    highlights = []
    for page in doc:
        highlights += handle_page(page)
        highlights = set(highlights)
        highlights = list(highlights)
        document.add_paragraph(highlights)

    return highlights
...

But it doesn't work. It even changes the order of the items because it deletes stuff that was added first and I don't want that.

Edit 2:

I think I found what's giving me trouble.

I did print(sentences) before they get joined into "sentence" and this is what I get:

['En los primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte', 'primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados', 'pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia', 'en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia electronica han revelado la presencia del citoesqueleto, una red de fibras', 'microscopia 6ptica como en la microscopia electronica han revelado la presencia del citoesqueleto, una red de fibras que se extiende a través del citoplasma (fig. 6-20). El citoesqueleto,', 'revelado la presencia del citoesqueleto, una red de fibras que se extiende a través del citoplasma (fig. 6-20). El citoesqueleto, que desempena un papel importante en la organizacion', 'a través del citoplasma (fig. 6-20). que desempena un papel importante en la organizacion estructuras y las actividades de la célula,']

As you can see the items inside "sentences" contain one another so even if I used set(sentences) it wouldn't work. The OCR had something to do with that I'm pretty sure.

So now I think I need to shift my focus to cross-referencing each item inside "sentences".

Like when doing A+B-(A∩B)=C. This means C wouldn't have duplicates and, if the order is correct, would make a comprehensible sentence. But I have no idea whether there's even a way to accomplish this.

I also learned this to eliminate dupes and still keep the order of a list: list(dict.fromkeys())
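A minimal sketch of that idiom, using made-up list values. It relies on dicts keeping insertion order (guaranteed since Python 3.7) and only removes exact duplicates, so it would not help with items that merely overlap:

```python
# Hypothetical sample data: the same strings appear more than once.
highlights = ["h1", "h2", "h1", "h3", "h2"]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique = list(dict.fromkeys(highlights))
print(unique)  # ['h1', 'h2', 'h3']
```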

Edit 3:

Only running this code:

def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
    points = annot.vertices
    quad_count = int(len(points) / 4)
    sentences = []
    for i in range(quad_count):
        # where the highlighted part is
        r = fitz.Quad(points[i * 4: i * 4 + 4]).rect

        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences.append(" ".join(w[4] for w in words))
    print(sentences)
    sentence = " ".join(sentences)
    print(sentence)
    return sentence


def handle_page(page):
    wordlist = page.getText("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0])) 
    separator = "PÁGINA NÚMERO " + str(page.number) + ": "

    highlights = []
    annot = page.firstAnnot
    while annot:
        if annot.type[0] == 8:
            highlights.append(separator)
            highlights.append(_parse_highlight(annot, wordlist))
        annot = annot.next
    return highlights


def main(filepath: str) -> List:
    doc = fitz.open(filepath)

    highlights = []
    for page in doc:
        highlights += handle_page(page)
    print(highlights)
    return highlights

main("example.pdf")

This is the only thing I highlighted:

[Image: screenshot of the highlighted text]

Here's what the terminal says:

['El citoesqueleto es una red de fibras que organiza las estructuras', 'citoesqueleto es una red de que organiza las estructuras y las actividades', 'las estructuras y las actividades de la célula', 'En los primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte', 'primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados', 'pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia', 'en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia electronica han revelado la presencia del citoesqueleto, una red de fibras', 'microscopia 6ptica como en la microscopia revelado la presencia del citoesqueleto, extiende a través del citoplasma']

El citoesqueleto es una red de fibras que organiza las estructuras citoesqueleto es una red de que organiza las estructuras y las actividades las estructuras y las actividades de la célula En los primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia electronica han revelado la presencia del citoesqueleto, una red de fibras microscopia 6ptica como en la microscopia revelado la presencia del citoesqueleto, extiende a través del citoplasma

['PÁGINA NÚMERO 0: ', 'El citoesqueleto es una red de fibras que organiza las estructuras citoesqueleto es una red de que organiza las estructuras y las actividades las estructuras y las actividades de la célula En los primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte primeros tiempos de la microscopia electronica los bi- logos pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados pensaban que los organulos de una célula eucarionte flota- ban libremente en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia en el citosol. Pero los progresos realizados tanto en la microscopia 6ptica como en la microscopia electronica han revelado la presencia del citoesqueleto, una red de fibras microscopia 6ptica como en la microscopia revelado la presencia del citoesqueleto, extiende a través del citoplasma']

Now I see that some text I didn't highlight is also being turned into a "sentences" item. Again, I believe this to be the OCR at fault or maybe I could tune the pymupdf better idk.

3 Answers

0

You can use a Python set to remove duplicates. This should be done after the loop and before the return.

highlights = ['h1', 'h2', 'h1']
no_dups_highlights = list(set(highlights))
print(no_dups_highlights)

output

['h2', 'h1']
balderman
  • Thanks for the response. The thing is I need the end result to be ['h1', 'h2'] and not the other way around, because when I create a new paragraph I need the "PÁGINA NÚMERO" (this is the page number) to be there so I can know which page I got that highlighted text from. Is there a way to do that? – JupiterJones Aug 14 '21 at 14:22
  • if you have something like `highlights = [('h1', 12), ('h2', 45), ('h1', 78)]` (where the number is the page number) - which h1 would you like to keep? – balderman Aug 14 '21 at 14:25
  • Well but that's different from what the code creates right? If the "h" is the text then there wouldn't be another page number with the same text. Also, I've updated the question with something new. – JupiterJones Aug 14 '21 at 14:58
  • Try to come up with sample code that demonstrates the issue of duplicates in a list. Remove all lines of code that are not relevant. As of now, it is not clear to me. – balderman Aug 14 '21 at 15:00
  • I updated the post. Is that what you're talking about? – JupiterJones Aug 14 '21 at 16:01
0

So set shuffles the members; you can use it, but only as a helper.

There are several ways to do this:

  • Use a set to keep track of what you've already seen:
seen = set()
filtered = []
for highlight in highlights:
    if highlight not in seen:
        filtered.append(highlight)
        seen.add(highlight)
  • If you're using a sufficiently recent version of Python, dict maintains order:
filtered = list({highlight: None for highlight in highlights}.keys())
  • Ignore the theoretical slowness and just code it in the most straightforward way; it'll likely be fast enough in practice:
filtered = []
for highlight in highlights:
    if highlight not in filtered:
        filtered.append(highlight)

All these are for exact matching; if you end up needing approximate matching, you'll probably have to use the last approach anyway, with some sort of similarity measure.
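As a rough sketch of what approximate matching could look like, the last approach can be combined with `difflib.SequenceMatcher` from the standard library. The function name `filter_near_duplicates`, the 0.9 threshold, and the sample strings are assumptions for illustration, not something from the question:

```python
from difflib import SequenceMatcher


def filter_near_duplicates(items, threshold=0.9):
    """Keep the first occurrence of each item, dropping later near-matches."""
    kept = []
    for item in items:
        # ratio() is 1.0 for identical strings and falls towards 0.0 as they differ.
        if not any(SequenceMatcher(None, item, k).ratio() >= threshold for k in kept):
            kept.append(item)
    return kept


# Hypothetical sample: the second string differs from the first by one character.
highlights = ["una red de fibras", "una red de fibras.", "el citosol"]
print(filter_near_duplicates(highlights))  # ['una red de fibras', 'el citosol']
```

This compares every item against every kept item, so it is quadratic in the list length. That is probably acceptable for a handful of highlights per page, and the threshold would need tuning for real OCR output.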

Jiří Baum
  • Thanks for the response! I think I should change the question at this point. The thing is I want to merge all the "sentences" items into a "sentence" string but avoid merging the parts of the items in "sentences" that repeat. If I were to use the last approach, implementing a similarity measure, it could lead to errors in the final "sentence", as the amount of overlap between items isn't consistent. For the time being I'm going to set this code aside, as I don't have any more free time to work on it. I'm going to use Sumnotes to extract the highlighted text for now. – JupiterJones Aug 15 '21 at 09:45
  • Right, finding (near-) overlaps among text to stitch the whole text together from fragments is a completely different question. There must be libraries and techniques out there; it's used on DNA sequences, which get read in fragments, need to be assembled and are huge – Jiří Baum Aug 15 '21 at 09:50
  • I edited the question and added a simpler example. I think that sums up pretty well what I want to do. Maybe I could also look into improving the usage of the pymupdf library as that could also be the cause of the repeated words, I'm just assuming it's the OCR. – JupiterJones Aug 15 '21 at 10:05
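For the simplified example in the question, the stitching discussed in these comments could be sketched as below. This is only an illustration under a strong assumption: each fragment must begin with exactly the words that end the text built so far. The helper name `merge_overlapping` is made up, and it would not cope with the OCR output shown in the question, where fragments drop or skip words.

```python
def merge_overlapping(fragments):
    """Join fragments, skipping the words each one repeats from the end of the text so far."""
    merged = []
    for fragment in fragments:
        words = fragment.split()
        # Look for the longest suffix of `merged` that equals a prefix of `words`.
        overlap = 0
        for size in range(min(len(merged), len(words)), 0, -1):
            if merged[-size:] == words[:size]:
                overlap = size
                break
        # Append only the words that are not already at the end of `merged`.
        merged.extend(words[overlap:])
    return " ".join(merged)


sentences = ['Hello, how are', 'how are you', 'you doing?']
print(merge_overlapping(sentences))  # Hello, how are you doing?
```

Comparing whole words rather than characters avoids accidental matches inside a word, at the cost of being sensitive to punctuation and OCR differences between fragments.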
0

I've implemented another method of getting the 'sentences' list out of the pdf and now it works as intended.

This is the code:

from typing import List, Tuple

import fitz
import os
from docx import Document

_threshold_intersection = 0.9

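def _check_contain(r_word, points):
    # Clip the highlight rectangle to the word's rectangle (intersect() modifies r in place).
    r = fitz.Quad(points).rect
    r.intersect(r_word)

    # Count the word only if at least 90% of its area lies inside the highlight.
    return r.getArea() >= r_word.getArea() * _threshold_intersection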


def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
    quad_points = annot.vertices
    quad_count = int(len(quad_points) / 4)
    sentences = ['' for i in range(quad_count)]

    for i in range(quad_count):
        points = quad_points[i * 4: i * 4 + 4]
        words = [
            w for w in wordlist if
            _check_contain(fitz.Rect(w[:4]), points)
        ]
        sentences[i] = ' '.join(w[4] for w in words)
    sentence = ' '.join(sentences)
    document.add_paragraph(sentence)
    return sentence

def handle_page(page):
    wordlist = page.getText("words")
    wordlist.sort(key=lambda w: (w[3], w[0]))

    highlights = []
    annot = page.firstAnnot
    while annot:
        if annot.type[0] == 8:
            highlights.append(_parse_highlight(annot, wordlist))
        annot = annot.next

    return highlights

def main(filepath: str) -> List:
    doc = fitz.open(filepath)

    highlights = []
    for page in doc:
        document.add_paragraph("Page Number " + str(page.number+1) + ": ")
        highlights += handle_page(page)

    return highlights

dir_files = [f for f in os.listdir(".") if os.path.isfile(os.path.join(".", f))]
for file in dir_files:
    if file.endswith('.pdf'):
        print('Working on converting: ' + file)
        # Start a fresh document per PDF so earlier files' highlights are not carried over.
        document = Document()
        main(file)
        document.save(file.replace(".pdf",".docx"))

Got this new method here.
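My understanding of why this helps, sketched without PyMuPDF: the original code kept every word whose box merely touched the highlight rectangle, which likely pulled in words from the lines above and below. The new check keeps a word only if most of its area is inside the highlight. The function `overlap_ratio` and the coordinates below are made up for illustration; boxes are `(x0, y0, x1, y1)` tuples.

```python
def overlap_ratio(word, quad):
    """Fraction of the word box (x0, y0, x1, y1) that is covered by the highlight box."""
    wx0, wy0, wx1, wy1 = word
    qx0, qy0, qx1, qy1 = quad
    # Width and height of the intersection, clamped to zero when the boxes do not overlap.
    width = max(0.0, min(wx1, qx1) - max(wx0, qx0))
    height = max(0.0, min(wy1, qy1) - max(wy0, qy0))
    word_area = (wx1 - wx0) * (wy1 - wy0)
    return (width * height) / word_area if word_area else 0.0


highlight = (0, 10, 100, 20)        # hypothetical highlighted line
word_on_line = (5, 10, 30, 20)      # lies fully inside the highlight
word_line_above = (5, 1, 30, 11)    # only its bottom edge overlaps the highlight

print(overlap_ratio(word_on_line, highlight))     # 1.0
print(overlap_ratio(word_line_above, highlight))  # 0.1
```

Both boxes intersect the highlight, so a plain intersection test would keep both words. With a 0.9 threshold only the first one passes.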