Extracting bold text from resumes (.docx, .doc, PDF) using Python

Question

I have thousands of resumes in various formats: Word (.doc, .docx) and PDF.

I want to extract the bold text from these documents using the textract library in Python. Is there a way to do this with textract?

I currently use textract to extract text from any document type, such as PDF or Word. Is it possible to extract only the lines that are bold? – Gayathri, Sep 01 '18 at 06:28

score 4 · Answer 1 · edited May 20 '22 at 13:40

An easy solution is to use the python-docx package. Install it with pip install python-docx (prefix the command with ! when running it in a notebook).

python-docx only reads .docx files, so you'll need to convert your PDF files to .docx first. You can do that with an online PDF-to-DOCX converter or with Python.

The following lines of code extract all bold and italic content from a resume and store it in a dictionary called boltalic_Dict. You can retrieve either list later.
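With thousands of mixed-format resumes, it may help to first see which files can be read directly and which need converting. The sketch below is only an illustration using the standard library; the function name group_by_extension and the folder layout are assumptions, not part of python-docx or textract.

```python
from collections import defaultdict
from pathlib import Path


def group_by_extension(folder):
    """Map each lowercase file extension to the files found under folder."""
    groups = defaultdict(list)
    for path in Path(folder).rglob("*"):
        if path.is_file():
            # Normalise case so ".PDF" and ".pdf" land in the same group.
            groups[path.suffix.lower()].append(path)
    return dict(groups)
```

Files under the ".docx" key can be processed as they are; everything else would have to go through whatever conversion step you choose.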

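from docx import Document

# python-docx opens one .docx file at a time
document = Document('path_to_your_files')
bolds = []
italics = []
for para in document.paragraphs:
    for run in para.runs:
        # run.italic / run.bold are True only when the formatting is applied
        # directly to the run; None means it is inherited from the style
        if run.italic:
            italics.append(run.text)
        if run.bold:
            bolds.append(run.text)

boltalic_Dict = {'bold_phrases': bolds,
                 'italic_phrases': italics}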
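For anyone curious what the bold check corresponds to inside the file: a .docx is a zip archive, and a directly bolded run carries a w:b element in its run properties inside word/document.xml. The following is a rough standard-library sketch of the same idea, not how python-docx is implemented. The function name bold_runs is made up for illustration, and the sketch ignores bold that is inherited from a paragraph or character style.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def bold_runs(docx_file):
    """Return the text of runs that carry an explicit bold flag."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    found = []
    for run in root.iter(f"{{{W}}}r"):
        flag = run.find("w:rPr/w:b", NS)
        if flag is None:
            continue  # no direct bold setting on this run
        # <w:b w:val="0"/> (or "false"/"off") switches bold off explicitly.
        if flag.get(f"{{{W}}}val", "true") in ("0", "false", "off"):
            continue
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        if text:
            found.append(text)
    return found
```

Calling bold_runs with a path to a .docx file (or an open binary file object) returns a list of strings, one per bold run.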

edited May 20 '22 at 13:40 · Dharman
answered Sep 11 '18 at 02:25 · Zia

I have files in different formats: .doc, .docx, PDF, and .rtf. That is why I am using the textract library. Is it possible to extract the same information using textract? – Gayathri Sep 11 '18 at 06:26
I posted revised code in my answer, in case this code also splits contiguous bold/italic text for others. – Recap_Hessian Jun 28 '21 at 15:04

score 1 · Answer 2 · answered Jun 28 '21 at 15:01

Building on m.borhan's answer: their code failed to output some contiguous bold and italic portions as a single item, so this version joins consecutive runs before storing them:

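from docx import Document

document = Document('path_to_your_files')
bolds = []
italics = []
for para in document.paragraphs:
    last_bold = ""    # bold text collected from consecutive runs
    last_italic = ""  # italic text collected from consecutive runs
    for run in para.runs:
        # bold and italic are tracked separately, so a run that is both
        # bold and italic is added to both phrases
        if run.bold:
            last_bold += run.text
        elif last_bold:
            # a non-bold run ends the current bold phrase
            bolds.append(last_bold)
            last_bold = ""
        if run.italic:
            last_italic += run.text
        elif last_italic:
            italics.append(last_italic)
            last_italic = ""
    # store phrases that run to the end of the paragraph
    if last_bold:
        bolds.append(last_bold)
    if last_italic:
        italics.append(last_italic)

boltalic_Dict = {'bold_phrases': bolds,
                 'italic_phrases': italics}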
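The joining step can also be tried in isolation, without opening a document. Here is a minimal sketch that works on plain (text, flag) pairs and uses itertools.groupby from the standard library; the helper name merge_flagged is hypothetical, and the flag stands in for run.bold or run.italic.

```python
from itertools import groupby


def merge_flagged(runs):
    """Join the text of consecutive flagged runs into one phrase each."""
    phrases = []
    # groupby splits the sequence wherever the flag changes; a flag of
    # None (inherited formatting) is treated the same as False here.
    for flagged, group in groupby(runs, key=lambda pair: bool(pair[1])):
        if flagged:
            phrases.append("".join(text for text, _ in group))
    return phrases
```

With python-docx, the input for one paragraph could be built as ((run.text, run.bold) for run in para.runs), so each paragraph's bold phrases come back as separate list items.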
