How to extract the title of a PDF document from within a script for renaming?

Question

I have thousands of PDF files on my computer, named a0001.pdf to a3621.pdf. Each one contains a title, e.g. "aluminum carbonate" for a0001.pdf, "aluminum nitrate" for a0002.pdf, etc., which I'd like to extract and use to rename the files.

I use this program to rename a file:

import os

path=r"C:\Users\YANN\Desktop\..."

old='string 1'
new='string 2'

def rename(path,old,new):
    for f in os.listdir(path):
        os.rename(os.path.join(path, f), os.path.join(path, f.replace(old, new)))

rename(path,old,new)

Is there a way to extract the title embedded in each PDF file so that I can use it to rename the file?

You already know how to rename a bunch of files with custom logic. What you don't know is how to *extract the title* for each pdf. That will depend on how those pdf were produced... There are already [a few Q/As](https://stackoverflow.com/q/26494211/6730571) that address how to extract text from pdf using python. Alternatively, perhaps the files have metadata that give away the title... If you could share a sample (one file), maybe someone could help. — Hugues M., Jun 24 '17 at 11:48
So you want to know how to extract the title of a PDF document? How is that title embedded, in the text (first header) or also in the metadata? — Martijn Pieters, Jun 24 '17 at 19:24
Instead of having Python do the rename, I'd have Python write all the commands to a file: `mv oldname newname`. Review that file, make manual edits, then source it. That will save you trouble with, e.g., renaming many files to `(no title).pdf` or other edge cases. — Hugues Fontenelle, Jun 29 '17 at 20:09
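
The review-before-renaming idea from the last comment could be sketched in Python roughly as follows. This is only a sketch under assumptions: `write_rename_plan` and the `plan` dictionary are hypothetical names, the titles are assumed to have been extracted already, and the generated `mv` lines assume a POSIX shell (on Windows the commands would have to be adapted).

```python
import shlex


def write_rename_plan(plan, script_path):
    """Write one `mv -- old new` line per entry of plan (old name -> new name).

    Nothing is renamed here; the resulting file is meant to be reviewed
    and edited by hand before it is run.
    """
    with open(script_path, "w", encoding="utf-8") as script:
        for old_name, new_name in sorted(plan.items()):
            # shlex.quote protects spaces and shell metacharacters in titles
            script.write(
                "mv -- {} {}\n".format(shlex.quote(old_name), shlex.quote(new_name))
            )
```

Calling `write_rename_plan({"a0001.pdf": "aluminum carbonate.pdf"}, "rename_plan.sh")` writes a single `mv` line to `rename_plan.sh`, which can be inspected before anything on disk changes.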

score 20 · Accepted Answer · edited Nov 26 '19 at 13:58

Installing the package

This cannot be solved with plain Python. You will need an external package such as pdfrw, which allows you to read PDF metadata. The installation is quite easy using the standard Python package manager pip.

On Windows, first make sure you have a recent version of pip using the shell command:

python -m pip install -U pip

On Linux:

pip install -U pip

On both platforms, then install the pdfrw package using

pip install pdfrw

The code

I combined the approaches of zeebonk and user2125722 to write something compact and readable that stays close to your original code:

import os
from pdfrw import PdfReader

path = r'C:\Users\YANN\Desktop'


def renameFileToPDFTitle(path, fileName):
    fullName = os.path.join(path, fileName)
    # Extract pdf title from pdf file (Info or Title may be missing)
    info = PdfReader(fullName).Info
    title = info.Title if info is not None else None
    if not title:
        return
    # Remove surrounding brackets that some pdf titles have
    newName = title.strip('()')
    if not newName:
        return
    newFullName = os.path.join(path, newName + '.pdf')
    os.rename(fullName, newFullName)


for fileName in os.listdir(path):
    # Rename only pdf files
    fullName = os.path.join(path, fileName)
    if not os.path.isfile(fullName) or not fileName.lower().endswith('.pdf'):
        continue
    renameFileToPDFTitle(path, fileName)

This is very useful, but it's worth mentioning that many PDFs do not have Info.Title. Of 312 fairly random journal articles I checked, more than 1/3 don't have it. But this is great for those that do. — TextGeek, Apr 16 '18 at 13:47
Please do not instruct users to use `sudo` with `pip install`. It is a security issue (see [here](https://askubuntu.com/a/802594/198237)). — Ciprian Tomoiagă, Nov 26 '19 at 13:58

score 11 · Answer 2 · answered Jun 24 '17 at 19:21

What you need is a library that can actually read PDF files. For example pdfrw:

In [8]: from pdfrw import PdfReader

In [9]: reader = PdfReader('example.pdf')

In [10]: reader.Info.Title
Out[10]: 'Example PDF document'

zeebonk

score 4 · Answer 3 · answered Jun 29 '17 at 10:59

You can use the pdfminer library to parse the PDFs. The info property contains the title of the PDF. Here is what a sample info looks like:

[{'CreationDate': "D:20170110095753+05'30'", 'Producer': 'PDF-XChange Printer V6 (6.0 build 317.1) [Windows 10 Enterprise x64 (Build 10586)]', 'Creator': 'PDF-XChange Office Addin', 'Title': 'Python Basics'}]

Then we can extract the Title using the properties of a dictionary. Here is the whole code (including iterating all the files and renaming them):

import os

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

for i in range(1, 3622):
    # Files are named a0001.pdf ... a3621.pdf
    file_name = "a" + str(i).zfill(4) + ".pdf"
    with open(file_name, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        metadata = doc.info  # The "Info" metadata (a list of dictionaries)
    print(metadata)
    if not metadata:
        continue
    title = metadata[0].get("Title")
    if isinstance(title, bytes):
        # On Python 3 the values can be bytes; decoding them as UTF-8
        # is an assumption about how the title was encoded
        title = title.decode("utf-8", errors="replace")
    if title:
        os.rename(file_name, title + ".pdf")

score 3 · Answer 4 · answered Jun 25 '17 at 02:47

You can look at only the metadata using a ghostscript tool pdf_info.ps. It used to ship with ghostscript but is still available at https://r-forge.r-project.org/scm/viewvc.php/pkg/inst/ghostscript/pdf_info.ps?view=markup&root=tm

mikep

score 0 · Answer 5 · answered Dec 03 '19 at 13:58

Building on Ciprian Tomoiagă's suggestion of using pdfrw, I've uploaded a script which also:

- renames files in sub-directories
- adds a command-line interface
- handles the case where the file name already exists by appending a random string
- strips any character that is not alphanumeric from the new file name
- replaces non-ASCII characters (such as á è í ò ç...) with ASCII ones (a e i o c) in the new file name
- lets you set the root directory and limit the length of the new file name from the command line
- shows a progress bar and, after the script has finished, some statistics
- does some error handling
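
For illustration, the name-cleaning and collision-handling steps in the list above might look roughly like this using only the standard library. This is a sketch, not the code from the repository: `safe_stem` and `unique_path` are hypothetical names, and unlike the description above this variant keeps spaces, hyphens and underscores rather than stripping everything non-alphanumeric.

```python
import re
import secrets
import unicodedata
from pathlib import Path


def safe_stem(title, max_length=120):
    """Reduce a title to ASCII letters, digits, spaces, hyphens and underscores."""
    # NFKD splits "á" into "a" plus a combining accent; the ASCII encode drops the accent
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    # Replace every other character with a space, then collapse runs of whitespace
    cleaned = re.sub(r"[^A-Za-z0-9 _-]", " ", ascii_title)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Truncate overly long titles
    return cleaned[:max_length].rstrip()


def unique_path(directory, stem, suffix=".pdf"):
    """Return a path inside directory that does not exist yet."""
    candidate = Path(directory) / (stem + suffix)
    while candidate.exists():
        # On a name collision, append a short random hex string
        candidate = Path(directory) / "{}_{}{}".format(stem, secrets.token_hex(3), suffix)
    return candidate
```

A title that cleans down to an empty string (for example one consisting only of parentheses) should be treated like a missing title and skipped.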

As TextGeek mentioned, unfortunately not all files have the title metadata, so some files won't be renamed.

Repository: https://github.com/favict/pdf_renamefy

Usage:

After downloading the files, install the dependencies by running pip:

$ pip install -r requirements.txt

and then to run the script:

$ python -m renamefy <directory> <filename maximum length>

...where directory is the full path in which to look for PDF files, and filename maximum length is the length at which the file name is truncated if the title is too long or was set incorrectly in the file.

Both parameters are optional. If neither is provided, the directory is set to the current directory and filename maximum length is set to 120 characters.
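
Two optional positional parameters with those defaults could be declared with `argparse` along the following lines. This is a hypothetical sketch for illustration, not the repository's actual code; `build_parser`, the `prog` name and the argument names are assumptions.

```python
import argparse
import os


def build_parser():
    """Build a parser with two optional positional arguments."""
    parser = argparse.ArgumentParser(
        prog="renamefy_sketch",
        description="Rename PDF files after their embedded title.",
    )
    # nargs="?" makes a positional argument optional and enables its default
    parser.add_argument(
        "directory",
        nargs="?",
        default=os.getcwd(),
        help="directory to search for PDF files (default: current directory)",
    )
    parser.add_argument(
        "max_length",
        nargs="?",
        type=int,
        default=120,
        help="maximum length of the new file name (default: 120)",
    )
    return parser
```

With this layout the length can only be given after a directory, since both arguments are positional.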

Example:

$ python -m renamefy C:\Users\John\Downloads 120

I used it on Windows, but it should work on Linux too.

Feel free to copy, fork and edit as you see fit.

score 0 · Answer 6 · answered Jun 11 '21 at 11:43

I had some issues with the solutions above, so here is my recipe:

from pathlib import Path
from pdfrw import PdfReader
import re

path_to_files = Path(r"C:\Users\Malac\Desktop\articles\Downloaded")

# Characters that Windows forbids in file names: <>:"/\|?*
# Backslashes and newlines are removed separately below
exclude_chars = '[<>:"/|?*]'

for i in path_to_files.glob("*.pdf"):

    try:
        title = PdfReader(i).Info.Title
    except Exception:
        # Unreadable file or no Info dictionary: skip it, otherwise the
        # title from the previous iteration would be reused
        continue

    # Some files have no title at all
    if not title:
        continue

    # For some reason, titles are returned in brackets - remove brackets if around titles
    if title.startswith("("):
        title = title[1:]

    if title.endswith(")"):
        title = title[:-1]

    title = re.sub(exclude_chars, "", title)
    title = re.sub(r"\\", "", title)
    title = re.sub("\n", "", title)

    # Some titles are just (), which leaves nothing after cleaning
    if not title:
        continue

    try:
        # Append the suffix by hand: with_suffix() would cut the title
        # at its last dot, e.g. "Vol. 2" would become "Vol.pdf"
        final_path = path_to_files / (title + ".pdf")
        if final_path.exists():
            continue
        i.rename(final_path)
    except Exception:
        # print(f"Name {i} incorrect.")
        pass

There is still some percentage of files whose title shows up in the Acrobat Reader properties but which pdfrw cannot parse. It's the same with pdfminer and PyPDF2, though... — Daniel Malachov, Jun 11 '21 at 12:30

score -6 · Answer 7 · answered Jun 30 '17 at 08:28

Once you have installed it, open the app and go to the Download folder. You will see your downloaded files there. Just long press the file you wish to rename and the Rename option will appear at the bottom.

ThatSkeptic
