Consider the following article

https://arxiv.org/pdf/2101.05907.pdf

It is a typically formatted academic paper; the PDF file contains only two figures.

The following code was used to extract the text and equations from the paper:

#Related code explanation: https://stackoverflow.com/questions/45470964/python-extracting-text-from-webpage-pdf
import io
import requests
url = "https://arxiv.org/pdf/2101.05907.pdf"  # the paper linked above
r = requests.get(url)
f = io.BytesIO(r.content)

#Related code explanation: https://stackoverflow.com/questions/45795089/how-can-i-read-pdf-in-python
import PyPDF2
fileReader = PyPDF2.PdfFileReader(f)

#Related code explanation: https://automatetheboringstuff.com/chapter13/
print(fileReader.getPage(0).extractText())

However, the result was not quite correct:

Bohmpotentialforthetimedependentharmonicoscillator
FranciscoSoto-Eguibar
1
,FelipeA.Asenjo
2
,SergioA.Hojman
3
andH
´
ectorM.
Moya-Cessa
1
1
InstitutoNacionaldeAstrof´
´
OpticayElectr´onica,CalleLuisEnriqueErroNo.1,SantaMar´Tonanzintla,
Puebla,72840,Mexico.
2
FacultaddeIngenier´yCiencias,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
3
DepartamentodeCiencias,FacultaddeArtesLiberales,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
DepartamentodeF´FacultaddeCiencias,UniversidaddeChile,Santiago7800003,Chile.
CentrodeRecursosEducativosAvanzados,CREA,Santiago7500018,Chile.
Abstract.
IntheMadelung-Bohmapproachtoquantummechanics,weconsidera(timedependent)phasethatdependsquadrati-
callyonpositionandshowthatitleadstoaBohmpotentialthatcorrespondstoatimedependentharmonicoscillator,providedthe
timedependentterminthephaseobeysanErmakovequation.
Introduction
Harmonicoscillatorsarethebuildingblocksinseveralbranchesofphysics,fromclassicalmechanicstoquantum
mechanicalsystems.Inparticular,forquantummechanicalsystems,wavefunctionshavebeenreconstructedasisthe
caseforquantizedincavities[1]andforion-laserinteractions[2].Extensionsfromsingleharmonicoscillators
totimedependentharmonicoscillatorsmaybefoundinshortcutstoadiabaticity[3],quantizedpropagatingin
dielectricmedia[4],Casimire

ect[5]andion-laserinteractions[6],wherethetimedependenceisnecessaryinorder
totraptheion.
Timedependentharmonicoscillatorshavebeenextensivelystudiedandseveralinvariantshavebeenobtained[7,8,9,
10,11].Alsoalgebraicmethodstoobtaintheevolutionoperatorhavebeenshown[12].Theyhavebeensolvedunder
variousscenariossuchastimedependentmass[12,13,14],timedependentfrequency[15,11]andapplicationsof
invariantmethodshavebeenstudiedindi

erentregimes[16].Suchinvariantsmaybeusedtocontrolquantumnoise
[17]andtostudythepropagationoflightinwaveguidearrays[18,19].Harmonicoscillatorsmaybeusedinmore
generalsystemssuchaswaveguidearrays[20,21,22].
Inthiscontribution,weuseanoperatorapproachtosolvetheone-dimensionalSchr
¨
odingerequationintheBohm-
Madelungformalismofquantummechanics.ThisformalismhasbeenusedtosolvetheSchr
¨
odingerequationfor
di

erentsystemsbytakingtheadvantageoftheirnon-vanishingBohmpotentials[23,24,25,26].Alongthiswork,
weshowthatatimedependentharmonicoscillatormaybeobtainedbychoosingapositiondependentquadratictime
dependentphaseandaGaussianamplitudeforthewavefunction.Wesolvetheprobabilityequationbyusingoperator
techniques.Asanexamplewegivearationalfunctionoftimeforthetimedependentfrequencyandshowthatthe
Bohmpotentialhasdi

erentbehaviorforthatfunctionalitybecauseanauxiliaryfunctionneededinthescheme,
namelythefunctionsthatsolvestheErmakovequation,presentstwodi

erentsolutions.
One-dimensionalMadelung-Bohmapproach
ThemainequationinquantummechanicsistheSchrodingerequation,thatinonedimensionandforapotential
V
(
x
;
t
)
iswrittenas(forsimplicity,weset
}
=
1)
i
@ 
(
x
;
t
)
@
t
=

1
2
m
@
2
 
(
x
;
t
)
@
x
2
+
V
(
x
;
t
)
 
(
x
;
t
)
(1)
arXiv:2101.05907v1  [quant-ph]  14 Jan 2021

As shown:

  1. The spacing disappeared (in the title, for example), leaving meaningless run-together strings.
  2. The LaTeX equations were extracted incorrectly, and it got worse on the second page.

How can I fix this and correctly extract the text and equations from a PDF file that was generated from LaTeX?

mzjn
  • Did you find a solution for your problem? I found a PDF that solves my problem, but the equation goes out of bounds and I can't parse it with PyPDF2 :/ – Guilherme Correa Apr 12 '22 at 18:31
  • Exactly what text output do you desire for the math formulas within this input PDF? Do you want LaTeX source like `$\hbar=1$`? That may work with some extractor, but anything beyond that (like subscripts, fractions, sums, roots) will probably not work. Implementing such an extraction is quite hard (because the relevant structural info is not in the PDF), the resulting code will be fragile (i.e. it won't work with all PDFs), and there are probably not enough buyers for such a feature. – pts Jul 24 '23 at 14:13
  • @pts An extractor that converts back to a standard language such as LaTeX, or in fact any other language such as the Microsoft Unicode, should do the trick. The PDF's own representation code, which the PDF reader uses to render the equation on screen, could be used too, since it's just going to be a linear map between those. – ShoutOutAndCalculate Jul 24 '23 at 23:56

1 Answer


In the meantime, PyPDF2 has been deprecated. Use pypdf instead (I'm the maintainer of both; see the migration guide).

We don't have anything specific for equations, but text extraction in general works like this:

import io
import requests
from pypdf import PdfReader

# Download content
url = "https://arxiv.org/pdf/2101.05907.pdf"
r = requests.get(url)
f = io.BytesIO(r.content)

# Extract text
reader = PdfReader(f)
print(reader.pages[0].extract_text())

For the last paragraph on the first page (shown as a screenshot in the original post), pypdf gives:

The main equation in quantum mechanics is the Schrodinger equation, that in one dimension and for a potential V(x;t)
is written as (for simplicity, we set }=1)
i@ (x;t)
@t=1
2m@2 (x;t)
@x2+V(x;t) (x;t) (1)

You can see that the text is fine, but the math characters and the equation structure are not represented well.

Math text extraction will surely stay suboptimal for a long time, but I've opened a ticket to improve the text extraction (the partial, phi, maybe also the hbar): https://github.com/py-pdf/pypdf/issues/2009
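
Until the extractor handles these glyphs, the wrong characters can be patched after extraction with a per-document substitution table. The sketch below assumes, based on the output above, that `@` stands for the partial derivative sign and `}` for h-bar in this particular PDF. The table and the helper name `patch_math_symbols` are hypothetical and not part of pypdf. It only swaps single characters and cannot recover structure such as fractions or superscripts.

```python
# Hypothetical per-document table: extracted character -> intended symbol.
# The entries are assumptions read off the output above; other PDFs will
# need different entries.
MATH_FIXES = str.maketrans({
    "@": "\u2202",  # partial derivative sign
    "}": "\u210f",  # reduced Planck constant (h-bar)
})


def patch_math_symbols(line: str) -> str:
    """Replace known mis-extracted characters in a single equation line."""
    return line.translate(MATH_FIXES)


if __name__ == "__main__":
    print(patch_math_symbols("is written as (for simplicity, we set }=1)"))
    print(patch_math_symbols("@t=1"))
```

Apply it only to lines known to belong to an equation, since `@` and `}` can be legitimate characters elsewhere (in e-mail addresses, for example).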

See also: Why text extracting is hard. In summary: pypdf will hopefully get better at extracting Greek letters.

Full extraction with pypdf==3.13.0:

Bohm potential for the time dependent harmonic oscillator
Francisco Soto-Eguibar1, Felipe A. Asenjo2, Sergio A. Hojman3and H ´ector M.
Moya-Cessa1
1Instituto Nacional de Astrof´ ısica, ´Optica y Electr´ onica, Calle Luis Enrique Erro No. 1, Santa Mar´ ıa Tonanzintla,
Puebla, 72840, Mexico.
2Facultad de Ingenier´ ıa y Ciencias, Universidad Adolfo Ib´ a˜ nez, Santiago 7491169, Chile.
3Departamento de Ciencias, Facultad de Artes Liberales, Universidad Adolfo Ib´ a˜ nez, Santiago 7491169, Chile.
Departamento de F´ ısica, Facultad de Ciencias, Universidad de Chile, Santiago 7800003, Chile.
Centro de Recursos Educativos Avanzados, CREA, Santiago 7500018, Chile.
Abstract. In the Madelung-Bohm approach to quantum mechanics, we consider a (time dependent) phase that depends quadrati-
cally on position and show that it leads to a Bohm potential that corresponds to a time dependent harmonic oscillator, provided the
time dependent term in the phase obeys an Ermakov equation.
Introduction
Harmonic oscillators are the building blocks in several branches of physics, from classical mechanics to quantum
mechanical systems. In particular, for quantum mechanical systems, wavefunctions have been reconstructed as is the
case for quantized fields in cavities [1] and for ion-laser interactions [2]. Extensions from single harmonic oscillators
to time dependent harmonic oscillators may be found in shortcuts to adiabaticity [3], quantized fields propagating in
dielectric media [4], Casimir e 
                                ect [5] and ion-laser interactions [6], where the time dependence is necessary in order
to trap the ion.
Time dependent harmonic oscillators have been extensively studied and several invariants have been obtained [7, 8, 9,
10, 11]. Also algebraic methods to obtain the evolution operator have been shown [12]. They have been solved under
various scenarios such as time dependent mass [12, 13, 14], time dependent frequency [15, 11] and applications of
invariant methods have been studied in di 
                                          erent regimes [16]. Such invariants may be used to control quantum noise
[17] and to study the propagation of light in waveguide arrays [18, 19]. Harmonic oscillators may be used in more
general systems such as waveguide arrays [20, 21, 22].
In this contribution, we use an operator approach to solve the one-dimensional Schr ¨odinger equation in the Bohm-
Madelung formalism of quantum mechanics. This formalism has been used to solve the Schr ¨odinger equation for
di
  erent systems by taking the advantage of their non-vanishing Bohm potentials [23, 24, 25, 26]. Along this work,
we show that a time dependent harmonic oscillator may be obtained by choosing a position dependent quadratic time
dependent phase and a Gaussian amplitude for the wavefunction. We solve the probability equation by using operator
techniques. As an example we give a rational function of time for the time dependent frequency and show that the
Bohm potential has di 
                      erent behavior for that functionality because an auxiliary function needed in the scheme,
namely the functions that solves the Ermakov equation, presents two di 
                                                                       erent solutions.
One-dimensional Madelung-Bohm approach
The main equation in quantum mechanics is the Schrodinger equation, that in one dimension and for a potential V(x;t)
is written as (for simplicity, we set }=1)
i@ (x;t)
@t=1
2m@2 (x;t)
@x2+V(x;t) (x;t) (1)arXiv:2101.05907v1  [quant-ph]  14 Jan 2021
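
The detached accents in this output (`Astrof´ ısica`, `Ib´ a˜ nez`, `Schr ¨odinger`) can be repaired in a post-processing step. The following is a minimal sketch, not a pypdf feature. It assumes the accents arrive as the spacing characters U+00B4, U+00A8 and U+02DC placed before the letter they belong to, optionally separated by a space. The helper name `repair_accents` is hypothetical.

```python
import re
import unicodedata

# Spacing accents as assumed to appear in the extracted text, mapped to
# the Unicode combining marks that attach to the preceding base letter.
COMBINING = {
    "\u00b4": "\u0301",  # acute accent
    "\u00a8": "\u0308",  # diaeresis
    "\u02dc": "\u0303",  # tilde
}

# Dotless letters have no precomposed accented forms, so use the dotted ones.
DOTLESS = {"\u0131": "i", "\u0237": "j"}

# Optional space (only if it follows a letter), accent, optional space, letter.
ACCENT_BEFORE_LETTER = re.compile(
    r"((?<=[^\W\d_]) )?([\u00b4\u00a8\u02dc]) ?([^\W\d_])"
)


def _join(match: re.Match) -> str:
    gap, accent, letter = match.groups()
    base = DOTLESS.get(letter, letter)
    composed = unicodedata.normalize("NFC", base + COMBINING[accent])
    # Heuristic: an uppercase accented letter probably starts a new word,
    # so keep the space that preceded the accent in that case.
    prefix = " " if gap and letter.isupper() else ""
    return prefix + composed


def repair_accents(text: str) -> str:
    """Attach detached accents to the letter that follows them."""
    return ACCENT_BEFORE_LETTER.sub(_join, text)
```

For example, `repair_accents("Schr \u00a8odinger")` joins the pieces into a single word with a precomposed `ö`. The rule for the leading space is only a heuristic: a lowercase accented letter at the start of a word would be glued to the previous word, so the result should be checked against the original PDF.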
Martin Thoma