There are two broad kinds of files: binary files and plain-text files. Some formats mix the two.
HTML files are plain-text, human-readable files that you can edit by hand. PDF files mix binary data with text, so you need special programs to edit them.
Reading from either PDF or HTML is possible. I wasn't sure whether you meant extracting the text or extracting the source code, so I'll explain both.
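To make the binary versus plain-text distinction concrete, here is a rough sketch of how you might guess which kind a file is. The `looks_binary` helper and its heuristic (a NUL byte or invalid UTF-8 in the first kilobyte) are my own assumptions, not a standard rule, and the sample files are throwaway stand-ins, not real documents.

```python
import codecs
import os
import tempfile


def looks_binary(path, sample_size=1024):
    """Guess whether a file is binary (heuristic, hypothetical helper).

    The file counts as binary if its first bytes contain a NUL byte
    or cannot be decoded as UTF-8.
    """
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if b"\x00" in sample:
        return True
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return True
    return False


def _write_temp(data, suffix):
    """Write bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


html_path = _write_temp(b"<html><body>Hello</body></html>", ".html")
pdf_like_path = _write_temp(b"%PDF-1.4\n\x00\xff\xfe fake stream", ".pdf")

html_is_binary = looks_binary(html_path)
pdf_is_binary = looks_binary(pdf_like_path)
print(html_is_binary, pdf_is_binary)

os.remove(html_path)
os.remove(pdf_like_path)
```

A heuristic like this can be fooled (a PDF whose first kilobyte happens to be pure ASCII would pass as text), so treat it as an illustration only.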
Extracting Text
Extracting text from HTML files is fairly easy. webbrowser can open your file in the browser if you just want to look at it, and urllib can fetch a page's HTML from a URL, but neither one strips the tags; for that you need an HTML parser. For more info, refer to the answers here: Extracting text from HTML file using Python
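If you'd rather not install anything, one possible approach is the standard library's html.parser module. This is only a minimal sketch: the `TextExtractor` class and `html_to_text` function are names I made up, and real-world pages usually need more care (whitespace handling, hidden elements), which is why the linked answers often suggest a third-party parser instead.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script>/<style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that is outside script/style
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
result = html_to_text(sample)
print(result)
```

To use it on a file, read the file into a string first and pass that string to html_to_text.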
For PDF files, you can use a Python module called PyPDF2. Install it using pip:
$ pip install PyPDF2
and get started.
Here is an example of a simple program I found on the internet:
import PyPDF2

# These names are the PyPDF2 3.x API; the older PdfFileReader, numPages,
# getPage and extractText were removed in 3.0.
# open the PDF in binary mode; the with block closes it automatically
with open('example.pdf', 'rb') as pdf_file:
    # create a PDF reader object
    reader = PyPDF2.PdfReader(pdf_file)
    # print the number of pages in the PDF
    print(len(reader.pages))
    # get the first page
    page = reader.pages[0]
    # extract and print the text of that page
    print(page.extract_text())
Extracting Source Code
Extracting the source code is best done using Python's open function, as you did above.
For HTML files, you can do just what you did with text files:
with open("c:\\path\\to\\file") as file:
    print(file.read())
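One thing that tends to bite when reading HTML source is the encoding, so here is a small sketch that states it explicitly. The `read_source` helper and the page.html file name are made up for the example, and UTF-8 is an assumption; your file may use something else.

```python
import tempfile
from pathlib import Path


def read_source(path, encoding="utf-8"):
    """Return a file's markup as a string (hypothetical helper).

    errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    """
    return Path(path).read_text(encoding=encoding, errors="replace")


# Create a throwaway HTML file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"  # hypothetical file name
    page.write_text("<h1>Title</h1>\n<p>Hello</p>", encoding="utf-8")
    source = read_source(page)

print(source)
```

This prints the tags as well as the text, which is the difference between extracting the source and extracting the text.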
For PDF files, you do pretty much the same, but pass the mode as a second argument to the open function: "rb" reads the raw bytes and "wb" writes them back. For more info, visit the sites in the More Info section.
with open("c:\\path\\to\\file.extension", "rb") as file:  # "rb" = read binary
    data = file.read()  # raw bytes; most of them will not be readable text
print(data[:20])  # the first few bytes of the file
# File objects have no save() method; to save, write the bytes to a new file
with open("c:\\path\\to\\save\\in.extension", "wb") as out:
    out.write(data)
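As a rough illustration of working with a PDF's raw bytes, the sketch below checks for the %PDF- signature that PDF files normally start with, then writes an unchanged copy. The helper names are my own, and the sample bytes are a stand-in, not a valid PDF.

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"  # signature PDF files normally start with


def read_pdf_bytes(path):
    """Read a file in binary mode and check that it starts like a PDF."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(PDF_MAGIC):
        raise ValueError(f"{path} does not look like a PDF")
    return data


def save_bytes(data, path):
    """Write raw bytes to a new file ("wb" = write binary)."""
    with open(path, "wb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "in.pdf")    # hypothetical file names
    duplicate = os.path.join(tmp, "out.pdf")
    save_bytes(b"%PDF-1.4\nnot a real document\n", original)

    data = read_pdf_bytes(original)
    save_bytes(data, duplicate)
    copied = read_pdf_bytes(duplicate)

print(data[:8], copied == data)
```

Editing the bytes by hand is rarely practical for PDFs, so for real changes a library such as PyPDF2 is the safer route.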
More Info