reading in multiple text file extensions .pdf, .txt and .htm

Question

I have a folder from which I want to read all the text files and put them into a corpus; however, I am only able to do it with .txt files. How can I expand the code below to read in .pdf, .htm and .txt files?

import codecs

corpus_raw = u""
for file_name in file_names:
    # use a separate name for the handle so file_name is not overwritten
    with codecs.open(file_name, "r", "utf-8") as f:
        corpus_raw += f.read()
    print("Document is {0} characters long".format(len(corpus_raw)))
    print()

For example:

with open('/data/text_file.txt', "r", encoding="utf-8") as f:
    print(f.read())

This reads in the data, and I can view it in a notebook.

with open('/data/text_file.pdf', "r", encoding="utf-8") as f:
    print(f.read())

This reads nothing.

Doing a `read` on a `.pdf` or `.html` and appending what you get to a string won't give you the results you want. They are not text files and contain far more than just a series of words. A `.pdf` is a compressed binary file from which the text will need to be extracted. For `.html`, look at the module `BeautifulSoup`. For `.pdf`s, try `textract`. — BoarGules, Jul 13 '19 at 15:38

score 1 · Accepted Answer · answered Jul 13 '19 at 16:34

There are two types of files: binary files and plain-text files. A file can be one or the other, or sometimes contain both.

HTML files are plain-text, human-readable files that you can edit by hand, but PDF files mix binary data with text, so you need special programs to read or edit them.

If you want to read from pdf or html, it's possible. I wasn't sure if you meant to extract the text, or to extract the source code, so I'll provide explanations to both.

Extracting Text

Extracting text can be done easily for html files: read the file as ordinary text, then strip the markup with an HTML parser such as BeautifulSoup or the standard library's html.parser. For more info, refer to the answers here: Extracting text from HTML file using Python
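
As a rough illustration of pulling the visible text out of an .htm file, here is a minimal sketch that uses only the standard library's html.parser. The names TextExtractor and html_to_text are made up for this example, and skipping script and style blocks is an assumption about what counts as "text"; BeautifulSoup would handle messy markup more gracefully.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (hypothetical helper)."""

    # Assumption: the contents of these tags are not wanted in the corpus.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(markup):
    """Return the text of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

You would read the .htm file as you already do for .txt files and pass the resulting string to html_to_text. Joining the pieces with a single space is a simplification, so the original line breaks and spacing are not preserved.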

For pdf files, you can use a Python module called PyPDF2. Install it using pip ($ pip install PyPDF2) and get started. Here is an example of a simple program I found on the internet:

import PyPDF2

# Note: written for PyPDF2 3.x, where the older names PdfFileReader, numPages,
# getPage and extractText are no longer available.

# opening the pdf file in binary mode; the with block closes it afterwards
with open('example.pdf', 'rb') as pdfFileObj:

    # creating a pdf reader object
    pdfReader = PyPDF2.PdfReader(pdfFileObj)

    # printing number of pages in pdf file
    print(len(pdfReader.pages))

    # creating a page object
    pageObj = pdfReader.pages[0]

    # extracting text from page
    print(pageObj.extract_text())
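
To tie this back to the question of building one corpus from a folder, here is a sketch that picks a reader function based on the file extension. The names read_text, read_pdf, READERS and build_corpus are hypothetical, and the PDF branch assumes PyPDF2 3.x is installed (it is imported only when a .pdf is actually encountered).

```python
from pathlib import Path


def read_text(path):
    """Read a plain-text file as UTF-8."""
    return Path(path).read_text(encoding="utf-8")


def read_pdf(path):
    """Extract text from every page of a PDF (assumes PyPDF2 3.x)."""
    from PyPDF2 import PdfReader  # third-party, imported lazily

    reader = PdfReader(str(path))
    # extract_text() may yield nothing for image-only pages, hence the fallback.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Maps a lower-cased file extension to the function that reads it.
READERS = {".txt": read_text, ".pdf": read_pdf}


def build_corpus(folder, readers=READERS):
    """Concatenate the text of every supported file in the folder."""
    parts = []
    for path in sorted(Path(folder).iterdir()):
        reader = readers.get(path.suffix.lower())
        if reader is None:
            continue  # unsupported extension, skip it
        parts.append(reader(path))
    return "\n".join(parts)
```

Support for .htm and .html can be added by registering another function in the mapping, for example one that reads the file as text and then strips the tags with an HTML parser.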

Extracting Source Code

Extracting source code is best done using Python's open function, as you did above. For html files, you can do exactly what you did with text files. Or, more simply:

with open("c:\\path\\to\\file", encoding="utf-8") as file:
    print(file.read())

For pdf files, you do pretty much the same, but open the file in binary mode by passing "rb" as the mode parameter of the open function, so that Python returns the raw bytes instead of trying to decode them as text. For more info, visit the sites in the More Info section.

file = open("c:\\path\\to\\file.extension", "rb") # binary mode: the contents come back as raw bytes, not decoded text
data = file.read()
print(data[:5]) # a PDF normally starts with the signature b'%PDF-'
file.close()

More Info

score -1 · Answer 2 · edited Jul 13 '19 at 15:50

This should work for htm/html files with no problem - they are basically just text files. Above, I only see that reading in .pdf has failed - was there a problem with .htm?

Also, reading in a .pdf may be much more difficult/involved than you think. A pdf contains a lot more information than just plaintext, and cannot be meaningfully edited in, say, notepad. As an example of what I mean, here's a small sample of what I got when I opened a .pdf in notepad:

%PDF-1.7
%âãÏÓ
1758 0 obj
<</Filter/FlateDecode/First 401/Length 908/N 51/Type/ObjStm>>stream
hÞ”ØQk\7à¿2ÍK,i4
Cã(Á”¾•–öâ.Ýn‚w]òó3rm˜Ÿ =ÄÜÝèÎ‘®?ÉÍ…e¦ê?Å/2e¥ÂJÙˆ+SÉT«ù7$"T„ZËT”´ù2£®L~©¯fÊ©±É–iÌ(¦ÄF¹&OðÑ’Œ|hnžU}Žñ¾®ûDOÉæCÄç'¿IF¸/±Å¿”±/ÿ!¾›Ú˜Æ>¤ùeiêóuÚ3õ®äUÌ˜Ô·’Ìhì´$!_Êœ3©oúaÇÖÅÏç·rGòuê‡Gé¾é>Žà›ì¾õä›ò£Õì›ðÑµx¨ùQXÇ3ð'åC=ªJÃ6óç:¯ÖýÂ—ZòóúI¹ù…Ÿ3—ñ$<Éw‘èÍ›«›/Ç³/¸z¿¿?Ço'ÑoW¿îÆõXçŸ®¯}Ý»ítþ#?~ö¥ç_ü”×éÓÕÇíÛyü6Ç÷·»ûÍ‡åòøé÷ýù°ýôöá´?n§}8ž·Ãa·ÿÜ>ßÞo‡ý¿§Wat£õ…Ñ~ûÏ[ýQÌÍß»¯çížRŽI
$L’ù¤“úËI%Ã$OâTHb˜dóI5&$(éé´SI“€ˆE”-&Š("4&E”=$1ÁPDYa1   ˆ`(‚çEä“€†"x^DŽÁ@C</"ÇŽ` ¢B</"ÇŽ¨@D…"x^DŽQˆ
EÔ±#*Q¡ˆº "vD"*QDÄŽ¨@„@uADì"Š¨"bG!P„Ì‹(±#ˆ(BæE”ØD!ó"Jì"!ó"JìˆD4(BæE”Ø
ˆhPD[;¢
Šh"bG4 ¢AmADìˆD(ÑDÄŽP B¡ˆ¶ "v„
Eè¼Ž¡@„B:/‚cG(¡P„Î‹àØ
Dt(BçEpìˆDt(BçEpìˆDt(¢/ˆˆÑˆEô±#:Ñ¡ˆ¾ "vD"Šè"bGaPD_;Â€ƒ"l^Da@„A6/¢ÆŽ0 Â ›QcG1Þ¡¨y5–DN    eA6¢Ö‹¬‚² ‹ç#O…ÉEzQ•ð›ª´@£]„¡wU ¿¬J:ô"ñPüŸÑçSÿ(íÃñ¯íÛÿA?û°§7¿8ìBÀawü‡nww›ßû]€ %“xw
endstream
endobj
1759 0 obj
<</Filter/FlateDecode/First 1907/Length 3450/N 200/Type/ObjStm>>stream

There are, however, options. I would suggest reading the page at https://www.geeksforgeeks.org/working-with-pdf-files-in-python/ as a starting point.
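
The first line of the dump above, %PDF-1.7, is the file's signature. As a small sketch, that signature can be checked in binary mode to decide whether a file needs a PDF extractor instead of a plain read. The name looks_like_pdf is made up for this example, and it assumes the signature sits at the very start of the file, as it does in the sample.

```python
PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file begins with the PDF signature bytes."""
    # Binary mode avoids any attempt to decode the contents as text.
    with open(path, "rb") as handle:
        return handle.read(len(PDF_SIGNATURE)) == PDF_SIGNATURE
```

A check like this is more reliable than trusting the file extension alone, since a file can be renamed without its contents changing.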

As insightful as this answer might be, it still doesn't solve the problem presented, but rather gives a direction toward a solution. Content-wise, this should be converted to a comment as it is simply not an *answer*... — Tomerikoo, Jul 13 '19 at 16:10
