
I'm using Python (I'm open to other languages like Java or C++; I just arbitrarily chose Python) to web-scrape PDFs from URLs. I have no problem making GET requests or acquiring the binary data, but to convert it to usable text I'm using PyPDF2. The problem is that I need to read several hundred files, and writing each PDF to a file on disk and then reading it back is extremely slow; the process takes over three minutes each time.

I tried to use StringIO, but I had problems installing the module, and it seems like it's really outdated. Ideally I'd like a module that can convert the raw binary data from a GET request into meaningful text. Does anyone know of a module like that?

Edit: The question was closed because a 14-year-old article was linked as answering my question, but it did not. tripleee answered my question, and I was successful using Python's built-in `io` module.
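For reference, here is a minimal sketch of the in-memory approach tripleee's comment below points to: wrapping the response bytes in `io.BytesIO` so PyPDF2 can read the PDF without writing anything to disk. The URL is a placeholder, and the API assumed here (`PdfReader`, `extract_text`) is the PyPDF2 2.x/3.x naming; older releases expose `PdfFileReader` instead.

```python
import io

import requests
from PyPDF2 import PdfReader  # PyPDF2 >= 2.0; older versions use PdfFileReader

url = "https://example.com/some-document.pdf"  # placeholder URL

# Download the PDF as raw bytes (nothing is written to disk)
response = requests.get(url)
response.raise_for_status()

# Wrap the bytes in an in-memory binary stream and hand it to PyPDF2
reader = PdfReader(io.BytesIO(response.content))

# Extract the text page by page
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```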

Zayenus
  • Welcome to Stack Overflow. Please check out the [tour], [opinion-based questions](https://meta.stackexchange.com/q/201994), and [ask] – Marcelo Paco Apr 13 '23 at 04:56
  • `StringIO` is part of the standard library; you should not need to install it separately at all. It was refactored into [the `io` module](https://docs.python.org/3/library/io.html) in Python 3, though; were you somehow trying to use the legacy Python 2 module? For binary data, you probably actually want the `BytesIO` class. – tripleee Apr 13 '23 at 05:08
  • "The question was closed because a 14 year old article was linked as answering my question, but it did not. tripleee answered my question, and I was successful with using the native Python IO module." – what? The comment by tripleee is essentially the same as the linked article. tripleee first commented and 3 minutes later found the duplicate. He certainly wouldn't have closed it as a duplicate if the answers over there didn't match his comment. – Thomas Weller Apr 14 '23 at 08:03

0 Answers