I was given help with the splitlines() function, which worked perfectly on string output that wasn't separated by page numbers; see How to Create Spark or Pandas Dataframe from str output in Apache Spark on Databricks

I am now using str_output = result.pages as opposed to str_output = result.content

Now, when I execute

df_data = pd.DataFrame({'ColumnA':str_output.splitlines()})
df_data

I get the following error:

AttributeError: 'list' object has no attribute 'splitlines'

I think it's because of the way that I'm using the splitlines function, but I'm not sure.

Any help appreciated

I should show the full code, see below:

import pandas as pd
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# field_list = ["result.content"]

document_analysis_client = DocumentAnalysisClient(
endpoint=endpoint, credential=AzureKeyCredential(key)
)

for blob in container.list_blobs():
  blob_url = container_url + "/" + blob.name
  poller = document_analysis_client.begin_analyze_document_from_url(
            "prebuilt-read", blob_url)
  result = poller.result()
  print("Scanning " + blob.name + "...")
  print ("document contains", result.content)

myoutput = result.pages

df_data = pd.DataFrame({'RAWTEXT':myoutput.splitlines()})
df_data

As requested, a sample of the data is as follows:

Scanning 05Jul11 Raet Prelim.pdf... document contains PRELIMINARY REPORT RAET HOLDING B.V. 5 JULY 2011 1 RæT CONTENTS 1 INVESTMENT PROPOSAL ............................................................................................................ 5 1.1 Background to business................................................................................................................ 5 1.2 Process ........................................................................................................................................ 6 1.2.1 Overview .............................................................................................................................. 6 1.2.2 Due Diligence ....................................................................................................................... 7 1.2.3 Banking / Financing .............................................................................................................. 8 1.2.4 Proposed Tactics / Recommendation .................................................................................... 8 1.3 Investment Overview .................................................................................................................... 9 1.3.1 Investment thesis .................................................................................................................. 9 1.3.2 Business Strengths ............................................................................................................... 9 1.3.3 Investment Case Returns .....................................................................................................11 1.4 Key judgment calls ......................................................................................................................12 1.5 Recommendation ........................................................................................................................18 2 MARKET AND BUSINESS

Patterson
  • The issue is that you're expecting `str_output` to be a string, but it's actually a list. You probably want a for loop like `for page in result.pages:` and to use `page.splitlines()` rather than `str_output.splitlines()`. Inserting a `print(type(str_output))` might also clarify things. – Sarah Messer Jun 08 '22 at 14:09
  • Hi Sarah, thanks so much for reaching out. I should point out that my coding skills aren't as advanced as yours. I have updated the question with the full code. If you could show me where I ought to make the amendments, that would be most helpful. Sorry for being lazy, but I need to produce some results quickly for my manager – Patterson Jun 08 '22 at 14:16

1 Answer

Here str_output is a list, while splitlines() is a method of string objects. If you pass str_output directly as the value in the dictionary, you shouldn't get this error.

df_data = pd.DataFrame({'ColumnA':str_output})

If this doesn't help then please put a sample of the data in str_output in the question.
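
If what you actually want is one row per line of text rather than one row per page object, a rough sketch is below. It assumes each item in result.pages has a page_number attribute and a lines list whose items have a content string, which is how I understand the Form Recognizer page objects to look. The pages_to_rows helper and the fake_pages stand-ins are made up for illustration so the snippet runs without calling the service.

```python
from types import SimpleNamespace


def pages_to_rows(pages):
    """Flatten a list of page objects into one dict per line of text.

    Assumes each page has `page_number` and `lines`, and each line has
    `content` (an assumption about the shape of result.pages).
    """
    rows = []
    for page in pages:
        for line in page.lines:
            rows.append({"PAGE": page.page_number, "RAWTEXT": line.content})
    return rows


# Hypothetical stand-ins for result.pages, so no Azure call is needed here.
fake_pages = [
    SimpleNamespace(
        page_number=1,
        lines=[
            SimpleNamespace(content="PRELIMINARY REPORT"),
            SimpleNamespace(content="RAET HOLDING B.V."),
        ],
    ),
    SimpleNamespace(page_number=2, lines=[SimpleNamespace(content="CONTENTS")]),
]

rows = pages_to_rows(fake_pages)
for row in rows:
    print(row["PAGE"], row["RAWTEXT"])
```

With the real result, something like df_data = pd.DataFrame(pages_to_rows(result.pages)) should then give a PAGE column and a RAWTEXT column. Note that in the question's code, result is only kept from the last blob in the loop, so the DataFrame step would need to move inside the loop if you want every document.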

Zero