AttributeError: 'list' object has no attribute 'splitlines' when converting code to Pandas Dataframe using function splitlines()

Question

I was given help with the splitlines() function, which worked perfectly on string output that wasn't separated by page numbers; see How to Create Spark or Pandas Dataframe from str output in Apache Spark on Databricks

I am now using str_output = result.pages as opposed to str_output = result.content

Now, when I execute

df_data = pd.DataFrame({'ColumnA':str_output.splitlines()})
df_data

I get the following error:

AttributeError: 'list' object has no attribute 'splitlines'

I think it's because of the way that I'm using the splitlines function, but I'm not sure.

Any help appreciated

I should show the full code, see below:

import pandas as pd
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# field_list = ["result.content"]

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

for blob in container.list_blobs():
  blob_url = container_url + "/" + blob.name
  poller = document_analysis_client.begin_analyze_document_from_url(
            "prebuilt-read", blob_url)
  result = poller.result()
  print("Scanning " + blob.name + "...")
  print ("document contains", result.content)

myoutput = result.pages

df_data = pd.DataFrame({'RAWTEXT':myoutput.splitlines()})
df_data

As requested, a sample of the data is as follows:

Scanning 05Jul11 Raet Prelim.pdf... document contains PRELIMINARY REPORT RAET HOLDING B.V. 5 JULY 2011 1 RæT CONTENTS 1 INVESTMENT PROPOSAL ............................................................................................................ 5 1.1 Background to business................................................................................................................ 5 1.2 Process ........................................................................................................................................ 6 1.2.1 Overview .............................................................................................................................. 6 1.2.2 Due Diligence ....................................................................................................................... 7 1.2.3 Banking / Financing .............................................................................................................. 8 1.2.4 Proposed Tactics / Recommendation .................................................................................... 8 1.3 Investment Overview .................................................................................................................... 9 1.3.1 Investment thesis .................................................................................................................. 9 1.3.2 Business Strengths ............................................................................................................... 9 1.3.3 Investment Case Returns .....................................................................................................11 1.4 Key judgment calls ......................................................................................................................12 1.5 Recommendation ........................................................................................................................18 2 MARKET AND BUSINESS

The issue is that you're expecting `str_output` to be a string, but it's actually a list. You probably want a for loop like `for page in result.pages:` and to use `page.splitlines()` rather than `str_output.splitlines()`. Inserting a `print(type(str_output))` might also clarify things. — Sarah Messer, Jun 08 '22 at 14:09
Hi Sarah, thanks so much for reaching out. I should point out that my coding skills aren't as advanced as yours. I have updated the question with the full code. If you could show me where I ought to make the amendments, that would be most helpful. Sorry for being lazy, but I need to produce some results quickly for my manager – Patterson, Jun 08 '22 at 14:16

score 0 · Answer 1 · answered Jun 08 '22 at 14:17

Here str_output is a list, while splitlines() is a method of string objects. If you pass str_output directly as a value in the dictionary, without calling splitlines(), you shouldn't face this error.

df_data = pd.DataFrame({'ColumnA': str_output})

If this doesn't help then please put a sample of the data in str_output in the question.
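
To see why the error appears, here is a minimal sketch using stand-in values rather than a real Form Recognizer result. The names `content` and `pages` are placeholders for `result.content` (a string) and `result.pages` (a list); the sample text is only illustrative.

```python
# Stand-in values; the real ones would come from poller.result().
content = "PRELIMINARY REPORT\nRAET HOLDING B.V.\n5 JULY 2011"  # like result.content
pages = ["page 1 object", "page 2 object"]                       # like result.pages

# Checking the type first shows which methods are available.
print(type(content).__name__)
print(type(pages).__name__)

# splitlines() is defined on str, so this works.
lines = content.splitlines()
print(lines)

# A list has no splitlines() method, so this raises AttributeError.
try:
    pages.splitlines()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

A list can be passed to the DataFrame constructor as it is; splitlines() is only needed when the value is a single string.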

Answered by Zero

Zero, I did the following df_data = pd.DataFrame({'RAWTEXT':myoutput}) and I got the following output ```0 DocumentPage(kind=document, page_number=1, ang... 1 DocumentPage(kind=document, page_number=2, ang...``` – Patterson Jun 08 '22 at 14:20
Any further help much appreciated. – Patterson Jun 08 '22 at 14:20
@Patterson Please add this to the question for more clarity. – Zero Jun 08 '22 at 14:23
Zero, I have added some sample output. Just so you know, there are 45 pages in total for this particular document – Patterson Jun 08 '22 at 14:31
Hi, did my sample help? Or make things worse? – Patterson Jun 08 '22 at 14:45
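
The DocumentPage(...) output in the comments suggests that each element of result.pages is an object, not text, so the text has to be pulled out of each page. Below is a rough sketch of one way to do that. It assumes each page exposes `page_number` and a `lines` list whose items have a `content` string; that should be checked against the installed azure-ai-formrecognizer version. `FakeLine`, `FakePage` and `pages_to_rows` are hypothetical names used only so the example runs without an Azure call.

```python
from dataclasses import dataclass, field
from typing import List


# Stand-ins for the SDK objects (assumption: the real DocumentPage has
# `page_number` and `lines`, and each line has a `content` string).
@dataclass
class FakeLine:
    content: str


@dataclass
class FakePage:
    page_number: int
    lines: List[FakeLine] = field(default_factory=list)


def pages_to_rows(pages):
    """Flatten a list of page objects into one dict per text line."""
    rows = []
    for page in pages:
        for line in page.lines:
            rows.append({"PAGE": page.page_number, "RAWTEXT": line.content})
    return rows


# Illustrative data only, loosely based on the sample in the question.
sample_pages = [
    FakePage(1, [FakeLine("PRELIMINARY REPORT"), FakeLine("RAET HOLDING B.V.")]),
    FakePage(2, [FakeLine("CONTENTS")]),
]
rows = pages_to_rows(sample_pages)
print(rows)
```

If the assumption about the page attributes holds, `pd.DataFrame(pages_to_rows(result.pages))` should give one row per line of text, with the page number in its own column. Note that in the question's code `result` is read after the loop ends, so only the last blob would be converted.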
