As an absolute novice in Python, I relied on this thread to build a script that reads CSV data. The file content looked like this:

1,2,3

The file was created in MS Excel and edited in Notepad++.

The code used to read it came in two variants:

Variant 1:

import pandas as pd
url='https://drive.google.com/file/d/1WgaA_dIHYm3ogCUE4WDu1ocwgQSZ1Wc7/view?usp=sharing'
file_id = url.split('/')[-2]
dwn_url='https://drive.google.com/uc?id=' + file_id 
df = pd.read_csv(dwn_url)
print(df.head())

Variant 2:

import pandas as pd
import requests
from io import StringIO

url='https://drive.google.com/file/d/1WgaA_dIHYm3ogCUE4WDu1ocwgQSZ1Wc7/view?usp=sharing'

file_id = url.split('/')[-2]
dwn_url='https://drive.google.com/uc?export=download&id=' + file_id
url2 = requests.get(dwn_url).text
csv_raw = StringIO(url2)
df = pd.read_csv(csv_raw)
print(df.head())
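Both variants derive the download URL the same way, by pulling the file ID out of the sharing link. A minimal sketch of that step as a reusable helper, assuming the sharing link has the `/file/d/<id>/view` shape shown above; the function names `drive_file_id` and `drive_download_url` are hypothetical, and the `uc?export=download` endpoint is simply the one used in Variant 2:

```python
from urllib.parse import urlparse


def drive_file_id(share_url: str) -> str:
    """Extract the file ID from a link shaped like .../file/d/<id>/view."""
    parts = urlparse(share_url).path.split("/")
    try:
        # The ID is the path segment that follows the literal "d" segment.
        file_id = parts[parts.index("d") + 1]
    except (ValueError, IndexError):
        file_id = ""
    if not file_id:
        raise ValueError(f"No file ID found in {share_url!r}")
    return file_id


def drive_download_url(share_url: str) -> str:
    """Build the direct-download URL used in Variant 2 (hypothetical helper)."""
    return "https://drive.google.com/uc?export=download&id=" + drive_file_id(share_url)


share_url = "https://drive.google.com/file/d/1WgaA_dIHYm3ogCUE4WDu1ocwgQSZ1Wc7/view?usp=sharing"
print(drive_download_url(share_url))
```

Unlike `url.split('/')[-2]`, this does not depend on how many segments follow the ID, and it fails with a clear message when the link has a different shape.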

Both variants raise the same well-known error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 90 fields in line 3, saw 217

Using read_csv(dwn_url, on_bad_lines='skip'), I got the following output:

<!doctype html><html lang="en-US" dir="ltr"><head><base href="https://accounts.google.com/v3/signin/"><meta name="referrer" content="origin"><link rel="canonical" href="https://accounts.google.com/v3/signin/identifier"><meta name="viewport" content="width=device-width  ... 0.149);border-radius:2px;bottom:0;content:"";left:0;position:absolute;right:0;top:0;z-index:-1}.JVMrYb{display:block}.hJIRO{display:none}.sQecwc{display:hidden}sentinel{}
0  /*# sourceURL=/_/mss/boq-identity/_/ss/k=boq-i...                                                                                                                                                                                                                            ...                                                NaN                                                                                                                        
1             Copyright The Closure Library Authors.                                                                                                                                                                                                                            ...                                                NaN                                                                                                                        
2                SPDX-License-Identifier: Apache-2.0                                                                                                                                                                                                                            ...                                                NaN                                                                                                                        
3                                                 */                                                                                                                                                                                                                            ...                                                NaN                                                                                                                        
4  'use strict';var d=function(a){var b=0;return ...                                                                                                                                                                                                                            ...                                                NaN                                                                                                                        

[5 rows x 90 columns]

Most importantly, the same output is produced when a completely empty CSV is imported, which makes me believe the CSV contains some technical data that pandas reads (and so gets the wrong column count from the very beginning).
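The output above starts with `<!doctype html>` and points at `accounts.google.com/v3/signin`, which suggests the request returned a Google sign-in page rather than the CSV file; that would also explain why an empty CSV gives the same result. A minimal sketch of a guard that checks the downloaded text before it reaches `read_csv`; the function names `looks_like_html` and `require_csv` are hypothetical, and the check is a simple heuristic, not a full content-type validation:

```python
def looks_like_html(payload: str) -> bool:
    """Return True when the payload starts like an HTML page rather than CSV."""
    # Drop a possible byte-order mark and leading whitespace before checking.
    head = payload.lstrip("\ufeff \t\r\n").lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def require_csv(payload: str) -> str:
    """Return the payload unchanged, or raise if it looks like an HTML page."""
    if looks_like_html(payload):
        raise ValueError(
            "Received an HTML page (possibly a sign-in page) instead of CSV data"
        )
    return payload


# Literal strings stand in for downloaded text here.
print(looks_like_html("1,2,3\n"))                               # False
print(looks_like_html('<!doctype html><html lang="en-US">'))    # True
```

In Variant 2, this could wrap the downloaded text, for example `csv_raw = StringIO(require_csv(url2))`, so that an access problem surfaces as a clear error instead of a tokenizing error.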

Does anyone in the community have an idea how to resolve this issue?

441109
  • You're **not** using the Google Drive API, for starters: [python download google drive](https://developers.google.com/drive/api/guides/manage-downloads#python) – Linda Lawton - DaImTo Sep 02 '22 at 08:59
  • Thanks, this appears to be a logical suggestion. However, I ran into the standard issue of scopes for an app, and the information provided in the [Authenticate your users](https://developers.google.com/drive/api/guides/about-auth) section of the documentation concentrates more on requesting restricted scopes than on a way to compare the scopes required by the file with the ones set through the CLI for the application. Have you encountered a discussion or description of this matter with respect to using download_file.py? – 441109 Sep 02 '22 at 16:57
  • If you, the developer, control the file, just use a service account; you won't have to worry about requesting permission from the user. – Linda Lawton - DaImTo Sep 02 '22 at 19:45
  • Thanks. I am in control of the CSV file: it is uploaded to the root folder of my Google Drive. The account is authorized through CLI auth login. Using the download_file.py suggested in the documentation with `creds, _ = google.auth.default()` unchanged, I run into "Insufficient Permission: Request had insufficient authentication scopes." The same happens when I try `creds, _ = google.auth.default(['https://www.googleapis.com/auth/drive.file'])`. It seems I am getting some fundamental part of this process wrong. – 441109 Sep 04 '22 at 22:28
  • You're not authorized to access the file. You need to work out authorization first. – Linda Lawton - DaImTo Sep 05 '22 at 05:56
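The service-account route suggested in the comments can be sketched roughly as follows. This assumes the `google-auth` and `google-api-python-client` packages are installed, that a service-account key file exists (the path below is hypothetical), and that the CSV has been shared with the service account's email address. The read-only scope is also an assumption: as far as the scope descriptions go, `drive.file` covers only files the app itself created or opened, so a file uploaded through the web UI may need `drive.readonly` instead.

```python
import io

# Assumed scope: read-only access to the Drive files visible to the credentials.
READONLY_SCOPE = "https://www.googleapis.com/auth/drive.readonly"


def build_drive_service(key_path: str):
    """Create a Drive v3 client from a service-account key file."""
    # Third-party imports stay inside the function so this module still
    # imports where the Google client libraries are not installed.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        key_path, scopes=[READONLY_SCOPE]
    )
    return build("drive", "v3", credentials=creds)


def fetch_file_bytes(service, file_id: str) -> bytes:
    """Download the raw content of a Drive file as bytes."""
    # get_media requests the file content itself rather than its metadata.
    return service.files().get_media(fileId=file_id).execute()


# Usage (not executed here; the key file path is hypothetical):
# import pandas as pd
# service = build_drive_service("service-account.json")
# raw = fetch_file_bytes(service, "1WgaA_dIHYm3ogCUE4WDu1ocwgQSZ1Wc7")
# df = pd.read_csv(io.BytesIO(raw))
```

For large files, the chunked `MediaIoBaseDownload` approach from the linked download guide is probably the better fit; the single `execute()` call here keeps the sketch short.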

1 Answer

Try specifying the delimiter and header when using read_csv, as described in this answer.

  • Actually, I forgot to mention that pd.read_csv(dwn_url, header=None) and pd.read_csv(dwn_url, sep=',') were tried and the result did not change. Strictly speaking, I would expect some other error to be reported when reading an empty file (no delimiter and no header to speak of). – 441109 Sep 02 '22 at 09:23