
I'm working with a CSV that I'm fetching online with `requests.get`, so for context this is how the file is being uploaded:

import pandas as pd
import requests

comments = []
body = requests.get()
for comment in body:
    comments.append([
        str(body['data']['body']).encode(encoding='utf-8')
    ])
df = pd.DataFrame(comments)[0]
requests.put('http://sample/desination.csv', data=df.to_csv(index=False))

The encoding when appending to `comments` is required because the data defaulted to latin-1 and requests expects utf-8.

The resulting csv contains 1 column with rows like: b'Presicely'

Makes sense: encoding to utf-8 converted the string to the `bytes` type.
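For what it's worth, the stuck `b'...'` prefix can be reproduced in isolation; this is a minimal sketch of what happens when a `bytes` value is later stringified (as `to_csv()` does to every cell):

```python
# Once bytes pass through str(), the b'...' repr becomes
# part of the text itself and decoding can no longer remove it.
raw = "Precisely"
encoded = raw.encode("utf-8")   # bytes: b'Precisely'
cell = str(encoded)             # str: "b'Precisely'" -- repr baked in

print(cell)                   # b'Precisely'
print(cell.startswith("b'"))  # True
```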

Now, when I later try to decode the CSV, I have the following:

import requests

data = requests.get('http://destination.csv').content
testdata = data.decode('utf-8').splitlines()
print(testdata[2])

b'Presicely'

If I don't decode:

print(data[1:20])

b'Presicely'\r\n

I was under the impression that decoding the data would eliminate the b prefixes, as most Stack Overflow answers suggest. The problem could be with how I initially upload the CSV, so I've tried tackling that a few different ways with no luck (I can't get around encoding it).

Any suggestions?

P.S. python version 3.7.7

Edit: I ended up having no luck getting this to work. `DataFrame.to_csv()` returns a string, and as lenz pointed out, the conversion to string type is likely the culprit.

Ultimately I ended up saving the data as a .txt to eliminate the need to call `to_csv()`, which let my decode work as expected, confirming our suspicion. The txt file format works for me, so I'm keeping it that way.

  • Probably there's an (implicit) `str` call somewhere, so the values really are `"b'Precisely'"` and `"b'Precisely'\r\n"`. – lenz Jul 30 '20 at 07:10
  • By serialising a list of bytes objects (rather than first serialising, then encoding the whole dump), you probably need to also decode each cell individually too. – lenz Jul 30 '20 at 07:11
  • `df.to_csv(encoding='utf-8')`? – snakecharmerb Jul 30 '20 at 08:20
  • @snakecharmerb just tried doing this both with/without decoding the body but the results were the same. – AlwaysLearning Jul 30 '20 at 12:19
  • @lenz you're right in that `to_csv` returns a `str` object, so that may be where the problem lies. However when I try to decode the entire body as such: `datadf = pd.read_csv(io.StringIO(data.decode('utf-8')))` I can then fetch a cell: `testdata = datadf.iloc[1,0]` but then that cell is already a string which can't be further decoded. Are you suggesting I convert it to another type to decode it further, on each row? – AlwaysLearning Jul 30 '20 at 12:28
  • I'm not sure what to do. But once you call `str()` on a bytes object without an `encoding=` parameter, you get to a representation like `"b'...'"`, which is not easily reverted, so you need to find a way to avoid this. Encoding individual cells doesn't seem promising to me. – lenz Jul 30 '20 at 14:53
  • Possibly relevant https://stackoverflow.com/a/55898249/5320906 – snakecharmerb Aug 01 '20 at 08:55

1 Answer


I was able to get this to work, credit to my irl friend who rubber-ducked me through the solution. It was quite simple: what I needed to do was encode the string returned by the `to_csv()` function, like so:

comments = []
body = requests.get()
for comment in body:
    comments.append([
        str(body['data']['body'])
    ])
df = pd.DataFrame(comments)[0]
csv_data = df.to_csv(index=False)
csv_data = csv_data.encode('utf-8')
requests.put('http://sample/desination.csv', data=csv_data)

I'm sure you can compress the above code, either by passing the encoding to the `to_csv()` function as a flag or by chaining `.encode()` onto its result.
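The chained form can be sketched like this (a placeholder Series stands in for the real comment data, since the original fetch URL isn't shown):

```python
import pandas as pd

# Placeholder data standing in for the scraped comments.
df = pd.Series(["Precisely", "Another comment"], name="body")

# to_csv() returns a str; chaining .encode() yields utf-8 bytes,
# which is the body type requests.put(url, data=...) accepts.
payload = df.to_csv(index=False).encode("utf-8")

print(type(payload) is bytes)  # True
```

The payload can then be uploaded as before with `requests.put('http://sample/desination.csv', data=payload)`.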

The resulting file uploaded can now be decoded properly and you can keep your csv format.