I'm retrieving a file from GitLab (which sends files base64 encoded) that contains the tree seen below:

$ tree  
.  
├── LICENSE.md  
├── README.md  
├── manifest.yaml  
├── src  
│   ├── workflow.yaml  
│   ├── workflow.json  
│   └── resources  
│       ├── action1.yaml  
│       ├── action2.yaml  
│       ├── subworkflow.yaml  
│       └── template.yaml  
└── install  
    └── workflow.zip  

How can I decode this and keep the special characters (└,├,─)?

I've tried straight decoding and converting the bytes to string:

def decode(data):
  decoded = base64.b64decode(data)
  return "".join(chr(x) for x in bytearray(decoded))

That gives the wrong characters:

$ tree
.
â??â??â?? LICENSE.md
â??â??â?? README.md
â??â??â?? manifest.yaml
â??â??â?? src
â??   â??â??â?? workflow.yaml
â??   â??â??â?? workflow.json
â??   â??â??â?? resources
â??       â??â??â?? action1.yaml
â??       â??â??â?? action2.yaml
â??       â??â??â?? subworkflow.yaml
â??       â??â??â?? template.yaml
â??â??â?? install
    â??â??â?? workflow.zip

Then I tried converting the input to UTF-8 bytes first, but that prints the special characters as question marks:

def decode_data(b64_data):
    b64_bytes = b64_data.encode('utf-8')
    data_bytes = base64.b64decode(b64_bytes)
    return data_bytes.decode('utf-8')

What else can I try? I'm using Python 3.6.

Response from GitLab when retrieving the file:

{
    "file_name": "README.md",
    "file_path": "README.md",
    "size": 390,
    "encoding": "base64",
    "content_sha256": "b3cbc43cae23d77e09b60a2cb89ce76b969024ba6621548b6d3bc5b2b60380da",
    "ref": "master",
    "blob_id": "b7d9985ac1378310d31b9a8bccfc2cce5ad9655a",
    "commit_id": "bbc71c23657249a924174eca9b025d64b10518fc",
    "last_commit_id": "bbc71c23657249a924174eca9b025d64b10518fc",
    "content": "IyBUZXN0R2l0CgpgYGAKJCB0cmVlCi4K4pSc4pSA4pSAIExJQ0VOU0UubWQK4pSc4pSA4pSAIFJFQURNRS5tZArilJzilIDilIAgbWFuaWZlc3QueWFtbArilJzilIDilIAgc3JjCuKUgsKgwqAg4pSc4pSA4pSAIHdvcmtmbG93LnlhbWwK4pSCwqDCoCDilJzilIDilIAgd29ya2Zsb3cuanNvbgrilILCoMKgIOKUlOKUgOKUgCByZXNvdXJjZXMK4pSCwqDCoCAgICAg4pSc4pSA4pSAIGFjdGlvbjEueWFtbArilILCoMKgICAgICDilJzilIDilIAgYWN0aW9uMi55YW1sCuKUgsKgwqAgICAgIOKUnOKUgOKUgCBzdWJ3b3JrZmxvdy55YW1sCuKUgsKgwqAgICAgIOKUlOKUgOKUgCB0ZW1wbGF0ZS55YW1sCuKUlOKUgOKUgCBpbnN0YWxsCiDCoMKgIOKUlOKUgOKUgCB3b3JrZmxvdy56aXAKYGBg"
}
jeffreyb
  • We can't guess the encoding of the original file or which bytes you are reading from it. Please [edit] to show unambiguously what your input is. See https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Oct 26 '20 at 17:36
  • Where does `data` come from and where does base64 encoding fit into the picture here? Neither the input nor the output seems to contain any base64. – tripleee Oct 26 '20 at 17:37

1 Answer

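The base64 string decodes to raw bytes, and those bytes are themselves UTF-8 encoded text, so they need a second decoding step with utf-8. Building the string with chr() on each byte treats every byte as its own character, which breaks the multi-byte box-drawing characters. Decode the whole byte string instead: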

In []: import base64

In []: def decode2(data):
  ...:   decoded = base64.b64decode(data)  # base64 text -> raw bytes
  ...:   return decoded.decode()           # bytes -> str; the default codec is UTF-8
  ...:

In []: print(decode2("IyBUZXN0R2l0CgpgYGAKJCB0cmVlCi4K4pSc4pSA4pSAIExJQ0V
  ...: OU0UubWQK4pSc4pSA4pSAIFJFQURNRS5tZArilJzilIDilIAgbWFuaWZlc3QueWFtb
  ...: ArilJzilIDilIAgc3JjCuKUgsKgwqAg4pSc4pSA4pSAIHdvcmtmbG93LnlhbWwK4pS
  ...: CwqDCoCDilJzilIDilIAgd29ya2Zsb3cuanNvbgrilILCoMKgIOKUlOKUgOKUgCByZ
  ...: XNvdXJjZXMK4pSCwqDCoCAgICAg4pSc4pSA4pSAIGFjdGlvbjEueWFtbArilILCoMK
  ...: gICAgICDilJzilIDilIAgYWN0aW9uMi55YW1sCuKUgsKgwqAgICAgIOKUnOKUgOKUg
  ...: CBzdWJ3b3JrZmxvdy55YW1sCuKUgsKgwqAgICAgIOKUlOKUgOKUgCB0ZW1wbGF0ZS5
  ...: 5YW1sCuKUlOKUgOKUgCBpbnN0YWxsCiDCoMKgIOKUlOKUgOKUgCB3b3JrZmxvdy56a
  ...: XAKYGBg"))
# TestGit

```
$ tree
.
├── LICENSE.md
├── README.md
├── manifest.yaml
├── src
│   ├── workflow.yaml
│   ├── workflow.json
│   └── resources
│       ├── action1.yaml
│       ├── action2.yaml
│       ├── subworkflow.yaml
│       └── template.yaml
└── install
    └── workflow.zip
```
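
To illustrate why the first attempt in the question produced `â`-style garbage, here is a minimal sketch. It uses a short sample line of my own (not the full GitLab payload) and round-trips it through base64. Calling `chr(x)` on each byte is equivalent to decoding as Latin-1, so every 3-byte UTF-8 sequence turns into three unrelated characters, while `.decode("utf-8")` recombines them.

```python
import base64

# Hypothetical sample: one line of `tree` output, base64-encoded
# the same way the "content" field in the response is.
sample = "├── README.md"
b64 = base64.b64encode(sample.encode("utf-8")).decode("ascii")

raw = base64.b64decode(b64)

# One character per byte: the same result as decoding with Latin-1.
per_byte = "".join(chr(x) for x in raw)

# Decoding as UTF-8 recombines the multi-byte sequences.
as_utf8 = raw.decode("utf-8")

print(repr(per_byte))
print(as_utf8)
```

The second function in the question, `decode_data`, appears to do the same thing as `decode2`, which suggests the question marks there came from the output side rather than from the decoding.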

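If printing the result raises UnicodeEncodeError: 'ascii' codec can't encode characters in position ..., the decoding itself worked and the failure is on the output side: Python is encoding console output as ASCII. Set the PYTHONIOENCODING environment variable to utf-8 before running the script: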

  • Unix: export PYTHONIOENCODING=utf-8
  • Windows: set PYTHONIOENCODING=utf-8
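
If changing the environment is not an option, a similar effect can probably be achieved from inside the script by encoding the text yourself and writing bytes to the binary layer under stdout. This is only a sketch: `write_utf8` is a hypothetical helper name, and it assumes `sys.stdout.buffer` exists (true for the standard console stream, but not necessarily when stdout has been replaced) and that the terminal can render UTF-8.

```python
import sys


def write_utf8(text, stream=None):
    """Write text as UTF-8 bytes, bypassing the text stream's own encoding."""
    if stream is None:
        # Binary layer beneath sys.stdout (assumed to be available).
        stream = sys.stdout.buffer
    stream.write(text.encode("utf-8"))
    stream.write(b"\n")
    stream.flush()


if __name__ == "__main__":
    write_utf8("└── workflow.zip")
```

Passing an explicit `stream` makes it possible to send the same bytes to a file opened in binary mode.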
Czaporka
  • When I try that, I get a UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-26: ordinal not in range(128) – jeffreyb Oct 26 '20 at 19:21
  • Quite strange that it seems to be trying to encode rather than decode. Anyway, can you try `.decode("utf-8")` instead of just `.decode()`? – Czaporka Oct 26 '20 at 19:33
  • It's the same error. Is it possible this has to do with my console? I can't even paste that output – jeffreyb Oct 26 '20 at 19:43
  • According to a comment below [this answer](https://stackoverflow.com/a/9942822/14222267) yes it may be associated with your console. So perhaps you should try executing `export PYTHONIOENCODING=utf-8` prior to running the script, unless you're on Windows, in which case I think it's `set` rather than `export`. – Czaporka Oct 26 '20 at 19:49
  • That did it. Please add that to your answer and I'll mark it as good :) – jeffreyb Oct 26 '20 at 19:53
  • Great to hear that it worked for you. I've updated the answer. – Czaporka Oct 26 '20 at 20:10