Linux shell: Base64 Decode with removing line breaks

Question

I have a file where each line is a base64-encoded XML document. The decoded XML documents may contain new line characters. I would like to grep out each XML document containing a given word.

The problem is that, when I decode the lines of the file, I have multiple lines for each base64-encoded line and I cannot grep it any more. I need something like base64 decode + remove line breaks in one step.

How can I achieve that in the Linux shell? I have Python, Perl and awk available.

>cat fileContainingBase64EncodedXMLsInEachLine.txt | what should I write here?

Input:

PGZvbz4NCjxiYXIvPg0KPC9mb28+
PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==
PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+

Expected Output

Let's say I want the XML documents containing 'bar'

<foo>
<bar/>
</foo>
<foo>
<barometer/>
</foo>

An example for my problem

>cat fileContainingBase64EncodedXMLsInEachLine.txt | base64 --decode | grep bar

Delivers:

<bar/>
<barometer/>

So I do not have the full xml documents containing bar and barometer.

Your description is not clear. Please add more samples of input and output to your post, and show us what you have tried so far. – RavinderSingh13, May 23 '18 at 08:38
I hopefully made it clear now. Please give feedback if not. Thanks. – Gábor Lipták, May 23 '18 at 08:45
@Gábor: I don't understand why you think the embedded newlines prevent you from searching for the keyword you're looking for. Can you please show the code that you've written to decode the base64 data? – Borodin, May 23 '18 at 08:56
BTW, I wouldn't try doing this directly in the shell. It's possible to parse files line by line with Bash, but it's not very efficient. You'd be better off writing a small script in Python, awk, or Perl. Show us some code, with a few lines of actual input & output, explain what it's doing wrong, and we can help you fix it. – PM 2Ring, May 23 '18 at 09:01

PM 2Ring · Accepted Answer · 2018-05-23T10:26:26.220

3

Here's some Python code that accepts a filename followed by the search word on the commandline. As usual, if either arg contains spaces, it must be quoted.

import sys
from base64 import b64decode

fname, pattern = sys.argv[1:]
with open(fname) as f:
    for row in f:
        row = b64decode(row).decode()
        if pattern in row:
            print(row, end='\n\n')

Running this on your data with "bar" as the pattern arg gives:

<foo>
<bar/>
</foo>

<foo>
<barometer/>
</foo>
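
To read the base64 lines from standard input instead of a named file, the same loop can be wrapped in a function that accepts any iterable of lines. This is only a minimal sketch: the function name `matching_documents` is made up for illustration, and an `io.StringIO` holding the sample data from the question stands in for `sys.stdin` so that the example runs on its own.

```python
import io
from base64 import b64decode


def matching_documents(lines, pattern):
    """Yield each decoded document that contains `pattern` as a substring."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = b64decode(line).decode()  # assumes the payload is UTF-8 text
        if pattern in doc:
            yield doc


# Stand-in for sys.stdin: the three sample lines from the question.
sample = io.StringIO(
    "PGZvbz4NCjxiYXIvPg0KPC9mb28+\n"
    "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==\n"
    "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+\n"
)

for doc in matching_documents(sample, "bar"):
    print(doc, end="\n\n")
```

In a real pipeline you would pass `sys.stdin` instead of `sample` and take the pattern from `sys.argv[1]`.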

In order to practice my rather rusty awk skills, I decided to write an awk command line to do this. It uses the standard base64 command to do the decoding. Note that it needs GNU awk (gawk), because the `|&` coprocess operator is a gawk extension.

awk 'BEGIN{cmd="base64 -d"}; {print |& cmd; close(cmd,"to"); z=""; while(cmd |& getline s) z=z s "\n"; close(cmd); if (z~pat)print z}' pat='bar' testdata_b64.txt

You pass it the pattern using the pat argument, which can be a regex. You can send data to it via standard input, or you can give it one or more filenames on the commandline.

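Note that regex patterns need double escaping, e.g. pat='\\<bar\\>' matches the word bar. This is because awk processes backslash escapes in a command-line variable assignment before the value is used as a regex.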
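
For comparison, a whole-word match can also be done in the Python version by swapping the substring test for a regular expression. The sketch below is an illustration, not part of the original script: the name `grep_documents` is hypothetical, and `\b` is Python's word-boundary syntax. Written as a raw string, it does not need the double escaping mentioned for awk.

```python
import re
from base64 import b64decode


def grep_documents(lines, regex):
    """Yield each decoded document in which `regex` matches somewhere."""
    pattern = re.compile(regex)
    for line in lines:
        doc = b64decode(line).decode()  # assumes the payload is UTF-8 text
        if pattern.search(doc):
            yield doc


# The three sample lines from the question.
sample_lines = [
    "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
    "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
    "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
]

# \bbar\b matches "bar" as a whole word, so <barometer/> is not selected.
for doc in grep_documents(sample_lines, r"\bbar\b"):
    print(doc, end="\n\n")
```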

edited May 23 '18 at 10:26

answered May 23 '18 at 09:28

PM 2Ring

Definitely much better than my answer below. – Zapho Oxx May 23 '18 at 09:34
@ZaphoOxx +1 for self-criticism :) – Gábor Lipták May 23 '18 at 09:41
Question: what would it look like if it read from standard input? – Gábor Lipták May 23 '18 at 09:45
@GáborLipták Yes, you can hard-code `sys.stdin` in place of the opened file, if you like. Alternatively, on Linux you can pass `/dev/stdin` as the filename. – PM 2Ring May 23 '18 at 10:19
I'm curious to know how the awk version compares in speed to my python version. awk is faster at simple text processing, but there's some overhead in that pipeline to the `base64` command. – PM 2Ring May 23 '18 at 10:42
@PM2Ring After processing some big files I realized that awk is too slow. I also needed to change the script to set the output encoding, otherwise I could not redirect the output into a file. See https://stackoverflow.com/a/19146524/337621 – Gábor Lipták Jun 04 '18 at 14:48
@GáborLipták I see. You didn't mention encoding issues in your question, so I didn't worry about it in my answer. I've set UTF-8 as the encoding in my terminal (konsole), so my Python code normally does what I want. But maybe for your use case you should be writing to a named file instead of using redirection. You can pass an `encoding` keyword arg to `open` to ensure a specific encoding. – PM 2Ring Jun 04 '18 at 15:21
@PM2Ring No problem. It is unusual for me that a program behaves differently when I redirect its output. I am not used to it. I have to learn more Python :) – Gábor Lipták Jun 05 '18 at 07:01
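
The suggestion of writing to a named file with an explicit `encoding` could look roughly like the sketch below. It assumes the decoded documents are UTF-8. The function name `write_matches` and the sample document are invented for illustration, and a temporary directory is used only so that the example cleans up after itself.

```python
import os
import tempfile
from base64 import b64decode, b64encode


def write_matches(lines, pattern, out_path, encoding="utf-8"):
    """Write every decoded document containing `pattern` to `out_path`.

    Returns the number of documents written.
    """
    count = 0
    # An explicit encoding makes the result independent of the locale;
    # newline="" writes the documents' own line endings unchanged.
    with open(out_path, "w", encoding=encoding, newline="") as out:
        for line in lines:
            doc = b64decode(line).decode("utf-8")  # assumed input encoding
            if pattern in doc:
                out.write(doc + "\n\n")
                count += 1
    return count


# Hypothetical sample document containing a non-ASCII character.
sample_doc = '<foo>\n<bar label="café"/>\n</foo>'
sample_lines = [b64encode(sample_doc.encode("utf-8")).decode("ascii")]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "matches.txt")
    written = write_matches(sample_lines, "bar", out_path)
    print(written, "document(s) written")
```

If you prefer to keep shell redirection, `sys.stdout.reconfigure(encoding="utf-8")` (Python 3.7 and later) is another way to pin the output encoding.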

kvantour · Answer 2 · 2018-05-23T11:33:56.297

Update: if you know that the root node of each document is <foo>, then you can simply do:

$ echo "<head>$(base64 --decode < file)</head>" | \
  xmlstarlet sel -t -m '//bar/ancestor::foo' -c .

It selects the ancestor named foo of the node called bar. Since foo is the root node of each document, this selects the whole requested XML document.

original answer below:

Using xmlstarlet you might want to do this:

$ echo "<head>$(base64 --decode < file)</head>" | \
  xmlstarlet sel -t -m '//bar/ancestor::*[last()-1]' -c .

This essentially selects the full XML tree of ancestors of the node 'bar', but it only goes up to the correct depth.

I added an extra head node to make the full string a valid xml file. This way you only need to print from the first node onwards.

The echo would produce something like (slightly different version):

<head> 
  <foo /> 
  <foo> 
    <barometer /> 
  </foo> 
  <foo> 
    <DDD> 
      <BBB/> 
      <bar /> 
    </DDD> 
  </foo> 
</head>

xmlstarlet will do a template selection based on the xpath //bar/ancestor::*, leading to the following set of matches

<bar />
<DDD><BBB /><bar /></DDD>
<foo><DDD><BBB /><bar /></DDD></foo>
<head> everything </head>

We are interested in the penultimate one, i.e. [last()-1] and we ask to print a copy of it -c .
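
The same idea, selecting whole documents that contain an element named bar, can be sketched in Python with the standard library's `xml.etree.ElementTree` instead of xmlstarlet. This is an alternative illustration, not what the commands above do internally: it parses each decoded line as its own document, so no wrapping `<head>` node is needed. The function name `documents_with_element` is hypothetical.

```python
import xml.etree.ElementTree as ET
from base64 import b64decode


def documents_with_element(lines, tag):
    """Yield each decoded document that contains an element named `tag`."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, which are not valid XML
        raw = b64decode(line)
        root = ET.fromstring(raw)
        # iter() walks the root and all of its descendants.
        if next(root.iter(tag), None) is not None:
            yield raw.decode()  # assumes the payload is UTF-8 text


# The three sample lines from the question.
sample_lines = [
    "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
    "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
    "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
]

for doc in documents_with_element(sample_lines, "bar"):
    print(doc, end="\n\n")
```

Like the XPath `//bar`, this matches only elements named exactly bar, so the document containing <barometer/> is not selected. That differs from the substring match asked for in the question.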

score 1 · Answer 3 · answered May 23 '18 at 19:44

Perl to the rescue:

perl -MMIME::Base64 -nE '$_=decode_base64($_);/bar/&&say' fileContaining...txt

or

cat fileContaining...txt | perl -MMIME::Base64 -nE'$_=decode_base64($_);/bar/&&say'

Kjetil S.

score 0 · Answer 4 · answered May 23 '18 at 09:10

You can try the following Python script. It is not a command-line one-liner, but it should give you what you want. Usage:

>python3 get_xml.py SEARCHSTRING FILENAME

The output for your example was:

<foo>
<bar/>
</foo>
<foo>
<barometer/>
</foo>

script:

import base64
import sys
script_name = sys.argv[0]
search_string = sys.argv[1]
filename = sys.argv[2]
print("[+] ({}) search for {} in {}".format(script_name, search_string, filename))
with open(filename,"r") as xml_in:
    nextline = xml_in.readline()
    while nextline != '':
        xml = base64.b64decode(nextline).decode("utf-8").rstrip()
        if search_string in xml:
            print(xml)
        nextline = xml_in.readline()

score 0 · Answer 5 · answered May 23 '18 at 09:15

You can use tr inside a loop to remove all new lines for each of the XML documents like this:

#!/bin/bash

while IFS='' read -r line
do
    echo -n "$line" | base64 --decode | tr -d '\r\n'
    echo
done < fileContainingBase64EncodedXMLsInEachLine.txt
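
The same decode-and-flatten step can be sketched in Python, which avoids starting `base64` and `tr` once per line. This is an illustrative equivalent rather than a drop-in replacement, and the function name `flatten` is made up. Each document ends up on a single line, so the output can be piped straight into grep.

```python
from base64 import b64decode


def flatten(line):
    """Decode one base64 line and delete all CR and LF characters."""
    doc = b64decode(line).decode()  # assumes the payload is UTF-8 text
    return doc.replace("\r", "").replace("\n", "")


# The three sample lines from the question.
sample_lines = [
    "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
    "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
    "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
]

for line in sample_lines:
    print(flatten(line))
```

Note that, as with `tr -d '\r\n'`, the line breaks inside each document are lost, so the output is greppable but no longer matches the multi-line form shown in the question's expected output.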

martin_joerg

Please see [Why is using a shell loop to process text considered bad practice?](https://unix.stackexchange.com/q/169716/88378). The main reason is that `read` is very CPU-intensive: it issues a system call to the kernel for _each character_ it reads. – PM 2Ring May 23 '18 at 10:35
