1

I have a file where each line is a base64-encoded XML document. The decoded XML documents may contain newline characters. I would like to grep out each XML document containing a given word.

The problem is that, when I decode the lines of the file, I have multiple lines for each base64-encoded line and I cannot grep it any more. I need something like base64 decode + remove line breaks in one step.

How can I achieve that in the Linux shell? I have Python, Perl and awk available.

>cat fileContainingBase64EncodedXMLsInEachLine.txt | what should I write here?

Input:

PGZvbz4NCjxiYXIvPg0KPC9mb28+
PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==
PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+

Expected Output

Let's say I want the XML documents containing 'bar'

<foo>
<bar/>
</foo>
<foo>
<barometer/>
</foo>

An example for my problem

>cat fileContainingBase64EncodedXMLsInEachLine.txt | base64 --decode | grep bar

Delivers:

<bar/>
<barometer/>

So I do not have the full xml documents containing bar and barometer.

Gábor Lipták

5 Answers

3

Here's some Python code that accepts a filename followed by the search word on the command line. As usual, if either argument contains spaces, it must be quoted.

import sys
from base64 import b64decode

fname, pattern = sys.argv[1:]
with open(fname) as f:
    for row in f:
        row = b64decode(row).decode()
        if pattern in row:
            print(row, end='\n\n')

Running this on your data with "bar" as the pattern arg gives:

<foo>
<bar/>
</foo>

<foo>
<barometer/>
</foo>

In order to practice my rather rusty awk skills, I decided to write an awk command line to do this. It uses the standard base64 command to do the decoding. Note that the |& two-way pipe is a gawk extension, so this needs GNU awk.

awk 'BEGIN{cmd="base64 -d"}; {print |& cmd; close(cmd,"to"); z=""; while(cmd |& getline s) z=z s "\n"; close(cmd); if (z~pat)print z}' pat='bar' testdata_b64.txt

You pass it the pattern using the pat argument, which can be a regex. You can send data to it via standard input, or you can give it one or more filenames on the command line.

Note that regex patterns need double escaping, e.g. pat='\\<bar\\>' matches the word bar.
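
For comparison, here is a minimal Python sketch of the same idea as the awk command: decode each line and keep the documents that match a regex. The function name matching_documents and the sample list are made up for illustration, and the sketch assumes the decoded documents are UTF-8. With a raw string the word-boundary pattern only needs to be written once.

```python
import base64
import re


def matching_documents(lines, pattern):
    """Yield each decoded document that matches the regex `pattern`."""
    regex = re.compile(pattern)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: every document is UTF-8 encoded.
        xml = base64.b64decode(line).decode("utf-8")
        if regex.search(xml):
            yield xml


# Sample input from the question.
sample = [
    "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
    "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
    "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
]

# \b marks a word boundary, so "barometer" is not matched here.
for doc in matching_documents(sample, r"\bbar\b"):
    print(doc)
```

Any iterable of strings works as the first argument, so sys.stdin or fileinput.input() can be passed instead of the sample list to read from standard input or from filenames.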

PM 2Ring
  • 1
    definitely much better than my answer below. – Zapho Oxx May 23 '18 at 09:34
  • @ZaphoOxx +1 for self-criticism :) – Gábor Lipták May 23 '18 at 09:41
  • Question: what would it look like if it worked with standard input? – Gábor Lipták May 23 '18 at 09:45
  • 1
    @GáborLipták Yes, you can hard-code `sys.stdin` in place of the opened file, if you like. However, on Linux, you can use `/dev/stdin` to pass stdin as a filename. – PM 2Ring May 23 '18 at 10:19
  • I'm curious to know how the awk version compares in speed to my python version. awk is faster at simple text processing, but there's some overhead in that pipeline to the `base64` command. – PM 2Ring May 23 '18 at 10:42
  • @PM2Ring after processing some big files I realized that awk is too slow. I needed to make some changes to the script to set the output encoding, otherwise I could not redirect the output into a file. See https://stackoverflow.com/a/19146524/337621 – Gábor Lipták Jun 04 '18 at 14:48
  • @GáborLipták I see. You didn't mention encoding issues in your question, so I didn't worry about it in my answer. I've set UTF-8 as the encoding in my terminal (konsole), so my Python code normally does what I want. But maybe for your use case you should be writing to a named file instead of using redirection. You can pass an `encoding` keyword arg to `open` to ensure a specific encoding. – PM 2Ring Jun 04 '18 at 15:21
  • @PM2Ring no problem. It is unusual for me that software behaves differently if I redirect its output. I am not used to it. I have to learn more Python :) – Gábor Lipták Jun 05 '18 at 07:01
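
On the redirection problem raised in the comments above, one possible workaround is to compare and write bytes, so Python never has to pick an output encoding. This is only a sketch: write_matches and the sample list are made-up names, and it assumes the search word is encoded as UTF-8 in the documents.

```python
import base64
import io


def write_matches(lines, word, out):
    """Write each matching document to the binary stream `out`, unchanged."""
    # Assumption: the search word appears UTF-8 encoded in the documents.
    needle = word.encode("utf-8")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        raw = base64.b64decode(line)
        if needle in raw:
            out.write(raw + b"\n")


# Sample input from the question.
sample = [
    "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
    "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
    "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
]

# Demonstration with an in-memory binary stream.
buf = io.BytesIO()
write_matches(sample, "bar", buf)
print(buf.getvalue().decode("utf-8"))
```

Passing sys.stdout.buffer as out sends the same bytes to standard output, whether it is a terminal or a redirected file.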
1

update: if you know that the first node name is <foo>, then you can just do:

$ echo "<head>$(base64 --decode <file>)</head>" | \
  xmlstarlet sel -t -m '//bar/ancestor::foo' -c .

It selects the ancestor named foo of each node called bar. Since foo is the root node of each document, this returns the whole requested XML document.

original answer below:

Using xmlstarlet you might want to do this:

$ echo "<head>$(base64 --decode <file>)</head>" | \
  xmlstarlet sel -t -m '//bar/ancestor::*[last()-1]' -c .

This essentially selects the full xml-tree of ancestors of the node 'bar', but it will only go up to the correct depth.

I added an extra head node to make the full string a valid xml file. This way you only need to print from the first node onwards.

The echo would produce something like (slightly different version):

<head> 
  <foo /> 
  <foo> 
    <barometer /> 
  </foo> 
  <foo> 
    <DDD> 
      <BBB/> 
      <bar /> 
    </DDD> 
  </foo> 
</head>

xmlstarlet will do a template selection based on the xpath //bar/ancestor::*, leading to the following set of matches, nearest ancestor first (the ancestor axis does not include the bar node itself):

  • <DDD><BBB /><bar /></DDD>
  • <foo><DDD><BBB /><bar /></DDD></foo>
  • <head> everything </head>

We are interested in the penultimate one, i.e. [last()-1], and we ask to print a copy of it with -c .
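
If xmlstarlet is not available, roughly the same selection can be sketched with Python's xml.etree.ElementTree. This swaps the ancestor axis for a loop over the children of the wrapper node: each child is one original document, and it is kept if it has a descendant with the given tag. The function name documents_containing is made up, and the wrapped string is the sample shown above.

```python
import xml.etree.ElementTree as ET

# The wrapped sample document shown above.
wrapped = """<head>
  <foo />
  <foo>
    <barometer />
  </foo>
  <foo>
    <DDD>
      <BBB/>
      <bar />
    </DDD>
  </foo>
</head>"""


def documents_containing(wrapped_xml, tag):
    """Return the children of the wrapper node that contain a `tag` element."""
    root = ET.fromstring(wrapped_xml)
    return [
        # strip() drops the whitespace that follows each child in the wrapper
        ET.tostring(doc, encoding="unicode").strip()
        for doc in root
        if doc.find(f".//{tag}") is not None
    ]


for doc in documents_containing(wrapped, "bar"):
    print(doc)
```

Like the xpath //bar, this matches the element name exactly, so a document containing only <barometer /> is not selected when searching for bar.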

kvantour
1

Perl to the rescue:

perl -MMIME::Base64 -nE '$_=decode_base64($_);/bar/&&say' fileContaining...txt

or

cat fileContaining...txt | perl -MMIME::Base64 -nE'$_=decode_base64($_);/bar/&&say'
Kjetil S.
0

You can try the following Python script. It is not a command-line one-liner, but it should give you what you want. Usage:

>python3 get_xml.py SEARCHSTRING FILENAME

Output for your example:

<foo>
<bar/>
</foo>
<foo>
<barometer/>
</foo>

script:

import base64
import sys

script_name = sys.argv[0]
search_string = sys.argv[1]
filename = sys.argv[2]
print("[+] ({}) search for {} in {}".format(script_name, search_string, filename))
with open(filename, "r") as xml_in:
    # Each line holds one base64-encoded XML document.
    for line in xml_in:
        xml = base64.b64decode(line).decode("utf-8").rstrip()
        if search_string in xml:
            print(xml)
Zapho Oxx
0

You can use tr inside a loop to remove all new lines for each of the XML documents like this:

#!/bin/bash

while IFS='' read -r line
do
    echo -n "$line" | base64 --decode | tr -d '\r\n'
    echo
done < fileContainingBase64EncodedXMLsInEachLine.txt
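
The same decode-and-flatten step can also be sketched in Python, which avoids starting base64 and tr once per line. The function name flatten and the sample list are made up for illustration, and the sketch assumes the documents are UTF-8.

```python
import base64


def flatten(line):
    """Decode one base64 line and remove all CR and LF characters."""
    # Assumption: the decoded document is UTF-8 encoded.
    decoded = base64.b64decode(line.strip()).decode("utf-8")
    return decoded.replace("\r", "").replace("\n", "")


# Sample input from the question.
sample = [
    "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
    "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
    "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
]

# One document per output line:
# <foo><bar/></foo>
# <foo><huhu/></foo>
# <foo><barometer/></foo>
for line in sample:
    print(flatten(line))
```

Looping over sys.stdin instead of the sample list gives one XML document per output line, which can then be piped into grep bar.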
martin_joerg
  • 2
    Please see [Why is using a shell loop to process text considered bad practice?](https://unix.stackexchange.com/q/169716/88378). The main reason is that `read` is very CPU-intensive: it issues a system call to the kernel for _each character_ it reads. – PM 2Ring May 23 '18 at 10:35