How do I strip characters from substrings within a list generated from an XML xpath search?

Question

This question is a supplement to an earlier question. If you need further background, you can check out the original question here:

Populating Python list using data obtained from lxml xpath command.

I have incorporated @ihor-kaharlichenko's excellent suggestion (from my original question) into the modified code below:

from lxml import etree as ET
from datetime import datetime

xmlDoc = ET.parse('http://192.168.1.198/Bench_read_scalar.xml')

response = xmlDoc.getroot()
tags = (
'address',
'status',
'flow',
'dp',
'inPressure',
'actVal',
'temp',
'valveOnPercent',
)

dmtVal = []

for dmt in response.iter('dmt'):
    val = [str(dmt.xpath('./%s/text()' % tag)) for tag in tags]
    val.insert(0, str(datetime.now())) #Add timestamp at beginning of each record
    dmtVal.append(val)

for item in dmtVal:
    str(item).strip('[')
    str(item).strip(']')
    str(item).strip('"')

This last block is where I am having problems. The data I am getting for dmtVal looks like:

[['2012-08-16 12:38:45.152222', "['0x46']", "['0x32']", "['1.234']", "['5.678']", "['9.123']", "['4.567']", "['0x98']", "['0x97']"], ['2012-08-16 12:38:45.152519', "['0x47']", "['0x33']", "['8.901']", "['2.345']", "['6.789']", "['0.123']", "['0x96']", "['0x95']"]]

However, I really want the data to look like this:

[['2012-08-16 12:38:45.152222', '0x46', '0x32', '1.234', '5.678', '9.123', '4.567', '0x98', '0x97'], ['2012-08-16 12:38:45.152519', '0x47', '0x33', '8.901', '2.345', '6.789', '0.123', '0x96', '0x95']]

I thought this was a fairly simple string stripping job, and I tried code inside the original iteration (where dmtVal was originally populated), but that didn't work, so I took the stripping operation outside the loop, as listed above, and it is still not working. I'm thinking I'm making some kind of noob-error, but can't find it. Any suggestions would be welcome!

Thanks to all of you for prompt and useful responses. Here is the corrected code:

from lxml import etree as ET
from datetime import datetime

xmlDoc = ET.parse('http://192.168.1.198/Bench_read_scalar.xml')

print '...Starting to parse XML nodes'

response = xmlDoc.getroot()

tags = (
'address',
'status',
'flow',
'dp',
'inPressure',
'actVal',
'temp',
'valveOnPercent',
)

dmtVal = []

for dmt in response.iter('dmt'):
    val = [' '.join(dmt.xpath('./%s/text()' % tag)) for tag in tags]
    val.insert(0, str(datetime.now())) #Add timestamp at beginning of each record
    dmtVal.append(val)

Which yields:

...Starting to parse XML nodes
[['2012-08-16 14:41:10.442776', '0x46', '0x32', '1.234', '5.678', '9.123', '4.567', '0x98', '0x97'], ['2012-08-16 14:41:10.443052', '0x47', '0x33', '8.901', '2.345', '6.789', '0.123', '0x96', '0x95']]
...Done

Thanks everyone!

sberry · Answer 1 · 2012-08-16T20:21:05.650

Given your current data as grps

SOLUTION 1 - ast.literal_eval

import ast
grps = [['2012-08-16 12:38:45.152222', "['0x46']", "['0x32']", "['1.234']", "['5.678']", "['9.123']", "['4.567']", "['0x98']", "['0x97']"], ['2012-08-16 12:38:45.152519', "['0x47']", "['0x33']", "['8.901']", "['2.345']", "['6.789']", "['0.123']", "['0x96']", "['0x95']"]]
desired_output = [[grp[0]] + [ast.literal_eval(item)[0] for item in grp[1:]] for grp in grps]

print desired_output

OUTPUT

[['2012-08-16 12:38:45.152222', '0x46', '0x32', '1.234', '5.678', '9.123', '4.567', '0x98', '0x97'], ['2012-08-16 12:38:45.152519', '0x47', '0x33', '8.901', '2.345', '6.789', '0.123', '0x96', '0x95']]

Explanation

ast.literal_eval is a safe way to do eval. It only evaluates literal data types (strings, numbers, tuples, lists, dicts, booleans, and None). In your case it will evaluate "['1.0']" to a list of length 1, like ['1.0']. You will probably want to read up on list comprehensions and make sure you understand them.
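As a small illustration of the explanation above, the sketch below parses one of the stringified values and shows that literal_eval rejects anything that is not a plain literal. The variable names are made up for the example.

```python
import ast

raw = "['0x46']"                 # a string that only looks like a list
parsed = ast.literal_eval(raw)   # a real list holding one string
value = parsed[0]                # the string we actually wanted

# Anything other than a plain literal (here, a function call) is refused.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True

print(value)
```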

The other way to write this would have been:

desired_output = []
for grp in grps:  # loop through each group
    new_grp = [grp[0]]  # start a new list holding the first element (the timestamp)
    for item in grp[1:]:  # loop over every item from index 1 to the end
        evaluated_item = ast.literal_eval(item)  # get the evaluated data
        new_grp.append(evaluated_item[0])  # append the item in the 1 item list to the new_grp
    desired_output.append(new_grp)  # append the new_grp to the desired_output list

SOLUTION 2 - regular expressions

import re
stripper = re.compile(r"[\[\]']")  # raw string: matches any of [ ] '
grps = [['2012-08-16 12:38:45.152222', "['0x46']", "['0x32']", "['1.234']", "['5.678']", "['9.123']", "['4.567']", "['0x98']", "['0x97']"], ['2012-08-16 12:38:45.152519', "['0x47']", "['0x33']", "['8.901']", "['2.345']", "['6.789']", "['0.123']", "['0x96']", "['0x95']"]]
desired_output = [[grp[0]] + [ stripper.sub('', item) for item in grp[1:]] for grp in grps]

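The problem with your solution is that strings are immutable: str(item).strip('[') returns a new string, and your loop throws that result away, so the original data never changes. In addition, item is a whole inner list at that point, so str(item) builds a string from the entire list rather than working on its individual elements.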
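A minimal sketch of why the stripping loop in the question has no effect, using made-up sample values: str.strip returns a new string, and rebinding the loop variable does not change the list being iterated.

```python
item = "['0x46']"

item.strip("[]'")             # the returned string is discarded
cleaned = item.strip("[]'")   # the returned string is kept

row = ["['0x46']", "['0x32']"]
for value in row:
    value = value.strip("[]'")  # rebinds the loop variable only; row is untouched

# Building a new list keeps the stripped results.
row_fixed = [value.strip("[]'") for value in row]

print(row)
print(row_fixed)
```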

SOLUTION 3 - fix for your original code

To fix your solution, you would do:

for i, grp in enumerate(dmtVal):  # loop over the inner lists
    for j, item in enumerate(grp):  # loop over the strings in each inner list
        dmtVal[i][j] = item.strip(']')
        dmtVal[i][j] = dmtVal[i][j].lstrip('[')
        dmtVal[i][j] = dmtVal[i][j].strip("'")

Instead of assigning the value to dmtVal[i][j] each time you strip, you could work on the loop variable item, manipulate it, then assign it back to dmtVal[i][j] at the end.

for i, grp in enumerate(dmtVal):  # loop over the inner lists
    for j, item in enumerate(grp):
        # Could instead be
        item = item.strip(']')
        item = item.lstrip('[')
        item = item.strip("'")
        dmtVal[i][j] = item

Or a better solution (imho):

for i, grp in enumerate(dmtVal):  # loop over the inner lists
    for j, item in enumerate(grp):
        dmtVal[i][j] = item.replace('[', '').replace(']', '').replace("'", '')

That code works, but it isn't the most pedagogical answer I have ever seen. Do we want this site to give people digital voodoo incantations, or to help the askers understand the solution? — JosefAssad, Aug 16 '12 at 19:45
@JosefAssad: right you are. I am adding an explanation of both solutions. — sberry, Aug 16 '12 at 19:58
Thanks sberry, for 3 good solutions. Thanks, @JosefAssad , for encouraging good explanations, it sure makes understanding (vs. cut+paste copying) much easier. — Red Spanner, Aug 16 '12 at 20:57

score 1 · Answer 2 · answered Aug 16 '12 at 19:59

This'll do what you need it to, maybe not the fanciest of ways though:

new_dmt_val = []
for sublist in dmtVal:
    new_dmt_val.append([elem.strip('[\'').strip('\']') for elem in sublist])

Tried to make it readable, it's probably doable in fewer, but more confusing lines.
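The chained calls above work because the argument to strip is treated as a set of characters, not as a literal prefix or suffix. A short sketch with illustrative values:

```python
raw = "['0x46']"

# The argument is a set of characters, so their order does not matter.
step1 = raw.strip("['")      # removes the leading [ and '; the trailing ] is not in the set, so the right end is untouched
step2 = step1.strip("']")    # removes the trailing ' and ]
one_call = raw.strip("[]'")  # same result in a single call

# Characters in the middle of the string are never touched.
inner = "['a]b']".strip("[]'")

print(step1)
print(step2)
```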

score 1 · Accepted Answer · answered Aug 16 '12 at 20:33

The answer is: Don't create the strings in the first place.

Your problem is in this part of the code:

for dmt in response.iter('dmt'):
    val = [str(dmt.xpath('./%s/text()' % tag)) for tag in tags]

I'm guessing you used str() here to try to extract the string from the list xpath() returns.
However, this is not what you're getting; str() just gives you a string representation of the list.

You have a few choices to do what you want.
But given you're parsing XML, and therefore can't know for sure how many elements the list will contain, your best option is probably using ''.join():

for dmt in response.iter('dmt'):
    val = [''.join(dmt.xpath('./%s/text()' % tag)) for tag in tags]

EDIT: You won't need your last loop if you use this code.
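To see the difference without a live XML document, the sketch below uses plain lists as stand-ins for what xpath('./tag/text()') returns (lxml returns a list of string-like objects). The sample values are assumptions made for illustration.

```python
# Stand-ins for xpath() results: a list with one, zero, or several text nodes.
one_match = ['0x46']
no_match = []
two_matches = ['1.2', '34']

as_repr = str(one_match)        # string representation of the list itself
joined = ''.join(one_match)     # the text content
empty = ''.join(no_match)       # a missing tag gives an empty string, not an error
merged = ''.join(two_matches)   # several text nodes are concatenated

# Indexing is the obvious alternative, but it fails when nothing matched.
try:
    first = no_match[0]
except IndexError:
    first = None

print(as_repr)
print(joined)
```

Using ' '.join() instead, as in the asker's corrected code, only differs when a tag yields more than one text node.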

Thanks for an elegant solution that solves the root-cause problem -- an xpath output that wasn't quite right. — Red Spanner, Aug 16 '12 at 21:02

randomfigure · Answer 4 · 2012-08-16T23:00:07.183

str.strip only strips leading and trailing characters. You may want to use str.replace instead. Also note that str.strip (and str.replace) return a new string rather than modifying the original.

or simply use ''.join() in place of str and forgo the whole stripping business completely:

val = [''.join(dmt.xpath('./%s/text()' % tag)) for tag in tags]

as a side note, you probably want to use datetime.isoformat instead of str too:

val.insert(0, datetime.now().isoformat()) #Add timestamp at beginning of each record

see help(datetime) for more options
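For comparison, here is what the two conversions produce for a fixed timestamp (chosen to match the sample data in the question, so the output is reproducible):

```python
from datetime import datetime

stamp = datetime(2012, 8, 16, 12, 38, 45, 152222)

as_str = str(stamp)              # date and time separated by a space
as_iso = stamp.isoformat()       # ISO 8601, date and time separated by 'T'
as_iso_space = stamp.isoformat(' ')  # the separator can be chosen explicitly

print(as_str)
print(as_iso)
```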

edited Aug 16 '12 at 23:00

answered Aug 16 '12 at 20:59


Thanks for the suggestion about `isoformat` -- I'm going to go look at it right now. – Red Spanner Aug 16 '12 at 21:10
I've added the `isoformat` option to the code in this script, but also added it to code in a related module, which is also collecting data from an instrument (no XML on this one!) - Thanks! – Red Spanner Aug 16 '12 at 23:54

score 1 · Answer 5 · answered Aug 16 '12 at 21:02

Where xml is the string of your original post... (I think this covers both in a way...)

from lxml import etree
from datetime import datetime
from ast import literal_eval

tree = etree.fromstring(xml).getroottree()
dmts = []
for dmt in tree.iterfind('dmt'):
    to_add = {'datetime': datetime.now()}
    to_add.update( {n.tag:literal_eval(n.text) for n in dmt} )
    dmts.append(to_add)

You can still order the nodes explicitly later - although I find this approach clearer as you can just use names rather than indexing (this all depends whether the introduction or removal of a node should be an error though)
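A sketch of ordering the nodes explicitly afterwards, assuming records shaped like the dicts built above. The sample records and the shortened tags tuple are made up for the example; note that literal_eval turns '0x46' into the integer 70 rather than leaving it as a string.

```python
from datetime import datetime

tags = ('address', 'status', 'flow')  # shortened version of the asker's tag list

# Hypothetical records, shaped like the dicts produced by the loop above.
records = [
    {'datetime': datetime(2012, 8, 16, 12, 38, 45), 'flow': 1.234, 'address': 0x46, 'status': 0x32},
    {'datetime': datetime(2012, 8, 16, 12, 38, 46), 'flow': 8.901, 'address': 0x47, 'status': 0x33},
]

# Turn each dict back into a row with the timestamp first and a fixed column order.
rows = [[rec['datetime'].isoformat()] + [rec[tag] for tag in tags] for rec in records]

print(rows)
```

Looking up rec[tag] raises KeyError when a node is missing; rec.get(tag) would return None instead, depending on whether a missing node should count as an error.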

Jon Clements


Hi Jon, Thanks for this -- I've implemented the join approach suggested by @stranac above. The ast module looks like a sophisticated approach to the solution, but also complex, and I'm still a relative Python noob! – Red Spanner Aug 16 '12 at 21:09
@RedSpanner We've all been "noobs" in a language at some point - I hope you enjoy your experience in Python – Jon Clements Aug 16 '12 at 22:28
