I have a list that looks like this:

stuff = ['\n', '<td><nobr>8h</nobr></td>', '\n', '<td><nobr>2021-04-02 14:27:44.729</nobr></td>', '\n', '<td class="text-right">1.73</td>;', '\n']

I am trying to clean it up so that it looks like this:

stuff = ["8h","2021-04-02 14:27:44.729","1.73"]

What I am trying to do is this:

for x in range(0,len(stuff),1):
     stuff[x] = stuff[x].replace("\n","")
     stuff[x] = stuff[x].replace("<td>","")

I am hoping to remove the characters if they are there. If not, I'm hoping that part will just be skipped.
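For reference, this is how `str.replace` already behaves: when the substring is not present, it returns an unchanged copy and raises no error. A minimal sketch, assuming every element of the list really is a `str` (the shortened list and the name `cleaned` are only for illustration):

```python
# str.replace returns a new string; if the substring is absent,
# the result equals the original and nothing is raised.
stuff = ['\n', '<td><nobr>8h</nobr></td>', '<td class="text-right">1.73</td>;']

cleaned = []
for item in stuff:
    item = item.replace("\n", "")    # no-op for items without a newline
    item = item.replace("<td>", "")  # no-op for items without a bare <td>
    cleaned.append(item)

print(cleaned)
# ['', '<nobr>8h</nobr></td>', '<td class="text-right">1.73</td>;']
```

Note that the last element is untouched because its opening tag is `<td class="text-right">`, not a bare `<td>`.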

The error message I am getting is

TypeError: 'NoneType' object is not callable

Any suggestions?

Edit #1:

I believe this has something to do with the \n values messing things up. I'm not sure if this is accurate, but that's my feeling.

  • Why `for x in range(0,len(stuff),1):` instead of `for x in stuff:`? Also, this could help: [Python code to remove HTML tags from a string](https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string). – GG. Apr 02 '21 at 23:43
  • I'll take a look at the link, but using for x in range(0,len(stuff),1) is just how I've always done it. Is there a reason to use one over the other? – Chicken Sandwich No Pickles Apr 02 '21 at 23:45
  • I wonder whether you accidentally set stuff to None before you hit the loop. Have you tried stepping through the code with breakpoints and debugging it? Also, I am presuming that in your actual code, the second item in the array stuff is also a string. Right now only the \n is a string. – Druhin Bala Apr 02 '21 at 23:46
  • `for x in stuff` is cleaner, unless you specifically need the index for computation – Druhin Bala Apr 02 '21 at 23:47
  • You could use BeautifulSoup if you already have it installed (it looks like you web-scraped this data), and then get the text from each element of your list: soup = BeautifulSoup("<td><nobr>8h</nobr></td>", "lxml"); soup.find("td").text – Je Je Apr 02 '21 at 23:54
  • Your code doesn't raise an error for me, using PyCharm and Python 3.7.9. – Je Je Apr 03 '21 at 00:07
  • I'll give it a shot, thanks – Chicken Sandwich No Pickles Apr 03 '21 at 00:08
  • [Could not reproduce](https://ideone.com/1zH9hG) – interjay Apr 03 '21 at 00:13
  • Does this answer your question? [Strip HTML from strings in Python](https://stackoverflow.com/questions/753052/strip-html-from-strings-in-python) (strip html using answers from this question, then use [strip](https://docs.python.org/3.4/library/stdtypes.html?highlight=strip#str.strip) to remove the \n etc.) – Stuart Apr 03 '21 at 00:55
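The tag-stripping approach suggested in the comments can be sketched with only the standard library. This is a rough illustration rather than the code from the linked answers; `TextExtractor` and `strip_tags` are hypothetical names, and it assumes every list element is a plain `str`:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored.
        self.parts.append(data)


def strip_tags(fragment):
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


stuff = ['\n', '<td><nobr>8h</nobr></td>', '\n',
         '<td><nobr>2021-04-02 14:27:44.729</nobr></td>', '\n',
         '<td class="text-right">1.73</td>;', '\n']

# Strip tags, then trim whitespace and the trailing ";", and drop empty results.
cleaned = [text for text in (strip_tags(s).strip().rstrip(";") for s in stuff) if text]
print(cleaned)
# ['8h', '2021-04-02 14:27:44.729', '1.73']
```

Using a parser instead of `replace` means tags with attributes, such as `<td class="text-right">`, are handled without listing every variant.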

2 Answers

I should say I'm definitely not proud of my code, but here is what I came up with:

import re
stuff = ['\n', '<td><nobr>8h</nobr></td>', '\n', '<td><nobr>2021-04-02 14:27:44.729</nobr></td>', '\n', '<td class="text-right">1.73</td>;', '\n']
def get_stuff(el):
    # Raw strings keep the regex escapes intact; "/" needs no escaping in Python.
    pattern1 = r"<td><nobr>(?P<inner>.+)</nobr></td>"
    pattern2 = r'<td class=\s*".+"\s*>(?P<inner>.+)</td>'
    result1 = re.search(pattern1, el)
    result2 = re.search(pattern2, el)
    if result1:
        return result1.group("inner")
    if result2:
        return result2.group("inner")
    return None  # no <td> cell in this element (e.g. a bare "\n")
last_list = list(map(get_stuff, stuff))
# Drop the elements that did not match either pattern.
print([x for x in last_list if x is not None])

Result

['8h', '2021-04-02 14:27:44.729', '1.73']

Update

So I came up with a better idea (still not proud of it):

import re
stuff = ['\n', '<td><nobr>8h</nobr></td>', '\n', '<td><nobr>2021-04-02 14:27:44.729</nobr></td>', '\n', '<td class="text-right">1.73</td>;', '\n']
def get_stuff(el):
    # Matches <nobr>/</nobr>, <td>/</td> (with an optional class attribute), newlines and ";".
    pattern = r'</?nobr>|</?td\s*(class\s*=\s*".+"\s?)?>|\n|;'
    return re.sub(pattern, "", el)
last_list = list(map(get_stuff, stuff))
# Elements that were only "\n" are now empty strings; drop them.
print([x for x in last_list if x != ''])

Result (still the same):

['8h', '2021-04-02 14:27:44.729', '1.73']
TheFaultInOurStars

If my understanding is correct, you want to remove two kinds of content:

  1. anything between < and >;
  2. a list of undesirable characters, e.g. \n or ;.

The snippet below does the job.


import re

stuff = ['\n', '<td><nobr>8h</nobr></td>', '\n', '<td><nobr>2021-04-02 14:27:44.729</nobr></td>', '\n', '<td class="text-right">1.73</td>;', '\n']

ans = []
for x in stuff:
    x = re.sub(r"<.*?>", "", x)  # remove anything between < and >
    x = re.sub(r"(\n|;)", "", x)  # remove unwanted characters
    if x:  # skip elements that are now empty
        ans.append(x)

print(ans)  # ['8h', '2021-04-02 14:27:44.729', '1.73']
Sam Lee
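
As a possible variation on the answer above (not part of the original answer), the two substitutions can be folded into one precompiled pattern. This is only a sketch; the names `UNWANTED` and `cleaned` are made up for illustration, and it assumes tags never contain a literal `>` inside an attribute value:

```python
import re

stuff = ['\n', '<td><nobr>8h</nobr></td>', '\n',
         '<td><nobr>2021-04-02 14:27:44.729</nobr></td>', '\n',
         '<td class="text-right">1.73</td>;', '\n']

# One alternation: either a whole tag (<...>) or a single unwanted character.
UNWANTED = re.compile(r"<[^>]*>|[\n;]")

# Substitute once per element and keep only the non-empty results.
cleaned = [text for text in (UNWANTED.sub("", s) for s in stuff) if text]
print(cleaned)
# ['8h', '2021-04-02 14:27:44.729', '1.73']
```

Compiling the pattern once avoids re-parsing it on each iteration, although Python's `re` module caches recent patterns anyway, so the gain is mostly in readability.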