I've been trying to extract a timestamp from an HTML page. I've looked at other methods, but I can't seem to apply them to my case. I need the timestamp for many messages, but I can't get the data out of the div.

          <div data-sigil="message-text" data-store='{"timestamp":1425541012960,"author":100004932254581,"uuid":"mid.1425541012942:e2ebd68467f39a6954"}' data-store-id="53666">
           <span>
            I'm a antibacterial
           </span>
           <div class="messageAttachments">
           </div>
          </div>

The code I'm using is this.

    timestamp = []
    soup = BeautifulSoup(open('Messenger.html', encoding='utf-8'), 'html.parser')
    div = soup.div
    timestamp.append = div.attrs['data-store']
    print(timestamp)

There are a number of timestamps I'm trying to list, if that helps.

Edit: here is the error message I'm receiving.

    timestamp.append = div.attrs['data-store']
    KeyError: 'data-store'

Edit 2: using a combination of both answers below, I got it working. Thanks to everyone who helped :)

    # Select only the message divs, then parse each data-store value.
    messages = soup.find_all('div', {'data-sigil': 'message-text'})
    for message in messages:
        stamp = ast.literal_eval(message.attrs['data-store'])['timestamp']
        timestamp.append(stamp)

  • `timestamp.append(div.attrs['data-store'])`? What's the problem you have here? – Psidom Mar 13 '17 at 00:42
  • Sorry, I should have included the error; adding it now. – jacob Bailey Mar 13 '17 at 00:48
  • You need to be more restrictive about your `div`; you are probably not finding the `div` node you expect. You could use `id` or `class` to select which `div` you want from the HTML. – Psidom Mar 13 '17 at 00:54
  • That's probably my issue. The `data-sigil="message-text"` attribute is always at the front; how do I use that to identify the lines? – jacob Bailey Mar 13 '17 at 00:58

2 Answers

1

Building on what has already been discussed here, you can convert the string into an actual dictionary using `ast.literal_eval()`.

In the following code, `soup.div.attrs['data-store']` gets the `data-store` attribute from the div, `ast.literal_eval()` converts that string into a dictionary, and the key `['timestamp']` then retrieves the corresponding value.

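    import ast
    from bs4 import BeautifulSoup

    timestamp = []
    # Open the file in a with-block so it is closed after parsing.
    with open('Messenger.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    # soup.div is the first <div> in the document.
    stamp = ast.literal_eval(soup.div.attrs['data-store'])['timestamp']

    timestamp.append(stamp)
    print(timestamp)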

Output:

    [1425541012960]
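
As a side note, the `data-store` value shown in the question looks like JSON, so `json.loads` may be a more natural fit than `ast.literal_eval`. The following is only a sketch under that assumption: `raw` is the attribute string copied from the question, while the `hypothetical` payload with a `read` flag is invented here purely to illustrate where the two parsers differ.

```python
import ast
import json

# The data-store value from the question, as a plain string.
raw = '{"timestamp":1425541012960,"author":100004932254581,"uuid":"mid.1425541012942:e2ebd68467f39a6954"}'

# json.loads turns the JSON text into a Python dict.
store = json.loads(raw)
stamp = store["timestamp"]
print(stamp)

# Hypothetical payload (not from the question) containing a JSON boolean.
hypothetical = '{"timestamp":1425541012960,"read":true}'
print(json.loads(hypothetical)["read"])

# ast.literal_eval only accepts Python literals, so lowercase `true` is rejected.
try:
    ast.literal_eval(hypothetical)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False
print(literal_eval_ok)
```

For the exact string in the question both approaches give the same timestamp; the difference only shows up if the payload ever contains `true`, `false`, or `null`.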
Spherical Cowboy
  • That looks like a good solution; however, I'm getting the following error when I run it in my code: `stamp = ast.literal_eval(soup.div.attrs['data-store'])['timestamp']` `KeyError: 'data-store'` – jacob Bailey Mar 13 '17 at 00:53
  • If there are many divs (some of which have the `data-store` attribute and some of which don't), you might want to use [findAll](https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html). Search for it on the linked page. – Spherical Cowboy Mar 13 '17 at 01:00
0

It's very likely that you didn't select the div tag you meant to: `soup.div` returns only the first div in the document. You can use attributes to restrict the selection. For instance, passing `data-store-id` to `find` should return this exact div, since an id is usually unique:

    soup.find('div', {'data-store-id': '53666'}).attrs['data-store']
    # '{"timestamp":1425541012960,"author":100004932254581,"uuid":"mid.1425541012942:e2ebd68467f39a6954"}'
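
To make the likely cause concrete, here is a small sketch. The wrapper div with `id="root"` is a hypothetical stand-in for whatever outer div the real `Messenger.html` contains (the question does not show it), and the `data-store` payload is shortened to the timestamp only.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a wrapper div without data-store precedes the message div.
html = """
<div id="root">
  <div data-sigil="message-text" data-store='{"timestamp":1425541012960}' data-store-id="53666">
    <span>I'm a antibacterial</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.div is the first <div> in the document, here the wrapper.
first_div = soup.div
print(first_div.get("id"))
# The wrapper has no data-store attribute, so first_div.attrs['data-store']
# would raise KeyError: 'data-store'.
print(first_div.has_attr("data-store"))

# Restricting the search by attribute selects the message div instead.
message_div = soup.find("div", {"data-sigil": "message-text"})
raw_store = message_div.attrs["data-store"]
print(raw_store)
```

If the real page is structured differently, the same check (`has_attr` on `soup.div`) should still show whether the first div is the one you expected.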

Update:

With `find_all`, you can use a list comprehension that tests whether each div has the `data-store` attribute: if it does, its value is collected; if not, the div is filtered out:

    [div.attrs['data-store'] for div in soup.find_all('div') if div.has_attr('data-store')]
    # ['{"timestamp":1425541012960,"author":100004932254581,"uuid":"mid.1425541012942:e2ebd68467f39a6954"}']
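
Going one step further, the comprehension can parse each value and keep only the timestamp. This is a minimal sketch, assuming the `data-store` values are valid JSON; the inline HTML and the second timestamp are made up for illustration and stand in for the real file.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical thread with two messages and one div that has no data-store.
html = """
<div id="thread">
  <div data-sigil="message-text" data-store='{"timestamp":1425541012960}'><span>first</span></div>
  <div class="messageAttachments"></div>
  <div data-sigil="message-text" data-store='{"timestamp":1425541099000}'><span>second</span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result, so attributes are read per element,
# not on the result itself.
divs = soup.find_all("div")

timestamps = [
    json.loads(div["data-store"])["timestamp"]  # parse the string, keep one key
    for div in divs
    if div.has_attr("data-store")               # skip divs without the attribute
]
print(timestamps)
```

Indexing the result of `find_all` with a string such as `['data-store']` is what produces `TypeError: list indices must be integers or slices, not str`; iterating over it avoids that.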
Psidom
  • That works well, but when I try to use find_all to get more data, it errors out with `TypeError: list indices must be integers or slices, not str`. – jacob Bailey Mar 13 '17 at 01:11
  • You get the error because find_all returns a list of nodes instead of a single node, so you can't index it with an attribute name. You have to loop through the list and collect the timestamp from each node individually. – Psidom Mar 13 '17 at 01:19
  • I think you should accept the answer that solves the problem. – Psidom Mar 13 '17 at 01:29