Removing extra characters from a string via Python string functions

Question

Here is the HTML from which I want to extract the location information.

<div class="location">
    <div class="listing-location">Location</div>
    <div class="location-areas">
    <span class="location">Al Bayan</span>
    ‪,‪
    <span class="location">Nepal</span>
    </div>
    <div class="area-description"> 3.3 km from Mall of the Emirates </div>
    </div>

The Python BeautifulSoup4 code I used is:

    try:
        title = soup.find('span', {'id': 'listing-title-wrap'})
        title_result = str(title.get_text().strip())
        print "Title: ", title_result
    except StandardError as e:
        title_result = "Error was {0}".format(e)
        print title_result

Output:

"Al Bayanأ¢â‚¬آھ,أ¢â‚¬آھ

                            Nepal"

How can I convert the output into the following format?

['Al Bayan', 'Nepal']

What should the second line of the code be to get this output?

Are they all in that format? Some gibberish and then 2 line breaks then the real text? — Keatinge, Jun 01 '16 at 07:06
@Keatinge Yes, it's a constant format. Perhaps there is some function to remove this unwanted text and the spaces. — Panetta, Jun 01 '16 at 07:12
@Panetta I'm pretty sure you're doing this all wrong. Are you using the same HTML as in your other identical question from yesterday? If you are, I will show you a much easier way. — Keatinge, Jun 01 '16 at 07:12
Consider `a` as your string. `[i.replace(" ","") for i in filter(None,(a.decode('unicode_escape').encode('ascii','ignore')).split('\n'))]` — Rahul K P, Jun 01 '16 at 07:13
@Keatinge Yes, it's the same code. I couldn't get the answer there. All I want is to convert the output into the proper format. It's more of a Python string problem. — Panetta, Jun 01 '16 at 07:16
@Panetta no it's not a python string problem. Just use an html parser like BeautifulSoup and it's 100x easier. Look at my answer — Keatinge, Jun 01 '16 at 07:16

Keatinge · Accepted Answer · 2016-06-01T13:29:53.827

1

You're reading it wrong; just read the spans with class location:

from bs4 import BeautifulSoup

# html holds the page markup shown in the question
soup = BeautifulSoup(html, "html.parser")
locList = [loc.text for loc in soup.find_all("span", {"class" : "location"})]
print(locList)

This prints exactly what you wanted:

['Al Bayan', 'Nepal']
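
If BeautifulSoup is not installed, the same idea (read only the `span` elements with class `location`) can be sketched with the standard library. This is a minimal Python 3 sketch, not the code from the answer: the `LocationSpanParser` class and the `HTML` sample are names made up for illustration, and it assumes the matching spans are not nested.

```python
from html.parser import HTMLParser

# Sample markup copied from the question, trimmed to the relevant part.
# \u202a is the invisible LEFT-TO-RIGHT EMBEDDING character around the comma.
HTML = """
<div class="location-areas">
<span class="location">Al Bayan</span>
\u202a,\u202a
<span class="location">Nepal</span>
</div>
"""


class LocationSpanParser(HTMLParser):
    """Collects the text of every <span class="location"> element."""

    def __init__(self):
        super().__init__()
        self.locations = []    # text of each matching span, in document order
        self._inside = False   # True while the parser is inside a matching span

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "location" in classes:
            self._inside = True
            self.locations.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        # Text between the spans (the comma and the invisible characters) is skipped.
        if self._inside:
            self.locations[-1] += data


parser = LocationSpanParser()
parser.feed(HTML)
parser.close()
loc_list = [text.strip() for text in parser.locations]
print(loc_list)  # ['Al Bayan', 'Nepal']
```

Because only the span contents are collected, the comma and the stray characters between the spans never reach the result, so no string cleanup is needed afterwards.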

edited Jun 01 '16 at 13:29

answered Jun 01 '16 at 07:15

Keatinge


`[u'Al Bayan', u'Nepal']` This is the output. – Panetta Jun 01 '16 at 07:22
Map the list with `str`. That will give your expected result: `map(str,output_list)` – Rahul K P Jun 01 '16 at 07:24
@Panetta I've changed it slightly, run it now. There's no reason to use a map when there's already a list comp anyway – Keatinge Jun 01 '16 at 07:25
@Keatinge You are right. I just suggested an alternative option. – Rahul K P Jun 01 '16 at 07:26
@Panetta: do not call `str()` on Unicode strings; it breaks as soon as you get a non-ASCII character in it. If you don't like the `[u'Al Bayan', u'Nepal']` text representation, format the list manually, e.g., `print("\n".join(locList))` (to print each item on its own line). See [Python string prints as [u'String'\]](http://stackoverflow.com/a/36891685/4279) – jfs Jun 01 '16 at 12:39

score 0 · Answer 2 · answered Jun 01 '16 at 07:15


There is a one-line solution. Consider `a` as your string.

In [38]: [i.replace("  ","") for i in filter(None,(a.decode('unicode_escape').encode('ascii','ignore')).split('\n'))]
Out[38]: ['Al Bayan,', 'Nepal']
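
The one-liner above relies on Python 2 byte strings. For text that is already Unicode (the default in Python 3), a rough equivalent is to drop the invisible format characters and then split on the comma. The sketch below is only an illustration: the value of `raw` is an assumption based on the output shown in the question, and `split_locations` is a hypothetical helper name.

```python
import unicodedata

# Assumed input: the text of the whole location block, as printed in the question.
raw = "Al Bayan\u202a,\u202a\n\n                            Nepal"


def split_locations(text):
    """Drop invisible format characters, then split on commas and trim."""
    # Unicode category "Cf" covers format characters such as U+202A.
    visible = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Split on the comma, trim whitespace and newlines, skip empty pieces.
    return [part.strip() for part in visible.split(",") if part.strip()]


print(split_locations(raw))  # ['Al Bayan', 'Nepal']
```

Unlike `encode('ascii','ignore')`, this keeps any non-ASCII letters that appear in a place name and removes only the format characters.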

answered Jun 01 '16 at 07:15

Rahul K P


'ascii' codec can't encode character u'\u202a'. I tried it and this was the error. – Panetta Jun 01 '16 at 07:19
@Panetta What is your exact error, and what did you give as input? It worked for me. – Rahul K P Jun 01 '16 at 07:23

score 0 · Answer 3 · answered Jun 01 '16 at 07:19


You can use a regexp to keep only letters and spaces:

>>> import re
>>> re.findall('[A-Za-z ]+', area_result)
['Al Bayan', ' Nepal']

Hope it'll be helpful.
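
The matches above can keep the leading whitespace that precedes the second name. A small Python 3 sketch that trims each match is shown here; the value assigned to `area_result` is an assumption based on the output in the question.

```python
import re

# Assumed input, modeled on the output shown in the question.
area_result = "Al Bayan\u202a,\u202a\n\n                            Nepal"

# [A-Za-z ]+ matches runs of ASCII letters and spaces. strip() trims the
# padding, and the filter drops any match that consisted only of spaces.
locations = [m.strip() for m in re.findall(r"[A-Za-z ]+", area_result) if m.strip()]
print(locations)  # ['Al Bayan', 'Nepal']
```

Note that this character class only matches ASCII letters, so a name containing accented or non-Latin characters would be cut apart.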

answered Jun 01 '16 at 07:19

3kt

