0

I am having a problem stripping commas out of a string while doing some web scraping. My code is as follows.

import urllib
import re

htmlfile = urllib.urlopen("http://example.com")
htmltext = htmlfile.read()

regex = 'Posts: (.+?)\n'
value = re.compile(regex)
posts = re.findall(value, htmltext)

print posts[0]

The data comes through fine, but the post count arrives with commas in it, for example 1,092,391. I want to strip the commas out to leave a number such as 1092391.

I've got Python 2.7.1 installed, and nothing I've found on here or Google has seemed to work. I am a bit of a newbie, so I am no doubt missing something silly, but I do love to learn and get my hands dirty. Any help would be much appreciated.

tshepang
UKJason

4 Answers

2

Replace them:

posts[0].replace(',', '')

Or use the locale module (if your locale's thousands delimiter is a comma):

import locale

locale.setlocale(locale.LC_ALL, '')
n = locale.atoi(posts[0])
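As a minimal sketch of guarding against a locale whose separators differ from the scraped text, the conversion can fall back to plain comma removal when locale.atoi rejects the string. The helper name parse_count is hypothetical, and the sketch assumes any locale.setlocale call has already been made elsewhere, as above.

```python
import locale

def parse_count(text):
    """Return text such as '1,092,391' as an int (hypothetical helper)."""
    try:
        # Succeeds when the active locale uses a comma as thousands separator.
        return locale.atoi(text)
    except ValueError:
        # Otherwise treat commas as thousands separators regardless of locale.
        return int(text.replace(',', ''))

count = parse_count('1,092,391')
```

The fallback assumes the site always formats its counts with commas, whatever the locale of the machine running the script.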

I would advise against using just regex for scraping. Unless Posts: (.*?) is all you're after, parse the HTML with an HTML parser like lxml or BeautifulSoup.
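To give a rough idea of what parser-based extraction looks like, here is a small sketch. It swaps in the standard library's HTMLParser instead of lxml or BeautifulSoup so that it runs without extra installs. The sample markup and the class name PostCountParser are assumptions for illustration only; the real page structure is not known.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class PostCountParser(HTMLParser):
    """Collects integers from text nodes that start with 'Posts:'."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.counts = []

    def handle_data(self, data):
        text = data.strip()
        if text.startswith('Posts:'):
            number = text[len('Posts:'):].strip()
            self.counts.append(int(number.replace(',', '')))

# Hypothetical markup standing in for the downloaded page.
sample_html = '<div class="stats"><span>Posts: 1,092,391</span></div>'

parser = PostCountParser()
parser.feed(sample_html)
parser.close()
post_counts = parser.counts
```

With a real page, the parser would be fed the downloaded text instead of sample_html, and the matching rule would likely need adjusting to the actual markup.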

Blender
  • The locale thing won’t work when you have for example a German locale where the comma is the decimal point separator (*“ValueError: invalid literal for int() with base 10: '1.092.391'”*). – poke May 07 '13 at 17:12
  • @poke: Which is why I pointed it out in my comment. – Blender May 07 '13 at 17:18
  • That is my next step. I am very new to this. I intend to parse the HTML with my next step. Learning to code is easier than learning a spoken language I hope :) – UKJason May 07 '13 at 22:21
2
>>> '1,092,391'
'1,092,391'
>>> '1,092,391'.replace(',', '')
'1092391'
>>> int('1,092,391'.replace(',', ''))
1092391

"nothing I've found on here or Google has seemed to work"

I have a hard time believing that. A quick search for “Python string replace” should get you to str.replace very quickly, and searching for it in the Python documentation gets you there even faster. The first result I get for “Python comma replace” is even a question on SO that answers your problem.

And if all else failed, you could have used regular expressions, which you apparently already know how to use.

Community
poke
  • Thank you for the friendly reply, and apologies for any problems with my English. I have a .py file rather than a Python shell, as this has to run on a server, so I was having a hard time. I checked the link you gave me, and it is the one that came up in the Google search I mentioned. I got this error: Traceback (most recent call last): File "file.py", line 14, in <module> price = price.replace(",", "") AttributeError: 'list' object has no attribute 'replace'. That is why it didn't work. Most things I try seem to end like this. It is probably something really simple. – UKJason May 07 '13 at 17:30
  • For anyone searching in future, this seemed to work: print price[0].replace(',', '') – UKJason May 07 '13 at 17:39
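The AttributeError reported in the comments arises because re.findall returns a list of matched strings, and replace is a string method, not a list method. The sketch below illustrates the difference; the sample text and variable names are made up for illustration and stand in for the downloaded page.

```python
import re

# Hypothetical sample standing in for the downloaded page text.
html_text = "Posts: 1,092,391\nPosts: 12,345\n"

# re.findall returns a list of captured strings, not a single string.
matches = re.findall(r'Posts: (.+?)\n', html_text)

# matches.replace(...) would raise AttributeError, so index one item...
first_count = int(matches[0].replace(',', ''))

# ...or convert every match with a list comprehension.
all_counts = [int(m.replace(',', '')) for m in matches]
```

Indexing with [0] is enough when only the first match matters; the list comprehension covers pages that contain several counts.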
0

Here's a very simple way: just replace the comma with an empty string.

 >>> '1,092,391'.replace(',','')
 '1092391'
eduffy
0
"".join('1,092,391'.split(','))
Yarkee