
I am learning how to web scrape with Python using a Wikipedia article. I managed to get the data I needed (the tables) by calling the .get_text() method on the table rows (<tr>).

I am cleaning up the data in Pandas, and one of the routines involves getting the date a book or movie was published. The date can appear in several forms, such as: (1986), (1986-1989), (1986-present).

Currently, I am using the code below which works on a test sentence:

import re

# get the first column of row 19 from the table and get its text
test = data_collector[19].find_all('td')[0]
text = test.get_text()
# create and test the pattern (raw string, so the backslashes reach the regex engine unchanged)
pattern = re.compile(r'\(\d\d\d\d\)|\(\d\d\d\d-\d\d\d\d\)|\(\d\d\d\d-[ Ppresent]*\)')
re.findall(pattern, 'This is Agent (1857), the years were (1987-1868), which lasted from (1678- Present)')

I get the expected output on the test sentence.

['(1857)', '(1987-1868)', '(1678- Present)']

However, when I test it on a particular piece of text from the wiki article 'The Adventures of Sherlock Holmes (1891–1892) (series), (1892) (novel), Arthur Conan Doyle\n', I am able to extract (1892), but NOT (1891-1892).

text = test.get_text()
re.findall(pattern, text)
o/p: ['(1892)']

Even as I type this, I can see that the hyphen I am using and the one in the text are different. I am sure that this is the issue, and I was hoping someone could tell me what this particular symbol is called and how I can actually "type" it using my keyboard.

Thank you!

  • Are you sure there is a hyphen and not em-dash? Try `re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)`. See the [regex demo](https://regex101.com/r/BSM7JU/3). – Wiktor Stribiżew Feb 26 '19 at 10:59
  • If you can use [`regex`](http://pypi.python.org/pypi/regex), you can use the Unicode character category `\p{Pd}` to match all dashes - see https://stackoverflow.com/q/1832893/3001761 – jonrsharpe Feb 26 '19 at 11:00
  • I agree with @Wiktor, the character may not be exactly as it appears to be. Another solution would be to replace the '-' with '\S'. Meaning match any non white space character – JimmyA Feb 26 '19 at 11:05
  • `\p{Pd}` includes a lot of [symbols similar to hyphen](https://stackoverflow.com/a/39485500/3832970). Some do not look like hyphens though. Use `\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D` instead of hyphen/dashes then. Or, to match any non-word char, probably, other than `(` and `)`, `[^\w()]` => `re.compile(r'\(\d{4}(?:[^\w()]+(?:\d{4}|present))?\)', re.I)` – Wiktor Stribiżew Feb 26 '19 at 11:07
  • @WiktorStribiżew Thank you! Your solution works flawlessly. – Vignesh Viswanathan Feb 26 '19 at 11:10

1 Answer


I suggest enhancing the pattern to match the most common dashes (-, – and —) and changing the present part from a character class to a character sequence, so that it no longer matches strings such as sent the way [ Ppresent]* does:

re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)

See the regex demo. Note that the re.I flag makes the regex match case-insensitively.

Details

  • \( - a (
  • \d{4} - four digits ({4} is a limiting quantifier that repeats the pattern it modifies four times)
  • (?:[\s–—-]+(?:\d{4}|present))? - an optional (as there is a ? at the end) non-capturing (due to ?:) group matching 1 or 0 occurrences of
    • [\s–—-]+ - 1 or more whitespaces, -, – or —
    • (?:\d{4}|present) - either 4 digits or present
  • \) - a ) char.
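
As a quick check of the breakdown above, here is a minimal sketch that runs the pattern against the two strings quoted in the question. The variable names are only illustrative.

```python
import re

# Pattern from the answer: "(YYYY)", optionally followed by
# whitespace/dashes and then either "YYYY" or "present".
pattern = re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)

# The two sample strings quoted in the question.
test_sentence = ('This is Agent (1857), the years were (1987-1868), '
                 'which lasted from (1678- Present)')
wiki_text = ('The Adventures of Sherlock Holmes (1891–1892) (series), '
             '(1892) (novel), Arthur Conan Doyle\n')

test_matches = pattern.findall(test_sentence)
wiki_matches = pattern.findall(wiki_text)

print(test_matches)  # ['(1857)', '(1987-1868)', '(1678- Present)']
print(wiki_matches)  # ['(1891–1892)', '(1892)']
```

The en dash in the Wikipedia text is now matched because it is listed explicitly in the character class, and (series) or (novel) are skipped because the opening parenthesis must be followed by four digits.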

If you plan to match any hyphens use [\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\s]+ instead of [\s–—-]+.
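
To find out which dash a scraped string actually contains, the standard unicodedata module can report the name of each character. The sketch below does that for the text from the question and then builds the pattern with the explicit code point list; the DASHES constant and the other variable names are assumptions made for illustration.

```python
import re
import unicodedata

# Text from the question, with the dash written as an escape so it is unambiguous.
text = ('The Adventures of Sherlock Holmes (1891\u20131892) (series), '
        '(1892) (novel), Arthur Conan Doyle\n')

# List every non-ASCII character with its code point and Unicode name.
non_ascii = [(f'U+{ord(ch):04X}', unicodedata.name(ch, 'UNKNOWN'))
             for ch in text if ord(ch) > 127]
print(non_ascii)  # [('U+2013', 'EN DASH')]

# The code points listed above, kept as regex escapes inside a raw string.
DASHES = (r'\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B'
          r'\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D')
pattern = re.compile(r'\(\d{4}(?:[' + DASHES + r'\s]+(?:\d{4}|present))?\)', re.I)

matches = pattern.findall(text)
print(matches)  # ['(1891–1892)', '(1892)']
```

So the symbol in the article is an en dash (U+2013), which is different from the hyphen-minus (U+002D) typed on a keyboard.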

Or, to match any 1+ non-word chars at that location, probably, other than ( and ), use [^\w()]+ instead: re.compile(r'\(\d{4}(?:[^\w()]+(?:\d{4}|present))?\)', re.I).
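
The looser variant accepts more than dashes, which may or may not be what you want. A small sketch with made-up sample strings (they are not from the article) shows where the boundary lies:

```python
import re

# Looser variant: any run of non-word characters other than "(" and ")"
# may separate the first year from the second year or "present".
loose = re.compile(r'\(\d{4}(?:[^\w()]+(?:\d{4}|present))?\)', re.I)

# Hypothetical samples, chosen only to illustrate the behavior.
samples = ['(1986)', '(1986–1989)', '(1986 — present)', '(1986/1989)', '(1986 to 1989)']
results = {s: bool(loose.fullmatch(s)) for s in samples}

for sample, matched in results.items():
    print(sample, matched)
```

Any punctuation, including a slash, is accepted as a separator, while a separator containing letters such as "to" is not, because letters are word characters.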

Wiktor Stribiżew