I am learning how to web scrape with Python using a Wikipedia article. I managed to get the data I needed, the tables, by using the .get_text() method on the table rows (<tr>).
I am cleaning up the data in Pandas, and one of the routines involves extracting the date a book or movie was published. The date can appear in several forms, such as (1986), (1986-1989), or (1986-present), so the pattern has to handle all of them.
Currently, I am using the code below which works on a test sentence:
import re
# get the first column (td) of row 19 from the table and get its text
test = data_collector[19].find_all('td')[0]
text = test.get_text()
# create and test the pattern (raw string, so the backslashes reach the regex engine unchanged)
pattern = re.compile(r'\(\d\d\d\d\)|\(\d\d\d\d-\d\d\d\d\)|\(\d\d\d\d-[ Ppresent]*\)')
re.findall(pattern, 'This is Agent (1857), the years were (1987-1868), which lasted from (1678- Present)')
I get the expected output on the test sentence.
['(1857)', '(1987-1868)', '(1678- Present)']
However, when I test it on a particular piece of text from the wiki article 'The Adventures of Sherlock Holmes (1891–1892) (series), (1892) (novel), Arthur Conan Doyle\n', I am able to extract (1892), but NOT (1891-1892).
text = test.get_text()
re.findall(pattern, text)
Output: ['(1892)']
Even as I type this, I can see that the hyphen I am using and the one in the text are different characters. I am fairly sure this is the issue, and I was hoping someone could tell me what this particular symbol is called and how I can actually "type" it using my keyboard.
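One way to check this suspicion is to ask Python what the character is instead of trying to type it. The sketch below is only an illustration. It assumes the dash in the scraped text is the en dash (U+2013) that usually appears in year ranges, and it hardcodes the sample string with an explicit `\u2013` escape instead of using the real `test.get_text()` result. The revised `pattern` is a hypothetical variant, not the original one, and the actual page may contain a different dash, which the first loop would reveal.

```python
import re
import unicodedata

# Sample string modelled on the article text in the question. The dash in the
# first year range is written as an escape (assumed to be U+2013, an en dash).
text = "The Adventures of Sherlock Holmes (1891\u20131892) (series), (1892) (novel), Arthur Conan Doyle\n"

# Collect every non-ASCII character together with its official Unicode name.
non_ascii = {ch: unicodedata.name(ch, "UNKNOWN") for ch in text if ord(ch) > 127}
for ch, name in non_ascii.items():
    print(f"U+{ord(ch):04X} {name}")

# Hypothetical revised pattern: a four-digit year, optionally followed by a
# hyphen-minus or an en dash and then either another year or "present".
pattern = re.compile(r"\(\d{4}(?:[-\u2013]\s*(?:\d{4}|[Pp]resent))?\)")
matches = pattern.findall(text)
print(matches)
```

Writing the character as `\u2013` inside the pattern avoids having to type it at all. If the first loop reports a different name for the scraped text, that character's code point would need to go into the character class instead.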
Thank you!