Webpage link: http://www.atlasoftheuniverse.com/stars.html
I tried pandas read_html and web scraping libraries such as bs4, but had no luck because the data on the page is not wrapped in a table tag. Please help me out!
The data sits inside a <pre> element, so pull out its text and parse it as fixed-width text:
import io

import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.atlasoftheuniverse.com/stars.html")
soup = BeautifulSoup(res.content, "html.parser")
txt = soup.find("pre").getText()
# fixed-width format: let pandas infer the column boundaries from the first 300 rows
df = pd.read_fwf(io.StringIO(txt), infer_nrows=300)
# keep only the rows whose first column is a rank number such as "1." or "10."
dfd = df[df["1"].str.match(r"^[0-9]+\.$", na=False)]
dfd.head(10)
| | 1 | 2 | 3 | 4 5 | 6 7 | 8 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|
3 | 1 | Alpha Canis Majoris | Sirius | 06 45 -16.7 | 227.2 -8.9 | A1V -1.46 | 1.43 | 379.21 | 1.58 | 9 |
4 | 2 | Alpha Carinae | Canopus | 06 24 -52.7 | 261.2 -25.3 | F0Ib -0.73 | -5.64 | 10.43 | 0.53 | 310 |
5 | 3 | Alpha Centauri | Rigil Kentaurus | 14 40 -60.8 | 315.8 -0.7 | G2V+K1V -0.29 | 4.06 | 742.12 | 1.4 | 4 |
6 | 4 | Alpha Boötis | Arcturus | 14 16 +19.2 | 15.2 +69.0 | K2III -0.05 | -0.31 | 88.85 | 0.74 | 37 |
7 | 5 | Alpha Lyrae | Vega | 18 37 +38.8 | 67.5 +19.2 | A0V 0.03 | 0.58 | 128.93 | 0.55 | 25 |
8 | 6 | Alpha Aurigae | Capella | 05 17 +46.0 | 162.6 +4.6 | G5III+G0III 0.07 | -0.49 | 77.29 | 0.89 | 42 |
9 | 7 | Beta Orionis | Rigel | 05 15 -8.2 | 209.3 -25.1 | B8Ia 0.15v | -6.72v | 4.22 | 0.81 | 770 |
10 | 8 | Alpha Canis Minoris | Procyon | 07 39 +5.2 | 213.7 +13.0 | F5IV-V 0.36 | 2.64 | 285.93 | 0.88 | 11 |
11 | 9 | Alpha Eridani | Achernar | 01 38 -57.2 | 290.7 -58.8 | B3V 0.45 | -2.77 | 22.68 | 0.57 | 144 |
12 | 10 | Alpha Orionis | Betelgeuse | 05 55 +7.4 | 199.8 -9.0 | M2Ib 0.55v | -5.04v | 7.63 | 1.64 | 430 |
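In the output above, the inferred layout merges some neighbouring fields into one column ("6 7", "8 9"). A minimal sketch of splitting them afterwards, using three rows copied from that output; the new column names (gal_lon, gal_lat, spectral_class, app_mag) are my own labels, not names from the page, and stripping the trailing "v" flag is an assumption about how you want to treat those values:

```python
import pandas as pd

# Three rows copied from the output above, keeping the merged column names.
dfd = pd.DataFrame({
    "6 7": ["227.2 -8.9", "315.8 -0.7", "209.3 -25.1"],
    "8 9": ["A1V -1.46", "G2V+K1V -0.29", "B8Ia 0.15v"],
})

# "6 7" holds two numbers separated by whitespace: split once from the left.
gal = dfd["6 7"].str.split(n=1, expand=True)
dfd["gal_lon"] = pd.to_numeric(gal[0])
dfd["gal_lat"] = pd.to_numeric(gal[1])

# "8 9" holds a text code followed by a number: split once from the right.
spec = dfd["8 9"].str.rsplit(n=1, expand=True)
dfd["spectral_class"] = spec[0]
# Some values carry a trailing "v" flag (e.g. "0.15v"); strip it before converting.
dfd["app_mag"] = pd.to_numeric(spec[1].str.rstrip("v"))

print(dfd[["gal_lon", "gal_lat", "spectral_class", "app_mag"]])
```

The same pattern should carry over to the "4 5" column, which holds three whitespace-separated tokens.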
I did it like this:
Note I have used a truncated data set here because of character limits:
import pandas as pd
import io
data = """
1. Alpha Canis Majoris Sirius 06 45 -16.7 227.2 -8.9 A1V -1.46 1.43 379.21 1.58 9
2. Alpha Carinae Canopus 06 24 -52.7 261.2 -25.3 F0Ib -0.73 -5.64 10.43 0.53 310
3. Alpha Centauri Rigil Kentaurus 14 40 -60.8 315.8 -0.7 G2V+K1V -0.29 4.06 742.12 1.40 4
4. Alpha Boötis Arcturus 14 16 +19.2 15.2 +69.0 K2III -0.05 -0.31 88.85 0.74 37
5. Alpha Lyrae Vega 18 37 +38.8 67.5 +19.2 A0V 0.03 0.58 128.93 0.55 25
6. Alpha Aurigae Capella 05 17 +46.0 162.6 +4.6 G5III+G0III 0.07 -0.49 77.29 0.89 42
7. Beta Orionis Rigel 05 15 -8.2 209.3 -25.1 B8Ia 0.15v -6.72v 4.22 0.81 770
8. Alpha Canis Minoris Procyon 07 39 +5.2 213.7 +13.0 F5IV-V 0.36 2.64 285.93 0.88 11
9. Alpha Eridani Achernar 01 38 -57.2 290.7 -58.8 B3V 0.45 -2.77 22.68 0.57 144
10. Alpha Orionis Betelgeuse 05 55 +7.4 199.8 -9.0 M2Ib 0.55v -5.04v 7.63 1.64 430
"""
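# Using pandas with an explicit column specification: each (start, end) pair
# is a half-open character range within a line. The offsets assume the
# page's original fixed-width spacing.
col_specification = [(0, 5),
(5, 31),
(31, 50),
(50, 53),
(53, 56),
(56, 63),
(63, 68),
(68, 74),
(74, 79),
(79, 94),
(94, 101),
(101, 109),
(109, 106),  # NOTE: start > end, so this range selects no characters; check it against the page layout
(106, 114),
(114, 120)]
# use a new name so the raw text in `data` is not overwritten
df = pd.read_fwf(io.StringIO(data), colspecs=col_specification)
df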
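Hand-written column specifications are easy to get wrong, so it can help to sanity-check them before passing them to read_fwf. A small sketch, where check_colspecs is a hypothetical helper of my own (not part of pandas) that flags pairs whose start is not before their end, or that begin before an earlier column has ended:

```python
def check_colspecs(colspecs):
    """Return a list of human-readable problems found in (start, end) pairs."""
    problems = []
    prev_end = 0
    for i, (start, end) in enumerate(colspecs):
        if start >= end:
            problems.append(f"spec {i} {(start, end)}: start is not before end")
        elif start < prev_end:
            problems.append(f"spec {i} {(start, end)}: overlaps an earlier column")
        prev_end = max(prev_end, end)
    return problems


# The specification from the answer above, copied as posted.
col_specification = [(0, 5), (5, 31), (31, 50), (50, 53), (53, 56),
                     (56, 63), (63, 68), (68, 74), (74, 79), (79, 94),
                     (94, 101), (101, 109), (109, 106), (106, 114),
                     (114, 120)]

problems = check_colspecs(col_specification)
for line in problems:
    print(line)
```

Overlapping ranges are not necessarily an error for read_fwf, but in a layout like this one they usually point to a typo in the offsets.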