webpage link - http://www.atlasoftheuniverse.com/stars.html

I tried using pandas read_html and web scraping libraries like bs4 but no luck as the data on the webpage is not wrapped inside a table tag. Please help me out!

Atharva Katre
  • This is known as a fixed-width file. You may find it easiest to import it into Google Sheets or Excel and then save it as a CSV. – Matt Feb 17 '21 at 20:35
  • After that, you can use the pandas approach from this answer: https://stackoverflow.com/a/23354484/5125264 – Matt Feb 17 '21 at 20:42

2 Answers

  • The data is contained in a <pre> element rather than a table.
  • Extract its text and create the DataFrame with pandas read_fwf.
  • Column inference does a reasonable job here; if exact boundaries matter, you can specify the column widths yourself.
import io

import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.atlasoftheuniverse.com/stars.html")
soup = BeautifulSoup(res.content, "html.parser")
# the star list is plain text inside the page's <pre> element
txt = soup.find("pre").get_text()

# fixed-width format: infer the column boundaries from the first 300 rows
df = pd.read_fwf(io.StringIO(txt), infer_nrows=300)

# keep only rows whose first column is a rank such as "1."
dfd = df[df["1"].str.match(r"^[0-9]+\.$", na=False)]
dfd.head(10)

    1   2                    3                4      5      6      7      8            9      10      11      12    13
3   1   Alpha Canis Majoris  Sirius           06 45  -16.7  227.2  -8.9   A1V          -1.46  1.43    379.21  1.58  9
4   2   Alpha Carinae        Canopus          06 24  -52.7  261.2  -25.3  F0Ib         -0.73  -5.64   10.43   0.53  310
5   3   Alpha Centauri       Rigil Kentaurus  14 40  -60.8  315.8  -0.7   G2V+K1V      -0.29  4.06    742.12  1.4   4
6   4   Alpha Boötis         Arcturus         14 16  +19.2  15.2   +69.0  K2III        -0.05  -0.31   88.85   0.74  37
7   5   Alpha Lyrae          Vega             18 37  +38.8  67.5   +19.2  A0V          0.03   0.58    128.93  0.55  25
8   6   Alpha Aurigae        Capella          05 17  +46.0  162.6  +4.6   G5III+G0III  0.07   -0.49   77.29   0.89  42
9   7   Beta Orionis         Rigel            05 15  -8.2   209.3  -25.1  B8Ia         0.15v  -6.72v  4.22    0.81  770
10  8   Alpha Canis Minoris  Procyon          07 39  +5.2   213.7  +13.0  F5IV-V       0.36   2.64    285.93  0.88  11
11  9   Alpha Eridani        Achernar         01 38  -57.2  290.7  -58.8  B3V          0.45   -2.77   22.68   0.57  144
12  10  Alpha Orionis        Betelgeuse       05 55  +7.4   199.8  -9.0   M2Ib         0.55v  -5.04v  7.63    1.64  430
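
If the inferred boundaries are not good enough, read_fwf also accepts explicit widths. Here is a minimal sketch, assuming a made-up three-column sample: the rows, the widths and the column names (rank, star, vis_mag) are illustrative only and were not measured from the real page.

```python
import io

import pandas as pd

# Hypothetical sample: a 5-character rank, a 15-character name,
# and a magnitude field of up to 6 characters.
sample = (
    "  1. Sirius         -1.46\n"
    "  2. Canopus        -0.73\n"
    " 10. Betelgeuse      0.55v\n"
)

# widths gives the character count of each consecutive field
widths = [5, 15, 6]

# header=None stops the first row being used as column names;
# dtype=str keeps values such as "1." and "0.55v" exactly as written
df = pd.read_fwf(
    io.StringIO(sample),
    widths=widths,
    header=None,
    names=["rank", "star", "vis_mag"],
    dtype=str,
)
print(df)
```

Reading everything as text first makes it easier to deal with flags such as the trailing v before converting any column to numbers.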
Rob Raymond

I did it like this:

  • copy the table (spaces and everything!)
  • use a text editor to find the character offsets where each column starts and ends
  • read it in using pandas

Note I have used a truncated data set here because of character limits:

import pandas as pd
import io

data = """
  1. Alpha Canis Majoris       Sirius            06 45 -16.7  227.2  -8.9  A1V          -1.46   1.43  379.21 1.58     9
  2. Alpha Carinae             Canopus           06 24 -52.7  261.2 -25.3  F0Ib         -0.73  -5.64   10.43 0.53   310
  3. Alpha Centauri            Rigil Kentaurus   14 40 -60.8  315.8  -0.7  G2V+K1V      -0.29   4.06  742.12 1.40     4
  4. Alpha Boötis              Arcturus          14 16 +19.2   15.2 +69.0  K2III        -0.05  -0.31   88.85 0.74    37
  5. Alpha Lyrae               Vega              18 37 +38.8   67.5 +19.2  A0V           0.03   0.58  128.93 0.55    25
  6. Alpha Aurigae             Capella           05 17 +46.0  162.6  +4.6  G5III+G0III   0.07  -0.49   77.29 0.89    42
  7. Beta Orionis              Rigel             05 15  -8.2  209.3 -25.1  B8Ia          0.15v -6.72v   4.22 0.81   770
  8. Alpha Canis Minoris       Procyon           07 39  +5.2  213.7 +13.0  F5IV-V        0.36   2.64  285.93 0.88    11
  9. Alpha Eridani             Achernar          01 38 -57.2  290.7 -58.8  B3V           0.45  -2.77   22.68 0.57   144
 10. Alpha Orionis             Betelgeuse        05 55  +7.4  199.8  -9.0  M2Ib          0.55v -5.04v   7.63 1.64   430
"""

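# Using pandas with a column specification: (start, end) character offsets,
# end exclusive. These boundaries were measured against the sample rows above;
# adjust them if your copy of the table is indented differently.
col_specification = [
    (0, 5),      # e.g. "1."
    (5, 31),     # e.g. "Alpha Canis Majoris"
    (31, 48),    # e.g. "Sirius"
    (48, 52),    # e.g. "06"
    (52, 55),    # e.g. "45"
    (55, 61),    # e.g. "-16.7"
    (61, 68),    # e.g. "227.2"
    (68, 74),    # e.g. "-8.9"
    (74, 87),    # e.g. "A1V"
    (87, 95),    # e.g. "-1.46"
    (95, 102),   # e.g. "1.43"
    (102, 109),  # e.g. "379.21"
    (109, 114),  # e.g. "1.58"
    (114, 120),  # e.g. "9"
]

# header=None stops the first star from being consumed as the header row
df = pd.read_fwf(io.StringIO(data), colspecs=col_specification, header=None)

df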
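
The column boundaries from the text-editor step can also be estimated in code. Here is a rough sketch, assuming that fields are separated by at least one character position that is blank in every row. The helper name occupied_spans and the three sample rows are hypothetical, and the approach mis-splits a field if an internal space happens to line up in all rows.

```python
import io

import pandas as pd


def occupied_spans(lines):
    """Return (start, end) spans of positions that are non-blank in at least one line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # a position counts as used if any row has a non-space character there
    used = [any(row[i] != " " for row in padded) for i in range(width)]

    spans, start = [], None
    for i, flag in enumerate(used):
        if flag and start is None:
            start = i                    # a field begins
        elif not flag and start is not None:
            spans.append((start, i))     # a field ends at a fully blank position
            start = None
    if start is not None:
        spans.append((start, width))     # last field runs to the end of the line
    return spans


# Hypothetical, shortened rows in the same style as the table
sample = [
    "  1. Alpha Canis Majoris  Sirius  -1.46",
    "  7. Beta Orionis         Rigel    0.15v",
    "  5. Alpha Lyrae          Vega     0.03",
]

spans = occupied_spans(sample)
df = pd.read_fwf(io.StringIO("\n".join(sample)), colspecs=spans,
                 header=None, dtype=str)
print(spans)
print(df)
```

Check the detected spans against a few rows by eye before trusting them, since more rows generally make the blank-position test more reliable.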
Matt