
I need to scan through MANY blocks of text, which I think could be done with a loop or with a single find_all call. Here is a small sample of the text I'm dealing with.

<tr role="row" class="even">
<td>

<td style="padding:0px; width:200px; height:10px;"><svg height="37" width="180px" id="task-run" style="display: block;">

</td>

The '' represents nothing, as in this case: y="3"></text>

I have code to append everything to a large list and then write that to a data frame.

import pandas as pd

masterlist = []
# etc.
masterlist.append(cols)
# etc.
df = pd.DataFrame(masterlist)

I just can't figure out how to do all the parsing.

ASH

2 Answers


This looks like a good case for Beautiful Soup, which is designed to extract text and attributes from HTML documents that may or may not be well-formed.
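As a rough sketch of what that could look like: the HTML below is a made-up sample modelled on the fragment in the question (the `<a href>` cell, the `<g transform>` wrappers and the `<text>` contents are assumptions, since the original sample is incomplete). It loops over the rows, collects the text of every `<text>` element, and keeps an empty string for the empty ones. Note that `find` returns `None` when nothing matches, so the result is checked before calling `get_text`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the real page's tags and attributes may differ.
html = """
<table>
<tr role="row" class="even">
<td><a href="/task/1">task-a</a></td>
<td style="padding:0px; width:200px; height:10px;">
<svg height="37" width="180px" id="task-run" style="display: block;">
<g transform="translate(10,10)"><text y="3">12</text></g>
<g transform="translate(30,10)"><text y="3">14</text></g>
<g transform="translate(50,10)"><text y="3"></text></g>
<g transform="translate(70,10)"><text y="3">20</text></g>
</svg>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

masterlist = []
for row in soup.find_all("tr", {"role": "row"}):
    # find() returns None when there is no match, so guard before get_text()
    link = row.find("a", href=True)
    name = link.get_text(strip=True) if link is not None else ""
    # One entry per <text> element; empty elements become ''
    counts = [t.get_text(strip=True) for t in row.find_all("text")]
    masterlist.append([name] + counts)

print(masterlist)
```

The resulting `masterlist` is a list of lists, so it can be passed to `pd.DataFrame(masterlist)` the same way as in the question.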

Tom
  • Ok, so I tried this: foundhref = soup.find('td',{'href':'a'}).get_text() I'm getting this error: AttributeError: 'NoneType' object has no attribute 'get_text' Also, I tried this: foundtext = soup.find('td',{'transform':'text'}).get_text() I get the same exact error. What am I doing wrong here? – ASH Jan 04 '19 at 20:38
  • I just tried this: results = pd.read_html(html_page) That actually does a really nice job, for the most part, but it joins all the numbers together if they are in the same column. So, in my example above, I'm seeing this: 12141620 ...and... 39 It should be like this: 12 14 16 '' 20 '' 3 '' 9 Those represent distinct counts of tasks. – ASH Jan 04 '19 at 20:47

1) If all the information you need is in a well-formed table inside the HTML, I recommend trying pandas.read_html. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html

2) The second choice is to try Beautiful Soup, as @Tom already mentioned.

3) If you are facing the challenge of a large file, try the approach from "Lazy Method for Reading Big File in Python?" and then parse it line by line.
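For option 3, a minimal sketch of lazy, line-by-line reading could look like the following. The helper name `iter_matching_lines` and the `"<tr"` marker are made up for illustration, and `io.StringIO` stands in for a real file opened with `open(...)`. It assumes each tag of interest starts on its own line, which may not hold for every page.

```python
import io


def iter_matching_lines(file_obj, marker):
    """Lazily yield stripped lines that contain the given marker."""
    # Iterating a file object reads one line at a time, so the whole
    # file is never held in memory at once.
    for line in file_obj:
        if marker in line:
            yield line.strip()


# Stand-in for a large file on disk (hypothetical content).
sample = io.StringIO(
    '<tr role="row" class="even">\n'
    "<td>\n"
    "</td>\n"
    '<tr role="row" class="odd">\n'
)

rows = list(iter_matching_lines(sample, "<tr"))
print(rows)
```

Because the function is a generator, the lines can also be consumed one by one in a `for` loop instead of being collected into a list.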

xudesheng