How to efficiently loop through a huge chunk of text and parse several items?

Question

I have to scan through MANY blocks of text, which I think could be done with a loop or a single find_all call. Here is a small sample of the text that I'm dealing with.

<tr role="row" class="even">
<td>

<td style="padding:0px; width:200px; height:10px;"><svg height="37" width="180px" id="task-run" style="display: block;">

</td>

The '' represents nothing, as in this case: y="3"></text>

I have code to append everything to a large list and then write that to a data frame.

import pandas as pd

masterlist = []
# etc.
masterlist.append(cols)
# etc.
df = pd.DataFrame(masterlist)

I just can't figure out how to do all the parsing.

If you search on the phrase "Python parse HTML", you’ll find resources that can explain it much better than we can in an answer here. — Prune, Jan 04 '19 at 19:32

score 2 · Answer 1 · answered Jan 04 '19 at 19:33


This looks like a good case for Beautiful Soup, which is designed to extract text and attributes from HTML documents that may or may not be well-formed.
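As a rough illustration of that suggestion, here is a minimal sketch that walks each table row and collects the link text, the href, and one value per `<text>` element. The HTML below is hypothetical: the `<a>` and `<text>` elements and their contents are assumptions loosely modeled on the sample in the question, so the tag names will likely need adjusting for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely modeled on the sample in the question.
html = """
<table>
  <tr role="row" class="even">
    <td><a href="/task/1">task_one</a></td>
    <td><svg id="task-run"><text y="3">12</text><text y="3"></text><text y="3">20</text></svg></td>
  </tr>
  <tr role="row" class="odd">
    <td><a href="/task/2">task_two</a></td>
    <td><svg id="task-run"><text y="3"></text><text y="3">9</text></svg></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

masterlist = []
for row in soup.find_all("tr", attrs={"role": "row"}):
    link = row.find("a")
    # find() returns None when nothing matches, so guard before using it
    name = link.get_text(strip=True) if link is not None else ""
    href = link.get("href", "") if link is not None else ""
    # One list entry per <text> element, so an empty element stays an empty string
    counts = [t.get_text(strip=True) for t in row.find_all("text")]
    masterlist.append([name, href] + counts)
```

The resulting `masterlist` can then be passed to `pd.DataFrame(masterlist)` as in the question; rows of different lengths are padded with missing values.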


Tom


OK, so I tried this: foundhref = soup.find('td',{'href':'a'}).get_text() and I'm getting this error: AttributeError: 'NoneType' object has no attribute 'get_text'. I also tried this: foundtext = soup.find('td',{'transform':'text'}).get_text() and I get the exact same error. What am I doing wrong here? – ASH Jan 04 '19 at 20:38
I just tried this: results = pd.read_html(html_page). That actually does a really nice job, for the most part, but it joins all the numbers together if they are in the same column. So, in my example above, I'm seeing this: 12141620 ...and... 39. It should be like this: 12 14 16 '' 20 '' 3 '' 9. Those represent distinct counts of tasks. – ASH Jan 04 '19 at 20:47
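A note on the two problems in these comments, with a small sketch. In Beautiful Soup, `soup.find('td', {'href': 'a'})` looks for a `<td>` whose own `href` attribute equals `"a"`; when nothing matches, `find` returns `None`, which is where the `AttributeError` comes from. Separately, calling `get_text()` on a whole cell concatenates the text of every child element, which would explain joined values like 12141620. The fragment below is hypothetical (the `<a>` and `<text>` elements are assumptions about the real page).

```python
from bs4 import BeautifulSoup

# Hypothetical one-row fragment; the real page structure is an assumption.
html = (
    '<table><tr><td><a href="/task/1">task_one</a></td>'
    "<td><svg><text>12</text><text>14</text><text>16</text>"
    "<text></text><text>20</text></svg></td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# Searches for a <td> with href="a"; no such tag exists, so this is None.
missing = soup.find("td", {"href": "a"})

# Search for the <a> tag itself, and guard against None before using it.
link = soup.find("a", href=True)
href = link["href"] if link is not None else None

cell = soup.find("svg")
# get_text() on the parent joins all child text into one string.
joined = cell.get_text()
# Reading each <text> element separately keeps the values (and blanks) apart.
separated = [t.get_text() for t in cell.find_all("text")]
```

Here `joined` is `"12141620"` while `separated` is `["12", "14", "16", "", "20"]`, so iterating over the individual elements is one way to keep empty entries distinct.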

score 1 · Accepted Answer · answered Jan 04 '19 at 19:36

1) If all the information you need is in a well-formed table inside the HTML, I recommend trying pandas.read_html. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html

2) The second choice is to try Beautiful Soup, as @Tom already mentioned.

3) If the challenge is the size of the file, see "Lazy Method for Reading Big File in Python?" and then parse it line by line.
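For option 3, a minimal sketch of line-by-line parsing using only the standard library. Iterating over a file object yields one line at a time, so the whole file is never held in memory. The helper name `iter_matches`, the regular expression, and the demo file contents are all illustrative assumptions; this only works when each element of interest fits on a single line, and a regex is a fragile substitute for a real HTML parser.

```python
import os
import re
import tempfile


def iter_matches(path, pattern):
    """Yield the first capture group of every match, reading one line at a time."""
    regex = re.compile(pattern)
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # file objects produce lines lazily
            for match in regex.finditer(line):
                yield match.group(1)


# Demo: write a small hypothetical file, then scan it lazily.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('<text y="3">12</text>\n<text y="3"></text>\n<text y="3">20</text>\n')
    demo_path = tmp.name

values = list(iter_matches(demo_path, r"<text[^>]*>(.*?)</text>"))
os.remove(demo_path)
```

Because `iter_matches` is a generator, results can also be consumed in a loop and appended to a list in batches rather than collected all at once.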
