I'm trying to read a webpage and write the formatted text to a text file. The code below prints to the shell with the formatting intact, but when I write it to the file everything ends up on one line (with the line breaks showing as a literal \n in the text).

I have tried a variety of things, such as not converting it to a string and using prettify from Beautiful Soup, but none of them produces a text file with the formatting. I presume I am missing something fairly basic. Any help or guidance would be much appreciated.

# Import 
from urllib.request import urlopen
from bs4 import BeautifulSoup

#The actual code


URL = "https://simple.wikipedia.org/wiki/castle" #The target URL
html = urlopen(URL).read()  # Reads the url to variable html
soup = BeautifulSoup(html, "lxml") # Uses BS4 to create the soup using the lxml parser
soup = soup.get_text() # Extracts the text
print(soup) # Prints to python 3.5.1 shell, formatted as I would expect


# Now writing what I have extracted to a text file
file = open("TextOutput.txt", 'w') # Creates the file and opens as write (w)
file.writelines(str(soup.encode('UTF-8'))) # Tried file.write/lines(soup), conversion to string and encoding as UTF-8 needed to avoid errors
file.close()

A sample of the file output looks like:

b'\n\n\nCastle - Simple English Wikipedia, the free encyclopedia\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );\n(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Castle","wgTitle":"Castle","wgCurRevisionId":5333370,"wgRevisionId":5333370,"wgArticleId":15933,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":[""],"wgCategories":["Castles"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Castle","wgRelevantArticleId":15933,"wgRequestId":"VxUR5gpAIDAAAEXY6FMAAACC","wgIsProbablyEditable":true,"wgRestrictionEdit":[],"wgRestrictionMove":[],"wgWikiEditorEnabledModules":{"toolbar":true,"dialogs":true,"preview":false,"publish":false},"wgBetaFeaturesFeatures":[],"wgMediaViewerOnClick":true,"wgMediaViewerEnabledByDefault":true,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","usePageImages":true,"usePageDescriptions":true},"wgPreferredVariant":"en","wgRelatedArticles":null,"wgRelatedArticlesUseCirrusSearch":true,"wgRelatedArticlesOnlyUseCirrusSearch":false,"wgULSAcceptLanguageList":[],"wgULSCurrentAutonym":"English","wgCategoryTreePageCategoryOptions":"{\"mode\":0,\"hideprefix\":20,\"showcount\":true,\"namespaces\":false}","wgNoticeProject":"wikipedia","wgCentralNoticeCategoriesUsingLegacy":["Fundraising","fundraising"],"wgCentralAuthMobileDomain":false,"wgWikibaseItemId":"Q23413","wgVisualEditorToolbarScrollOffset":0});mw.loader.implement("user.options",function(
$,jQuery){mw.user.options.set({"variant":"en"});});mw.loader.implement("user.tokens",function ( $, jQuery ) {\nmw.user.tokens.set({"editToken":"+\\","patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});/@nomin*/;\n\n});mw.loader.load(["mw.MediaWikiPlayer.loader","mw.PopUpMediaTransform","mw.TMHGalleryHook.js","mediawiki.page.startup","mediawiki.legacy.wikibits","ext.centralauth.centralautologin","mmv.head","ext.visualEditor.desktopArticleTarget.init","ext.uls.init","ext.uls.interface","ext.centralNotice.bannerController","skins.vector.js"]);});\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCastle\n\nFrom Wikipedia, the free encyclopedia\n\n\n\t\t\t\t\tJump to:\t\t\t\t\tnavigation, \t\t\t\t\tsearch\n\n\n\n\n\nBodiam Castle in England surrounded by a water-filled moat.\n\n\n\n\n\n\nLichtenstein Castle\n\n\nA castle (from the Latin word castellum) is a fortified structure made in Europe and the Middle East during the Middle Ages. People argue about what the word castle means. However, it usually means a private structure of a lord or noble. This is different from a fortress, which is not a home, and from a fortified town, which was a public defence. For about 900\xc2\xa0years that castles were built they had many different shapes and different details.\nCastles began in Europe in the 9th and 10th centuries. They controlled the places surrounding them, and could both help in attacking and defending. Weapons could be fired from castles, or people could be protected from enemies in castles. However, castles were also a symbol of power. They could be used to control the people and roads around it.\nMany castles were built with earth and wood at first often using manual labour, and then had their defences replaced by stone instead. Early castles often used nature for protection, and did not have towers. By the late 12th and early 13th centuries, though, castles became longer and more complex.\n

Tom

1 Answer

file.writelines(str(soup.encode('UTF-8'))) is kind of insane; it's:

  1. Encoding text (str) to binary (bytes)
  2. Getting the text representation of that by wrapping in str (so it's what you'd type to recreate the binary bytes, but it's not the raw binary)
  3. Writing that result one character at a time (writelines iterates what you give it, and strs iterate by character)

Step #3 is silly and inefficient, but mostly harmless. Step #1 would be fine if you then wrote the raw binary to a file opened for binary write and actually wrote the bytes object. But #1 and #2 together mean that stuff like a new line gets converted to a literal \n in the output, rather than actually breaking a line. Non-ASCII stuff like é is output as \xc3\xa9, and the whole thing is wrapped in b'' (or b"").
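
To see the three steps in isolation, here is a minimal sketch. The `text` value is a made-up stand-in for the scraped page text, not anything from the question:

```python
# Hypothetical sample: one real newline and one non-ASCII character.
text = "café\ncastle"

encoded = text.encode("utf-8")  # step 1: str -> bytes
as_repr = str(encoded)          # step 2: text *representation* of the bytes

print(as_repr)                  # prints: b'caf\xc3\xa9\ncastle'

# Step 3: writelines iterates its argument, and a str iterates by character.
print(list(as_repr)[:4])        # prints: ['b', "'", 'c', 'a']

# No real line break survives; only a backslash followed by the letter n.
print("\n" in as_repr)          # prints: False
```

This matches the file output in the question: the b'' wrapper, the literal \n, and the \x escapes for non-ASCII characters.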

You want something like:

# open with UTF-8 encoding (in case your system defaults to something else)
with open("TextOutput.txt", 'w', encoding='utf-8') as file:
    # Get the text and write it as a single block; soup must still be the
    # BeautifulSoup object here, not the str the question rebinds it to
    file.write(soup.get_text())
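
For comparison, the binary route mentioned above (encode yourself, then write the raw bytes to a file opened for binary write) might look roughly like this. It's a sketch, not a recommendation over the text-mode version; `text` is a hypothetical stand-in for the result of `soup.get_text()`, and the temp directory is only there to keep the example self-contained:

```python
import os
import tempfile

# Hypothetical stand-in for the string returned by soup.get_text().
text = "café\ncastle"

# Write to a throwaway directory so the example doesn't touch real files.
path = os.path.join(tempfile.mkdtemp(), "TextOutput.txt")

# Open for binary write and write the bytes object itself, not str() of it.
with open(path, "wb") as file:
    file.write(text.encode("utf-8"))

# Reading it back as UTF-8 text gives the original string, line break included.
with open(path, encoding="utf-8") as file:
    round_tripped = file.read()

print(round_tripped == text)
```

One difference to be aware of: binary mode does no newline translation, so on Windows the file contains bare \n rather than \r\n.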
ShadowRanger
  • It did feel insane. That produces what I expected, thank you! I think I need to learn more (I had something like file.open and file.write to begin with, then assumed I needed the different syntax / UTF-8 encoding). – Tom Apr 18 '16 at 18:18
  • @user6217257: I'm guessing you're on Windows? On Windows, the default encoding is usually not what you want; it's a locale specific ASCII-superset (for English and many Western European locales, [`cp1252`](https://en.wikipedia.org/wiki/CP1252)). Problem is, the web is largely UTF-8, and in this case, the page you're scraping has a `↑` character, which does not appear in CP1252. Without specifying an encoding that can handle that (for maximum compatibility and tool friendliness, you usually want UTF-8, or on Windows perhaps UTF-16), you'd get errors when it tried to encode as CP1252. – ShadowRanger Apr 18 '16 at 20:58
  • I am on Windows, that's interesting information. Is that also why running the above (modified) script through a double click in Windows fails with some kind of traceback error, but not when the print(soup) command is removed? – Tom Apr 18 '16 at 22:48
  • @user6217257: A traceback describes where an exception occurred; it wouldn't be a "traceback error". Running via double click could error out for many reasons, the most likely being an issue with multiple Python versions installed (usually one Py2, one Py3), and the script is written for one version, while the `.py`/`.pyw` extension is registered to the other. I wouldn't expect it to be related to Windows encoding, but the Py3-isms in this code (`urllib.request`, the `encoding` argument to `open`) would cause errors if the handler for the Python extensions was a Python 2 install. – ShadowRanger Apr 18 '16 at 22:56
  • This would be an example: Traceback (most recent call last): File "", line 15, in html = urlopen(URL).read() File "C:\Users\------\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen return opener.open(url, data, timeout) – Tom Apr 18 '16 at 23:03
  • That should have an exception associated with it. I see the traceback, but it should also be saying what type of exception was raised, and (often) providing some sort of message. – ShadowRanger Apr 19 '16 at 00:12
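
The CP1252 point from the comments above can be checked directly. This is a small sketch using only the `↑` character that the comment mentions; the variable names are illustrative:

```python
arrow = "\u2191"  # the ↑ character mentioned in the comments

# CP1252 has no mapping for this character, so encoding raises.
try:
    arrow.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can encode any Unicode character.
utf8_bytes = arrow.encode("utf-8")

print(cp1252_ok)   # prints: False
print(utf8_bytes)  # prints: b'\xe2\x86\x91'
```

This is the kind of error that passing `encoding='utf-8'` to `open` avoids when the system default is CP1252.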