I often find myself working through large sets of text, extracting terms or otherwise cleaning things up so that I can reuse a string as a filename or similar.
In a recent task, I grabbed a few hundred PDF files from a website and wanted to use the article title as the filename, to help my colleagues when checking in the files.
I can get the title from the HTML, but it often contains characters that are illegal in Windows filenames (e.g. :, ", >, etc.), which means I have to do some substitutions before I can use the title.
As a result, I started using this line of code:
fname = art_number+" "+content_title.replace(":", " -").replace("–", "-").replace(u'\xae', "-").replace("\"", "").replace("?","").replace("<i>", "").replace("</i>", "").replace("/", " ").replace("<sup>-< sup>", "-")
As you can see, that is heaps of str.replace calls, which is not very readable or manageable.
Each replacement is generally considered manually; I wouldn't want to throw them at a code book, as there are usually some nuances per set of content that I want to find and check.
What would be your approach to this?
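One possible direction, offered as a minimal sketch rather than a definitive answer: keep the substitutions in an ordered list of pairs and apply them in a loop. This assumes Python 3, and the names REPLACEMENTS, clean_title and make_fname are hypothetical, chosen only for illustration. The pairs are copied from the chain above and kept in the same order, because later rules see the output of earlier ones.

```python
# Ordered (old, new) pairs, copied from the original chain of str.replace calls.
# Order matters: each rule is applied to the result of the previous one.
REPLACEMENTS = [
    (":", " -"),
    ("\u2013", "-"),         # en dash
    ("\xae", "-"),           # registered-trademark sign
    ('"', ""),
    ("?", ""),
    ("<i>", ""),
    ("</i>", ""),
    ("/", " "),
    ("<sup>-< sup>", "-"),   # by this point "</sup>" has already become "< sup>"
]


def clean_title(title, replacements=REPLACEMENTS):
    """Apply each (old, new) substitution to the title, in list order."""
    for old, new in replacements:
        title = title.replace(old, new)
    return title


def make_fname(art_number, content_title):
    """Build the filename stem from the article number and the cleaned title."""
    return art_number + " " + clean_title(content_title)
```

Each set of content can then carry its own list (or extend a shared one), so the per-set nuances stay visible and easy to review one line at a time.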