How can I use a script to change some text in an epub file?

Question

I've recently bought a Nook Simple Touch. I use Calibre to manage my ebooks, and to transfer them to the Nook.

Due to a non-standard implementation of the epub specification on B&N's part, the Nook ST does not display cover images when they are brought over from many sources. The issue is described here: http://john.nachtimwald.com/2011/08/21/nook-covers-not-showing-up/. Basically, the Nook ST requires the attributes of the cover meta tag to appear in this order:

<meta name="cover" content="id5" />

But many epub creators have them around this way:

<meta content="id5" name="cover" />

And the Nook ST then ignores the cover image entirely.

I have been manually editing the content.opf file in my epub files. So far they have all had the image meta, but it was always around the "wrong" way (wrong, according to the Nook).

Recently I've been playing around with REGEX, mostly to try to automate the cleaning up of epubs converted by Calibre from PDF files. I'm still very much a beginner with REGEX.

What I was wondering is how I might go about automating the swapping of the 'name' and 'content' attributes? I figure it can be done with a combination of REGEX and scripting. I know some of the other epub related scripts I have are in Python. I am on a Mac (OS X) and they seem to run fine. AppleScript might be a good option too, although I'd like something that people can run on any platform, as I am sure other folk will find this useful.

Here are the steps I foresee:

~ Extract epub file

~ Use REGEX to look for:

<meta content="???" name="cover">

~ If found, use REGEX to change it around to:

<meta name="cover" content="???">

~ Zip extracted files back into an epub using the correct zipping process.
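
As a rough, platform-independent illustration, the steps above could be sketched in Python with the standard zipfile and re modules. This is only a sketch under assumptions: the function name fix_epub_cover is hypothetical, the pattern assumes the tag is written with double quotes and content directly before name, and the book is copied entry by entry into a new file rather than extracted to disk first.

```python
import re
import zipfile

# Wrong-way-around cover tag; group 1 captures the content value.
# A bytes pattern is used so the .opf file never has to be decoded.
COVER_RE = re.compile(rb'<meta\s+content="([^"]*)"\s+name="cover"\s*/?>')


def fix_epub_cover(src_path, dst_path):
    """Copy src_path to dst_path, swapping the cover attributes in .opf files."""
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w") as dst:
        names = src.namelist()
        # The epub container wants mimetype as the first entry, uncompressed.
        if "mimetype" in names:
            dst.writestr("mimetype", src.read("mimetype"),
                         compress_type=zipfile.ZIP_STORED)
        for name in names:
            # Skip mimetype (already written) and bare directory entries.
            if name == "mimetype" or name.endswith("/"):
                continue
            data = src.read(name)
            if name.lower().endswith(".opf"):
                data = COVER_RE.sub(rb'<meta name="cover" content="\1" />', data)
            dst.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

A call such as fix_epub_cover("book.epub", "book-fixed.epub") leaves the original file untouched, so the result can be checked on the device before the old copy is deleted.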

I found info here: http://www.mobileread.com/forums/showthread.php?t=55681 explaining how to zip up an epub file correctly. Basically it requires these two commands:

zip -X0 "full path to new epub file" mimetype
zip -rDX9 "full path to new epub file" * -x "*.DS_Store" -x mimetype
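
For anyone without the zip command (on Windows, say), roughly the same two steps can be approximated with Python's standard zipfile module. This is a minimal sketch, not a drop-in replacement: the function name repack_epub is hypothetical, and it uses zipfile's default deflate level rather than the -9 setting above.

```python
import os
import zipfile


def repack_epub(src_dir, epub_path):
    """Zip the expanded epub in src_dir into epub_path, mimetype first."""
    with zipfile.ZipFile(epub_path, "w") as zf:
        # Counterpart of `zip -X0 ... mimetype`: stored, not compressed.
        zf.write(os.path.join(src_dir, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        # Counterpart of `zip -rDX9 ... *`: every other file, deflated,
        # with no directory entries.
        for folder, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(folder, name)
                arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
                if arcname == "mimetype" or name == ".DS_Store":
                    continue
                zf.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

The output path should sit outside src_dir; otherwise the walk would pick up the half-written epub itself.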

I'd like to post the resulting script online wherever it might be found and made use of (until B&N resolve their poor epub/XML implementation). Posting it on the Calibre forums and the mobileread forums comes to mind (since they are two I am familiar with, and have seen people discussing manual fixes to this issue).

Is there someone who can walk me through how to create such a script? Ideally, I'd love to actually know how to create the script, so that over time I can start to figure out these sorts of things myself (especially the REGEX part, as I see more and more how useful it is).

Thank you.

Jonathan

@Haldean: ADDED to illustrate what I mean in a comment to Haldean regarding making his script work through all content.opf files in all subfolders recursively.

> My_expanded_epubs
- -> epub_one_expanded
- - - -> content.opf
- -> epub_two_expanded
- - - -> content.opf
- -> epub_three_expanded
- - - -> content.opf
etc.
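
A folder layout like the one above could be processed recursively with os.walk from the Python standard library. The sketch below is only an illustration: the function name fix_opf_files is hypothetical, and the pattern assumes the tag uses double quotes with content directly before name. Files are read and written as bytes so line endings and encoding are left alone.

```python
import os
import re

# Wrong-way-around cover tag; group 1 captures the content value.
COVER_RE = re.compile(rb'<meta\s+content="([^"]*)"\s+name="cover"\s*/?>')


def fix_opf_files(root):
    """Rewrite the cover tag in every content.opf below root.

    Returns the list of files that were actually changed.
    """
    changed = []
    for folder, _dirs, files in os.walk(root):
        if "content.opf" not in files:
            continue
        path = os.path.join(folder, "content.opf")
        with open(path, "rb") as f:
            data = f.read()
        fixed = COVER_RE.sub(rb'<meta name="cover" content="\1" />', data)
        if fixed != data:
            with open(path, "wb") as f:
                f.write(fixed)
            changed.append(path)
    return changed
```

Calling fix_opf_files("My_expanded_epubs") would then visit each expanded epub in turn and leave files that are already in the right order untouched.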

Have you got a regex that can correctly identify the meta tags you are looking for? — Marcin, Feb 17 '12 at 16:56
Also, you should complain to B&N. There is no excuse for having an XML processor that requires attributes to be in any particular order. — Marcin, Feb 17 '12 at 16:59
Thanks Marcin. I am sending a message to B&N now regarding this issue. — inspirednz, Feb 17 '12 at 17:20
I've searched on Stackoverflow and on Google for "unpack epub file python" but turned up nothing useful. Am I barking up the wrong tree with that idea? I found lots of stuff about removing DRM from epubs with python, but not for simply unpacking an epub (and repacking it). I know I can most likely use AppleScript to piece the various steps together, but really want this to be platform independent. — inspirednz, Feb 18 '12 at 01:40

score 2 · Answer 1 · answered Feb 17 '12 at 18:08

If you're willing to go with a shell script (which I think is a better option) then you can use a sed one-liner:

sed 's/<meta content="\(.*\)" name="cover" \/>/<meta name="cover" content="\1" \/>/' [your-file]

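That should rewrite every cover meta tag in which the content attribute comes first so that the name attribute comes first instead. Note that sed prints the result to standard output rather than changing the file. An equivalent Python translation of that would be: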

import re
import sys

# The wrong-way-around cover tag; group 1 captures the content value.
pattern = re.compile(r'<meta content="(.*)" name="cover" />')

with open(sys.argv[1]) as f:
  for line in f:
    # Like sed's s///, rewrite the tag wherever it appears in the line and
    # pass every other line through unchanged (no extra newline is added).
    sys.stdout.write(pattern.sub(r'<meta name="cover" content="\1" />', line))

Haldean Brown

Note that your regexes are not robust to any variations in spacing. – Marcin Feb 17 '12 at 18:29
Thanks Haldean. Python looks like a relatively simple language to get my head around. Reminds me of Basic.. which I played around with perhaps 25 years ago. I'll try this out, perhaps with the regex Marcin provided, for the reasons he mentioned. – inspirednz Feb 18 '12 at 00:42
@Haldean: Do you happen to know how to use Python to unpack and repack the epub file? I can't seem to turn up that info anywhere. It needs to do it in the way specified in my original post. – inspirednz Feb 18 '12 at 02:00
@Haldean Okay, I've been messing around with Python but have not managed to get my head (or code) around how to implement the slightly different regular expression Marcin suggested. I'd also appreciate knowing how to make this script check the content.opf file, within all the folders (expanded epub files) recursively. I tried to place an example here but the comments don't recognise line breaks, so I've added it to the end of my original post. – inspirednz Feb 22 '12 at 02:36

score 1 · Answer 2 · answered Feb 17 '12 at 18:27

I would suggest that you use sed to work with the unpacked file, and do something like:

sed -e 's/<[ ]*meta[ ]*content[ ]*=[ ]*"\(.*\)"[ ]*name[ ]*=[ ]*"cover"[ ]*\/*[ ]*>/<meta name="cover" content="\1" \/>/g'

Note that this version will cope with extra or missing space, or slashes.
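
A Python counterpart of this more tolerant pattern might look like the sketch below. It is not an exact translation: the function name swap_cover_attributes is hypothetical, \s also matches tabs and newlines, at least one whitespace character is required where XML itself requires it, and [^"]* replaces .* so the captured value cannot run past its closing quote.

```python
import re

# \s* tolerates extra or missing whitespace; /? makes the slash optional.
COVER_RE = re.compile(
    r'<\s*meta\s+content\s*=\s*"([^"]*)"\s+name\s*=\s*"cover"\s*/?\s*>'
)


def swap_cover_attributes(text):
    """Return text with every content-first cover tag rewritten name-first."""
    return COVER_RE.sub(r'<meta name="cover" content="\1" />', text)
```

Tags that are already in the name-first order do not match the pattern, so running the function twice on the same text is harmless.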

You may like to subsequently use an XML processor (I would suggest a Python script using lxml) to verify that your edit has not, for any reason, created invalid markup.
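
If lxml is not installed, a basic well-formedness check is also possible with the standard library's xml.etree.ElementTree, swapped in here for lxml. The helper name is_well_formed is hypothetical, and the check only confirms that the markup parses; it does not validate the file against the OPF schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

Because the parser is used only to read the edited file and never to write it back, it cannot introduce changes of its own.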

Using any kind of XML tool to perform the manipulation is unattractive in the extreme, because a fully compliant XML processor may make other changes which are completely legal, and also trigger other bugs in your nook. Using sed allows you to edit only the parts of the document you want to.

Marcin

Thanks for being so thorough in thinking this through. I'll try out your suggestion shortly (been offline all afternoon, hence the delay in responding). If the tag manipulation is all good, the other step I'd like to resolve is how to take care of the other steps. For instance, unpack the epub, run the regex check, repack the epub. If someone has a lot of epub files they wish to sideload into their Nook ST, much of the time would be spent just unpacking and repacking. Would be nice to just run a script on a file, recursively processing all the epub files in there. – inspirednz Feb 18 '12 at 00:43
I can bulk validate the epubs themselves fairly easily with Calibre (from what I recall). Would be a good idea. Thanks. – inspirednz Feb 18 '12 at 00:43
@inspiredlife: If you're having issues with unpacking etc, then I suggest you post a separate question. – Marcin Feb 19 '12 at 13:17

score 0 · Answer 3 · edited May 23 '17 at 12:20

Personally I wouldn't do this with regex (it's the wrong tool). Could you use XSLT?

EDIT:

Here is a demo. http://www.xsltcake.com/slices/nvLRJ6

There are a number of XSLT libraries for Python.

EDIT:

If you insist on doing it with regex you'll want a pattern like this:
<meta content="([^"]+)" name="([^"]+)" \/>

I say this with the disclaimer that this is the wrong tool and there are edge cases that make this unreliable and I don't recommend it.

http://regexr.com?301uq

answered Feb 17 '12 at 16:55

Sam Greenhalgh

Okay. Thanks for the suggestion. I've taken a look at the link you provided. It is not clear to me how to use XSLT (something I've never heard of till now) as part of automating the task at hand. Any suggestions? I also read through the page linked to by the one you linked to. I didn't find anything I was able to figure out how to move forward with. – inspirednz Feb 17 '12 at 17:05
So I have been learning more about why regex may not be the way to go about locating the XML attribute in question. Although I'm not convinced yet that it's not possible to use regex for finding something so simple. That may be due to my ignorance on the limitations of regex. – inspirednz Feb 17 '12 at 17:26
@inspiredlife: I would say that regexes are the right tool here, because there may be no way to force a conforming XML parser to output properties in the desired order, and any solution that involves parsing the whole document risks triggering other bugs in the affected implementation. By contrast, a tool like `sed` will allow you to edit only the parts of the document you want to change. – Marcin Feb 17 '12 at 18:08

score -1 · Answer 4 · edited May 23 '17 at 12:27

I agree with zapthedingbat's answer: this is an XML problem, so let's use tools specifically designed for XML, namely XSLT.

Since you're new to XSLT, you'll need an XSLT processor to try this solution. If you are using *nix, xsltproc is a command-line processor that is almost surely installed by default, and you can take this solution at face value. If not, you'll need to see if your language of choice has an API for performing XSL transforms.

Here's a very simple general solution for reordering the attributes:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="@*|node()">
  <!-- copy everything as is -->
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="meta">
    <!-- except for the <meta/> element, reverse the attribute order -->
    <meta name="{@name}" content="{@content}"/>
  </xsl:template>
</xsl:stylesheet>

Here's your example:

<root>
  <meta content="id5" name="cover" />
</root>

Running the XSLT with xsltproc:

$ xsltproc so.xsl so.xml

and the result:

<root>
  <meta name="cover" content="id5"/>
</root>

answered Feb 17 '12 at 18:13

Zach Young

Why the downvote? This answer fully satisfies the question *How can I use a script to change some text in an epub file?* – Zach Young Feb 17 '12 at 18:25
Using any kind of XML processing is extremely unattractive, as you don't know what valid markup will trigger other bugs in the nook. Targeted text editing is what is required here. – Marcin Feb 17 '12 at 18:28
@Marcin Can you qualify "unattractive"? – Zach Young Feb 17 '12 at 18:30
Read the rest of my sentence. – Marcin Feb 17 '12 at 18:32
@Marcin Can you prove this will not work? Marking the answer down because it *may* cause problems seems unfair. I'm all for learning something new and taking a different stance when presented with the facts, but as far as I can see, this is just speculation. – Zach Young Feb 17 '12 at 18:35
Here are the facts: there is a buggy XML processor; it is known that in at least one case it does not cope with valid markup; other than that documents in the form received by OP do not appear to have errors. It is also a fact that a conforming XSLT processor is permitted to alter the stream of characters it receives in a way that is not specified by the XSLT, if those changes result in XML with the exact same meaning. Accordingly, you risk making unwanted changes to the input document. Your solution is fundamentally unsafe. – Marcin Feb 17 '12 at 18:46
