Parsing a complex text file

Question

savetonotherfile.write(
        openfileagain.read().replace(
            "b'<HTML>\n<HEAD>\n<TITLE> Euro Millions Winning Numbers</TITLE>\n<BODY>\n<PRE> Euro Millions Winning Numbers\n\nNo., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins\n",
            '').replace(
            "\n<HR><B>All lotteries below have exceeded the 180 days expiry date</B><HR>No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins\n",
            '').replace(
            "\n\nThis page shows all the draws that used any machine and any ball set in any year.\n\nData obtained from http://lottery.merseyworld.com/Euro/\n</PRE>\n</BODY></HTML>\n'",
            ''))

I am trying to use the statement above to delete text from a text file whose contents look like this: b'<HTML>\n<HEAD>\n<TITLE> Euro Millions Winning Numbers</TITLE>\n<BODY>\n<PRE> Euro Millions Winning Numbers\n\nNo., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins\n562, Fri, 8,Feb,2013, 09,11,14,34,44,10,11, 27886637, 0\n561, Tue, 5,Feb,2013, 06,25,31,40,45,06,07, 19070109, 0\n560, Fri, 1,Feb,2013, ... some text to delete, more numbers, some more text to delete. The .replace() calls are not doing anything, or at least what is written to the output file is identical to what was read from the input file. What have I done wrong? I also want to delete the long integer and the subsequent text up to the comma after the date, but I haven't even started on that hurdle, since I cannot accomplish even the simplest thing!

There are plenty of modules for parsing xml and html. Do yourself a favor and use one of them... — StoryTeller - Unslander Monica, Feb 11 '13 at 14:51

score 0 · Accepted Answer · answered Feb 11 '13 at 14:51

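Add r before the string literals in the first argument of each replace() call to make them raw strings, or change every \n in them to \\n. In an ordinary Python string literal, \n stands for a single newline character, but your file contains the two characters backslash and n, so the search strings never match.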

Ray

Worked perfectly, i.e. changing to \\n. I'm not sure where to add the r. Would you mind being more explicit? Thank you, however. – user1478335 Feb 11 '13 at 15:00
For example, `r"raw string with \n"`. It's a special syntax in Python. – Ray Feb 11 '13 at 15:03
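To make the difference concrete, here is a minimal sketch. The shortened header text and the variable names are stand-ins invented for illustration, not the asker's real data; the point is only that an ordinary "\n" is one newline character, while r"\n" is a backslash followed by n, which is what the file actually contains.

```python
# Text as it sits in the file: a backslash followed by "n", not a real line break.
# (Shortened, hypothetical stand-in for the real file contents.)
file_text = r"<PRE> Euro Millions Winning Numbers\n\nNo., Day\n562, Fri\n"

# Ordinary literal: "\n" is ONE character (a newline), so nothing matches.
unchanged = file_text.replace("<PRE> Euro Millions Winning Numbers\n\nNo., Day\n", "")

# Raw literal: r"\n" is TWO characters (backslash, n), so the header is removed.
cleaned = file_text.replace(r"<PRE> Euro Millions Winning Numbers\n\nNo., Day\n", "")

print(len("\n"), len(r"\n"))   # 1 2
print(unchanged == file_text)  # True: the first replace() did nothing
print(cleaned)
```

Writing "\\n" in an ordinary literal produces the same two characters as r"\n", which is why both suggestions in the answer work.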

score 0 · Answer 2 · answered Feb 11 '13 at 14:53

It's not really a good idea to try to work with HTML like this. It's usually better to use an HTML parsing module such as BeautifulSoup (assuming that is HTML; see my edit below). Either way, you will be able to find the bug much more easily if you break your code up into smaller steps and factor out the long replacement strings. E.g.:

replace_map = (('first string', 'replace with this'),
               ('second string', 'replace the second with this'))

with open(inputfilename, 'rt') as infile:
    output = infile.read()
    for fromstr, tostr in replace_map:
        output = output.replace(fromstr, tostr)

with open(outputfilename, 'wt') as outfile:
    outfile.write(output)

Edit: After posting my answer I noticed that you seem to be parsing text of the form "b'<html code/>'". Is this correct? It looks like you have a string describing a Python bytes object. If that is really what you're doing, then HTML parsing won't help you, but I would suggest you seriously question why you're doing it and decide whether it is the best way to achieve the end result.
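One plausible way such a file comes about, offered as an assumption rather than something the question confirms, is that downloaded bytes were passed through str() before being written out. The sketch below uses a made-up bytes value and assumes UTF-8 to contrast str() with decode():

```python
# Hypothetical stand-in for bytes downloaded from the web page.
raw = b"<PRE> Euro Millions Winning Numbers\n\n562, Fri, 8,Feb,2013\n"

# str() on bytes gives the repr: it starts with b' and shows each newline
# as the two characters backslash and n.
as_repr = str(raw)

# decode() gives real text with real newline characters (encoding assumed).
as_text = raw.decode("utf-8")

print(as_repr.startswith("b'"))   # True
print("\n" in as_repr)            # False: no real newlines in the repr
print("\n" in as_text)            # True
```

If this is what happened, decoding the bytes before saving them would leave ordinary newlines in the file, and the original search strings with "\n" would then match.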

Thank you for this. I shall try to work with this as well. Need to try it out — user1478335, Feb 11 '13 at 15:24

score 0 · Answer 3 · answered Feb 11 '13 at 16:36

For complex text manipulation, it is evident that one MUST use regular expressions.
I urge you to study the re module. You'll get more satisfaction from it than from tinkering with replace().

As for the code you gave, its execution does the following:
- it reads the text from the file handle openfileagain: that creates string #1
- it replaces one portion of this text, that is, of string #1: that creates a new string #2
- it replaces a second portion, this time searching in string #2: that creates a third string #3
- it replaces a third portion, this time searching in string #3: that creates string #4.

With a regular expression, by contrast, you describe all 3 portions to replace in one pattern, and the re machinery creates the same string #4 directly from string #1, without passing through strings #2 and #3.
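A minimal sketch of that single-pass idea follows. The sample text and the three portions are short hypothetical stand-ins for the long literals in the question, and real newlines are assumed here for readability:

```python
import re

# Hypothetical sample: two data rows surrounded by three unwanted portions.
text = "HEADER\n562, Fri, 8,Feb,2013\nEXPIRED NOTICE\n561, Tue, 5,Feb,2013\nFOOTER\n"

# Stand-ins for the three long strings passed to replace() in the question.
portions = ["HEADER\n", "EXPIRED NOTICE\n", "FOOTER\n"]

# re.escape makes each portion match literally; "|" joins them as alternatives.
pattern = re.compile("|".join(re.escape(p) for p in portions))

# One call removes every occurrence of any portion, going from string #1
# straight to the final string.
result = pattern.sub("", text)
print(result)
```

Because every portion goes through re.escape, characters such as "." or "<" in the real header text would be matched literally rather than treated as regex syntax.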

eyquem

Thank you. I shall study the re module as you recommend. I still have difficulty returning from text files exactly what I want, thus this exercise that I have set myself. Really want to be able to parse anything. This is just a handy set of numbers and text to use. – user1478335 Feb 11 '13 at 18:38
@user1478335 Let me extend my advice. Regular expressions are somewhat difficult, and they are not the best fit for certain kinds of analysis. There are many parsers and data analysis tools that can help more quickly, more easily and more reliably than regular expressions. In the case you describe, however, I would use a regex because your aim is simple. – eyquem Feb 11 '13 at 19:19
@user1478335 In addition to eyquem's comment see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – aquavitae Feb 12 '13 at 06:01
