
There is some text I need from a web page whose length changes somewhat from day to day, and I want to download that text periodically. I do not need the several dozen lines at the beginning and at the end of the roughly 250-line page. Because the total number of lines is unpredictable, I need to set the start and end points of the deletion based on bits of text that do not change from day to day. I've identified the target text patterns, so I'm looking to parse the content based on them such that the unwanted lines are deleted from the resulting document. I want to use command-line utilities for this, since I would like to automate the process and make a cron job out of it.

The download method of choice is lynx -dump www.specified.url > my-download.txt

That part is working fine. But processing the dump so as to cut off the unwanted beginning and ending lines is so far not working. I found a sed example that, it seems, should do what I need:

sed -n '/Phrase toward the beginning/,/Phrase toward the end/p' file_to_parse.txt >parsed_file.txt

It works partially, meaning it cuts off the file's beginning at the right point (all lines preceding "Phrase toward the beginning"). But I cannot seem to make it cut lines from the end, i.e., lines following the phrase "Phrase toward the end." All my attempts using this formula have so far not touched the end of the file at all. I should probably mention that most of the lines in the dumped file lynx produces begin, for whatever reason, with 3 blank spaces--including the "Phrase toward the end" line I'm trying to specify as the point after which further lines should be deleted.

I assume there may be more than one utility that can do the sort of parsing I'm after--sed and awk are the likely candidates I can think of. I tend to gravitate toward sed since its workings are slightly less mysterious to me than are awk's. But truth be told, I really only have the vaguest of conceptions as to how to use sed. When it comes to using and/or understanding awk, I get lost very, very quickly. Perhaps there are other utilities that can, based on textual patterns, lop off portions of the beginning and ending of a text file?

Input on how I might use sed, awk--or any other similar utility--to accomplish my goal, will be appreciated. This is to be done on an Ubuntu machine, btw.

LATER EDIT: sorry for not having posted an example. The downloaded page will look something like the following:

Unwanted line 1
Unwanted line 2
Unwanted line 3
Unwanted line etc
Phrase toward the beginning
Wanted line 1
Wanted line 2
Wanted line 3
Wanted line ca 4-198
Phrase toward the end
Unwanted line 200
Unwanted line 201
Unwanted line 202
Unwanted line . . . (to end of file)

The final output should look, on the other hand, like

Phrase toward the beginning
Wanted line 1
Wanted line 2
Wanted line 3
Wanted line ca 4-198
Phrase toward the end

I hope things will be clearer now. Please do bear in mind, though I've used line numbers to help better illustrate what I'm aiming to do, that I will be unable to do the desired deletions based on line numbers owing to the unpredictable ways in which the page I'm downloading will be changing.

MJiller
  • please post a small test input with the expected result. Story form is fine but data is better. – karakfa Oct 04 '16 at 02:37
  • instead of writing an essay, it would help if you just give examples, see http://stackoverflow.com/help/mcve – Sundeep Oct 04 '16 at 02:37
  • Your command works for me with your example input. Your real input must contain something not reflected in the example. Is the end phrase spanning multiple lines, maybe? – Benjamin W. Oct 04 '16 at 03:01
  • The only difference between my actual file and the sample given here is the presence of 3 blank spaces at the beginning of most lines in my dump file. I couldn't readily determine how to insert those in the sample. "Phrase toward the beginning" is not preceded by 3 spaces in my actual file, but "Phrase toward the end" is. I tried removing those 3 spaces and running the sed command on my file to see whether the spaces might be causing a problem, but the lines after "Phrase toward the end" did not get deleted, even after I'd deleted those spaces. – MJiller Oct 04 '16 at 03:10
  • so did you try `/..../,/^   Phrase toward the end/p` ? (Note the 3 spaces in front of `Phrase toward the end` in the match). In the future, please put the essay at the end of your Q. All we need is sample input, expected output, current code/output/error messages and then any research you have done. Good luck. – shellter Oct 04 '16 at 03:12
  • Deleting the lines outside the match is equivalent to only printing the lines between the two matches. – tripleee Oct 04 '16 at 03:26
  • This query is, indeed, answered elsewhere. I do not have the same need the poster there was stipulating (extracting text from multiple occurrences of the patterns within a given file), but the sample provided in the OP there does allow me to accomplish the goal at which I was aiming. I'm not sure how I missed that since I did quite a bit of searching, both using google and as I was preparing to post here. It was also interesting to find out about the grep option listed below, at least. – MJiller Oct 04 '16 at 04:06

1 Answer

If sed seems too difficult to debug, consider a double grep. For example, here we list the numbers 1 to 250, then grep for 70 plus up to 1000 lines after it, then grep for 80 plus up to 1000 lines before it:

seq 250 | grep -A 1000 '^70$' | grep -B 1000 '^80$'

Output:

70
71
72
73
74
75
76
77
78
79
80

Since the page is only about 250 lines long, 1000 is a safe upper bound for your data (but overkill for the above example).
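If you would rather not guess an upper bound at all, the same selection can probably be done with a short awk script. This is only a sketch: first_range is a made-up helper name, and it assumes you want just the first start-to-end range, stopping as soon as the end pattern has been printed.

```shell
# Hypothetical helper (the name first_range is not a standard tool):
# print from the first line matching the start pattern through the
# next line matching the end pattern, then stop reading input.
first_range() {
    awk -v first="$1" -v last="$2" '
        $0 ~ first          { inside = 1 }  # switch printing on at the start pattern
        inside              { print }       # print every line while inside the range
        inside && $0 ~ last { exit }        # quit once the end pattern has been printed
    '
}

# Same demo as above: select 70 through 80 out of 1..250.
seq 250 | first_range '^70$' '^80$'
```

Because awk exits at the end pattern, nothing after it is ever printed, however long the file is.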

Applied to the OP data, the example would become:

grep -A 1000 'Phrase toward the beginning' download_page.txt | \
grep -B 1000 'Phrase toward the end'

The debugging advantage over sed is that the error messages from grep are simpler than those from sed.
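For a cron job, the two greps could be wrapped in a small function, roughly as below. Treat it as a sketch rather than a finished script: extract_between is a name invented here, the 1000-line bound is the same assumption as above, and -a is added only in case grep decides the lynx dump is a binary file. The patterns are not anchored, so leading spaces on the matched lines should not matter.

```shell
# Hypothetical helper: extract_between START END FILE
# Keeps the first START line, everything after it (up to 1000 lines),
# and then drops whatever follows the END line.
extract_between() {
    # -a: treat the file as text even if grep thinks it is binary
    # -A 1000: the matching line plus up to 1000 lines after it
    # -B 1000: the matching line plus up to 1000 lines before it
    grep -a -A 1000 -- "$1" "$3" | grep -a -B 1000 -- "$2"
}

# Example call (file names taken from the question):
# extract_between 'Phrase toward the beginning' 'Phrase toward the end' \
#     my-download.txt > parsed_file.txt
```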

agc
  • Thank you. This may work for me. It wasn't at first but, as you state, I was getting an error message I could follow up on--about this being a binary file. Looking over the grep man page, I found the -a switch, which will "Process a binary file as if it were text." Once I added that switch to the command, it seems to work as advertised. – MJiller Oct 04 '16 at 03:37