How to parse documents using crawlers

Question

I am new to this topic, but I need to parse documents of different types (HTML, PDF, TXT) using a crawler. Please suggest which crawler to use for this, and point me to some tutorials or examples that show how to parse documents with a crawler.

Thank you.

What operating system are we talking about? Are the files on different web servers or on a local file system? — Chief Wiggum, May 10 '13 at 04:27
Thanks for the reply. I am using Linux, and all the files are on the local file system only. — user2353439, May 10 '13 at 04:29
I just want to verify that you do indeed mean crawlers and not scrapers. Crawlers are typically used for indexing. See http://stackoverflow.com/questions/4327392/crawling-vs-web-scraping and http://stackoverflow.com/questions/3207418/crawler-vs-scraper. — Dennis, May 10 '13 at 07:29

score 2 · Accepted Answer · answered May 10 '13 at 05:00

This is a very broad question, so my answer is also very broad and only scratches the surface.
It all comes down to two steps: (1) extracting the data from its source, and (2) matching and parsing the relevant data.

1a. Extracting data from the web

There are many ways to scrape data from the web. Different strategies can be used depending on whether the source is static or dynamic.

If the data is on static pages, you can download the HTML source for all the pages (automatically, not by hand) and then extract the data from the HTML source. Downloading the HTML source can be done with many different tools (in many different languages); even a simple wget or curl will do.
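As a rough illustration of the download step, here is a minimal Python sketch that uses only the standard library. The function name download_pages, the page_NNNN.html naming scheme, and the example URLs are assumptions made for this illustration; they are not part of any particular tool.

```python
import urllib.request
from pathlib import Path


def download_pages(urls, out_dir):
    """Fetch each URL and save the raw response body as a file in out_dir.

    Returns the list of saved file paths, in the same order as urls.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        # Hypothetical naming scheme: page_0000.html, page_0001.html, ...
        target = out_dir / f"page_{index:04d}.html"
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
        saved.append(target)
    return saved


# Example call (placeholder URLs):
# download_pages(["http://example.com/a.html", "http://example.com/b.html"], "pages")
```

urllib.request.urlopen also accepts file:// URLs, so the same function can copy pages that already live on the local file system.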

If the data is on a dynamic page (for example, if the data sits behind a form that triggers a database query), then a good strategy is to use an automated web scraping or testing tool. There are many of these. See this list of Automated Data Collection resources [1]. With such a tool you can extract the data right away; you usually don't need the intermediate step of explicitly saving the HTML source to disk and parsing it afterwards.

1b. Extracting data from PDF

Try Tabula first. It's an open source web application that lets you visually extract tabular data from PDFs.

If your PDF doesn't have its data neatly structured in simple tables or you have way too much data for Tabula to be feasible, then I recommend using the *NIX command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.

Use the command man pdftotext to see the manual page for the tool. One useful option is -layout, which tries to preserve the original layout in the text output. The default behaviour is to "undo" the physical layout of the document and instead output the text in reading order.
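If you want to run pdftotext over many files from a script, a small wrapper like the following may help. This is only a sketch: it assumes pdftotext is installed and on the PATH, and the helper names build_pdftotext_command and pdf_to_text are made up for this example.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for: pdftotext [-layout] PDF-file text-file."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def pdf_to_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, keep_layout), check=True)


# Example call (placeholder file names):
# pdf_to_text("report.pdf", "report.txt")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.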

1c. Extracting data from spreadsheet

Try xls2text for converting to text.

2. Parsing the (HTML/text) data

For parsing the data, there are also many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library if you're dealing with HTML source. Don't limit yourself to these options, though; you can use any language or tool that you're familiar with.
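To give a feel for HTML parsing, here is a small sketch that pulls the link targets and the visible text out of an HTML string. Note that it uses the html.parser module from the Python standard library rather than BeautifulSoup, so that it runs without installing anything; the class and function names are invented for this example.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Collect href values of <a> tags and the non-empty text chunks."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Called for every text node, including text inside <script>/<style>.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def extract_links_and_text(html):
    """Return (list of link targets, text chunks joined by single spaces)."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)
```

For messy real-world pages a more forgiving library such as BeautifulSoup is usually the more comfortable choice; the structure of the code stays much the same.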

When you're parsing and extracting the data, you're essentially doing pattern matching. Look for unique patterns that make it easy to isolate the data you're after.

One method, of course, is regular expressions. Say I want to extract email addresses from a text file named file.

grep -Eio "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b" file

The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
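The same pattern can be used from a script as well. The following Python sketch applies the regular expression from the grep command above; the function name extract_emails is just an illustrative choice.

```python
import re

# Same pattern as the grep command above; IGNORECASE mirrors grep's -i flag.
EMAIL_PATTERN = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
    re.IGNORECASE,
)


def extract_emails(text):
    """Return every substring of text that matches the email pattern, in order."""
    return EMAIL_PATTERN.findall(text)


# Example call on a file's contents (placeholder file name):
# with open("file", encoding="utf-8") as handle:
#     print("\n".join(extract_emails(handle.read())))
```

Like the grep version, this pattern only accepts a final domain part of 2 to 4 letters, so addresses under longer top-level domains are not matched.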

[1] Note that this list is not exhaustive; it's missing many options.
[2] This regular expression isn't bulletproof; there are some edge cases it will not cover. Alternatively, you can use a script that I've created which is much better for extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706

Thanks for the answer, it has given me some idea to meet my requirement. — user2353439, May 10 '13 at 05:10
@user2353439 you're welcome. There are many tutorials online you can read once you decide on a specific tool set to use. If you're satisfied with my answer, please mark it as the accepted answer so others who view this thread know it's been answered. — Dennis, May 10 '13 at 05:33
Answer accepted, but can you suggest which tool is better to use? Can I use Apache Nutch for my requirement? — user2353439, May 10 '13 at 05:36
@user2353439 Which tool is better to use depends on what you already know and are familiar with, the sources of your data, the environment you're working in, etc. In other words, there are too many variables for me to recommend a specific tool for you. If you narrow down your question so it's more specific, I could provide more help. - I took a look at Apache Nutch and I don't think it'll be useful for you (unless I misunderstood what you're trying to do). Nutch is an open source web search engine; it's overkill if all you want to do is scrape some data. — Dennis, May 10 '13 at 07:20
@user2353439 What data are you trying to extract from an FLV file? Are you trying to download FLV files on the internet? Are you trying to take screenshots of the video? Are you trying to extract metadata from the file? — Dennis, May 10 '13 at 07:21
@user2353439 I suggest creating a new question on SO for that. Or search for "extract metadata from flv" on Google. If you're interested in a specific language/tool, you can search for "python flv metadata" for example. edit: there's already a question for it on SO, http://stackoverflow.com/questions/3761358/flash-flv-how-do-i-extract-metadata-from-an-flv — Dennis, May 10 '13 at 07:36
