Questions tagged [web-mining]

Web mining is the application of data mining techniques to discover patterns from the World Wide Web. Web mining can be divided into three different types:

  1. Web usage mining;
  2. Web content mining;
  3. Web structure mining.
42 questions
16
votes
3 answers

Good dataset for sentiment analysis?

I am working on sentiment analysis and I am using the dataset given at this link: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html. I have divided my dataset in a 50:50 ratio: 50% is used as test samples and 50% is used as train…
user3512562
  • 233
  • 2
  • 3
  • 7
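
The 50:50 split described in this question can be sketched as follows. This is a minimal illustration, not the asker's code: the `split_half` helper, the fixed seed, and the tiny `reviews` list are all assumptions made for the example.

```python
import random


def split_half(samples, seed=0):
    """Shuffle a copy of the samples and split it into two equal halves."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


# Hypothetical labelled reviews: (text, label) with 1 = positive, 0 = negative.
reviews = [
    ("great product", 1),
    ("terrible battery", 0),
    ("works as expected", 1),
    ("broke after a week", 0),
    ("excellent value", 1),
    ("would not buy again", 0),
]

train, test = split_half(reviews)
print(len(train), len(test))
```

Shuffling before splitting matters when the source file is ordered by label; without it, one half could contain mostly one class.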
8
votes
5 answers

Fast internet crawler

I'd like to perform data mining on a large scale. For this, I need a fast crawler. All I need is something to download a web page, extract links and follow them recursively, but without visiting the same URL twice. Basically, I want to avoid…
pbp
  • 1,461
  • 17
  • 28
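
The crawl loop this question describes (download a page, extract its links, follow them, never visit the same URL twice) might look roughly like the sketch below. It uses only the Python standard library and replaces recursion with an explicit queue, which avoids hitting the recursion limit on deep sites. The names `LinkExtractor`, `fetch`, `crawl`, and the `max_pages` cap are assumptions for illustration; this is not a fast or production-grade crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl that visits each URL at most once."""
    visited = set()
    queue = deque([start_url])
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch_page(url)
        except Exception:
            # Skip pages that cannot be fetched and keep crawling.
            continue
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments so that
            # /page and /page#top count as the same URL.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in visited:
                queue.append(absolute)
    return crawled
```

Calling `crawl("http://example.com/")` would return the URLs in the order they were fetched. The `fetch_page` parameter exists so that the download step can be swapped out, for example for a concurrent fetcher or an in-memory stub when testing.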
3
votes
3 answers

Java API for web scraping or web mining

I'm looking for a good Java API to do web scraping. I tried the Web-Harvest API http://web-harvest.sourceforge.net/usage.php but I think it's a bit clunky. Any other suggestions?
finfinni
  • 39
  • 1
  • 2
3
votes
1 answer

Python Mechanize - How to submit an unlisted value in dropdown menu

I am using Python's mechanize to add items into an Amazon shopping cart. On an item's product page, you select the Quantity in the form's dropdown menu and submit Add to Cart. The dropdown menu allows you to select Quantities from 1 through 30. The…
Max
  • 65
  • 5
3
votes
1 answer

Programmatically look up a ticker symbol in R

I have a field of data containing company names, such as company <- c("Microsoft", "Apple", "Cloudera", "Ford") > company Company 1 Microsoft 2 Apple 3 Cloudera 4 Ford and so on. The package tm.plugin.webmining allows you to query data from…
Hack-R
  • 22,422
  • 14
  • 75
  • 131
2
votes
6 answers

Web mining or scraping or crawling? What tool/library should I use?

I want to crawl and save some webpages as HTML. Say, crawl hundreds of popular websites and simply save their front pages and the "About" pages. I've looked into many questions, but didn't find an answer to this from either web crawling or web…
Flake
  • 4,377
  • 6
  • 30
  • 29
2
votes
1 answer

Is there any web mining Library in Node.js for sentiment analysis?

I am doing sentiment analysis in JavaScript using Node.js. I am looking for web mining packages in Node to clean a web page. Is there any built-in package for web mining in Node, like the tm.plugin.webmining package we have in R? Thank you
2
votes
2 answers

Scraping data from a dynamic ecommerce webpage

I'm trying to scrape the titles of all the products listed on a webpage of an e-commerce site (in this case, Flipkart). The products that I would be scraping depend on the keyword entered by the user. A typical URL generated if I entered a…
Manas Chaturvedi
  • 5,210
  • 18
  • 52
  • 104
2
votes
1 answer

Any better preprocessing library or implementation in Python?

I need to preprocess some text documents so that I can apply classification techniques like FCM, etc., and topic modeling techniques like latent Dirichlet allocation, etc. To elaborate a bit, in preprocessing I need to remove the stop words,…
Kai
  • 953
  • 6
  • 16
  • 37
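
As a rough illustration of the stop-word removal step this question mentions, here is a standard-library sketch. The `STOP_WORDS` set is a tiny made-up list and `preprocess` is a hypothetical helper; a real pipeline would likely use a fuller stop-word list from an NLP library.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "that", "so"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


tokens = preprocess("The crawler is fast, and the parser is simple.")
print(tokens)
```

The resulting token lists can then be passed to a vectorizer or a topic model in place of the raw text.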
1
vote
1 answer

Difficulty in extracting main content from a news web page

I need to extract the main content (excluding links, advertisements, etc.) from a news web page. I have read about it on the web and learned that to do that I need to parse the HTML page and then select content from HTML tags. I have written code which…
dark_shadow
  • 3,503
  • 11
  • 56
  • 81
1
vote
1 answer

How can I use scrapy on booking.com without being blocked?

I am trying to scrape hotel reviews from booking.com with the Python library Scrapy. My problem is that the desired data (e.g. negative feedback) can't be found by Scrapy. I think it's because of the JavaScript code embedded in the…
Julia
  • 11
  • 1
1
vote
0 answers

Crawl data from URLs by passing URL to Scrapy from other *.py file

I'm using Scrapy to crawl data from a website, and this is my code in the file spider.py in the spider folder of Scrapy: class ThumbSpider(scrapy.Spider): userInput = readInputData('input/user_input.json') name = 'thumb' # start_urls =…
Claire Duong
  • 103
  • 1
  • 7
1
vote
1 answer

How to get text and href value in anchor tag with scrapy, xpath, python

I have an HTML file like this: In the folder spiders, I have a file jokes.py like this: import scrapy from…
Claire Duong
  • 103
  • 1
  • 7
1
vote
1 answer

Rcrawler - How to crawl account/password protected sites?

I am trying to crawl and scrape a website's tables. I have an account with the website, and I found out that Rcrawler could help me with getting parts of the table based on specific keywords, etc. The problem is that on the GitHub page there is no…
1
vote
0 answers

Twitter streaming API, where to find originator's name?

I am using Python to stream tweets via Twitter's API. For example, the word "car" generates the following results: { "created_at": "Fri Sep 05 00:15:32 +0000 2014", "id": 507683414255108096, "id_str": "507683414255108096", "text": "I…
KubiK888
  • 4,377
  • 14
  • 61
  • 115