Merge pull request #45 from nitish-iiitd/master

Added a Simple Webpage Parser Wrapper
Ayush Bhardwaj 2018-10-10 16:36:01 +05:30 committed by GitHub
commit 3340024680
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
4 changed files with 32 additions and 0 deletions

@ -0,0 +1,11 @@
# Simple Webpage Parser
A simple wrapper around the popular web scraping library BeautifulSoup. It combines the Requests and BeautifulSoup libraries in one class, which hides the steps of downloading the HTML from a webpage's URL and parsing it, so the user gets cleaner code to work with.
## Libraries Required
1. requests
`$ pip install requests`
2. beautifulsoup4
`$ pip install beautifulsoup4`
3. lxml (the parser that `getHTML` passes to BeautifulSoup)
`$ pip install lxml`
## Usage
A sample script, `webpage_parser.py`, shows how to use `SimpleWebpageParser`. It prints all the links from the Hacktoberfest home page.
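
The sample script prints whole `<a>` tags. As a minimal sketch of going one step further, the snippet below pulls only the `href` values out of a BeautifulSoup object. It is not part of this pull request: `SAMPLE_HTML` and `extract_links` are hypothetical names used for illustration, and the markup is inline so the example runs without a network request. It uses Python's built-in `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/start">Start</a>
  <a href="/faq">FAQ</a>
  <a name="top">No href here</a>
</body></html>
"""


def extract_links(soup):
    """Return the href of every <a> tag that has one (hypothetical helper)."""
    # href=True skips anchors that carry no href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


# "html.parser" ships with Python, so no extra parser install is needed here.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = extract_links(soup)
print(links)
```

The same helper should work on the object returned by `getHTML()`, since that is also a BeautifulSoup object.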

@ -0,0 +1,13 @@
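import requests
from bs4 import BeautifulSoup


class SimpleWebpageParser():
    """Fetch a web page and return it as a BeautifulSoup object."""

    def __init__(self, url):
        self.url = url

    def getHTML(self):
        # Download the page, then parse its HTML.
        # The "lxml" parser requires the lxml package to be installed.
        r = requests.get(self.url)
        data = r.text
        soup = BeautifulSoup(data, "lxml")
        return soup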

@ -0,0 +1,8 @@
from SimpleWebpageParser import SimpleWebpageParser
swp = SimpleWebpageParser("https://hacktoberfest.digitalocean.com/")
html = swp.getHTML()
print(html.find_all('a'))
## The returned html is a BeautifulSoup object, so it can be queried with the usual BeautifulSoup methods.
## Refer to the BeautifulSoup documentation for more functionality.