Web Scraping Handbook
Getting Data from website HTML
Learned From:
https://www.codewithharry.com/videos/python-web-scraping-tutorial-in-hindi
My Written Source Codes:
https://paste.ubuntu.com/p/dfpwWwJjsG/
View the code at this link, because Medium does not preserve indentation.
Set up a virtual environment and activate it (the activate command below is for Windows).
virtualenv myenv
.\myenv\Scripts\activate
Install the packages
pip install requests
pip install bs4
pip install html5lib
To start, type this and run it. This is the hello world of web scraping.
import requests
from bs4 import BeautifulSoup
url = "https://codewithharry.com/"
r = requests.get(url)
htmlContent = r.content
print(htmlContent)
The output will be messy: the raw HTML of the page printed as one long run of bytes.
Now we will parse the HTML source code to view it in a well-organized structure.
soup = BeautifulSoup(htmlContent, 'html.parser')
print(soup)
The output is now parsed, that is, organized as a tree of tags. This is the source code of that HTML page.
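As a small offline illustration of what parsing buys you, the sketch below feeds a hard-coded HTML string to BeautifulSoup and prints it with prettify(), which puts each tag on its own indented line. The sample_html string is a made-up stand-in for the downloaded page, not the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded htmlContent.
sample_html = "<html><head><title>Demo Page</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a string with nested tags indented on separate lines.
pretty = soup.prettify()
print(pretty)
```

Using prettify() instead of a bare print(soup) is optional; it only changes how the same parsed tree is displayed.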
Let's view the title of the HTML page.
title = soup.title
print(title)
print(type(title))
Let's view the data types. The title text is not a normal string; it is a data type provided by BS4.
print(title.string)
print(type(title.string))
print(type(soup))
print(type(title))
print(type(title.string))
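To make the three types concrete, here is a minimal sketch on a made-up HTML string (sample_html is an assumption for illustration). It also shows that the title text can be turned into an ordinary Python string with str().

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample markup, used so the example runs without a network call.
sample_html = "<html><head><title>Demo Page</title></head><body></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")
title = soup.title

print(type(soup).__name__)          # BeautifulSoup
print(type(title).__name__)         # Tag
print(type(title.string).__name__)  # NavigableString

# NavigableString is a subclass of str; str() gives a plain string copy.
plain_title = str(title.string)
print(plain_title)                  # Demo Page
```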
To get paragraphs from the HTML page, use find() for the first match and find_all() for every match.
paras = soup.find_all('p')
print(paras)
The output will be a list of all the <p> tags on the page.
To find anchor tags <a href=…></a>
anchors = soup.find_all('a')
print(anchors)
Getting the first paragraph
print(soup.find('p'))
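The contrast between find() and find_all() can be sketched on a tiny made-up page (sample_html is an assumption, not the real site). The part worth noticing is what each call gives back when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two paragraphs and no table.
sample_html = "<body><p>first</p><p>second</p></body>"
soup = BeautifulSoup(sample_html, "html.parser")

first_p = soup.find("p")      # a single Tag, or None when nothing matches
all_p = soup.find_all("p")    # a list-like result, empty when nothing matches

print(first_p)
print(len(all_p))

# Check for None before using the result of find().
missing = soup.find("table")
if missing is None:
    print("no <table> on this page")
```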
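Getting the class of the first paragraph (this raises KeyError if the tag has no class attribute)
print(soup.find('p')['class'])
Getting the id of the first paragraph (this raises KeyError if the tag has no id attribute)
print(soup.find('p')['id'])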
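Since not every tag carries a class or an id, a safer way to read attributes may be useful. The sketch below uses a made-up HTML string (sample_html is an assumption) and shows both square-bracket access and .get().

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one paragraph with attributes, one without.
sample_html = '<p id="intro" class="lead big">Hi</p><p>No attributes</p>'
soup = BeautifulSoup(sample_html, "html.parser")
first, second = soup.find_all("p")

print(first["id"])     # intro
print(first["class"])  # ['lead', 'big'] (class can hold several values, so it is a list)

# Indexing a missing attribute raises KeyError; .get() returns a default instead.
print(second.get("id"))         # None
print(second.get("class", []))  # []
```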
Find all the paragraph elements with the class name lead
print(soup.find_all('p', class_='lead'))
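One detail of class_ filtering is that a tag with several classes still matches if any one of them is the name you asked for. A minimal sketch, using made-up markup (sample_html is an assumption):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with single-class and multi-class paragraphs.
sample_html = """
<p class="lead">one</p>
<p class="lead big">two</p>
<p class="small">three</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with a trailing underscore) is used because class is a Python keyword.
leads = soup.find_all("p", class_="lead")
lead_texts = [p.get_text() for p in leads]
print(lead_texts)  # ['one', 'two']
```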
Getting the text of a paragraph tag (here the first one)
print(soup.find('p').get_text())
Getting the text of the full page (without tags, only text)
print(soup.get_text())
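Full-page text often comes out with stray whitespace. As a sketch, get_text() accepts separator and strip arguments that tidy this up; the markup below is made up for the example (sample_html is an assumption).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with extra spaces around the text.
sample_html = "<div><h1> Title </h1><p>First line</p><p> Second line </p></div>"
soup = BeautifulSoup(sample_html, "html.parser")

# Default: every text piece is joined as-is, spaces included.
raw_text = soup.get_text()

# strip=True trims each piece; separator decides what goes between pieces.
clean_text = soup.get_text(separator="\n", strip=True)

print(repr(raw_text))
print(clean_text)
```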
Getting all links of this page
anchors = soup.find_all('a')
for link in anchors:
    print(link.get('href'))
Getting all links of this page (BETTER: unique links only, with the site address added in front)
anchors = soup.find_all('a')
all_links = set()
for link in anchors:
    href = link.get('href')
    # Skip anchors that have no href and placeholder '#' links
    if href and href != '#':
        linkText = "https://codewithharry.com" + href
        all_links.add(linkText)
print(all_links)
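Joining the site address by hand goes wrong when a link is already absolute or has no href at all. One possible alternative, not part of the original tutorial, is urljoin from Python's standard library. The markup below is a made-up sample (sample_html and its links are assumptions).

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://codewithharry.com/"

# Hypothetical sample markup covering relative, placeholder, missing and absolute links.
sample_html = """
<a href="/videos">Videos</a>
<a href="#">Top</a>
<a>No href</a>
<a href="https://example.com/page">External</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

all_links = set()
for link in soup.find_all("a"):
    href = link.get("href")
    # Skip anchors without an href and in-page '#' links.
    if not href or href.startswith("#"):
        continue
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    all_links.add(urljoin(base_url, href))

print(sorted(all_links))
```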
Comment related parsing
Find an element by its id
navbarSupportedContent = soup.find(id='navbarSupportedContent')
print(navbarSupportedContent)
Find the children of an element (which is searched by id). Printing .children directly shows only an iterator object, so loop over it to see the elements.
navbarSupportedContent = soup.find(id='navbarSupportedContent')
print(navbarSupportedContent.children)
.contents is similar to .children, but there are some differences.
navbarSupportedContent = soup.find(id='navbarSupportedContent')
print(navbarSupportedContent.contents)
Print content elements
for elem in navbarSupportedContent.contents:
    print(elem)
Print children elements
for elem in navbarSupportedContent.children:
    print(elem)
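Difference between .contents and .children
.contents = a list, so it supports indexing and len()
.children = an iterator over the same child nodes; it yields them one at a time (wrap it in list() to keep them)
A for loop works with either one. Use .children when you only need to walk the children once, and .contents when you need indexing or a count. Because the parsed tree already holds the list of children, the memory difference between the two is small in practice.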
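A short sketch of the practical difference, using a made-up menu (sample_html is an assumption): .contents behaves like a normal list, while .children has to be iterated.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a list with two items.
sample_html = '<ul id="menu"><li>Home</li><li>Blog</li></ul>'
soup = BeautifulSoup(sample_html, "html.parser")
menu = soup.find(id="menu")

# .contents is a list: it can be indexed and measured.
print(type(menu.contents).__name__)  # list
print(len(menu.contents))            # 2
print(menu.contents[0].get_text())   # Home

# .children is an iterator: loop over it (or wrap it in list()) to see the items.
child_names = [child.name for child in menu.children]
print(child_names)                   # ['li', 'li']
```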
To view the strings of an element (the one found by id earlier). This will show the strings without the tags.
for item in navbarSupportedContent.strings:
    print(item)
For a cleaner view, use stripped_strings. It trims the surrounding whitespace and skips strings that are only whitespace.
for item in navbarSupportedContent.stripped_strings:
    print(item)
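To see what stripped_strings actually removes, the sketch below collects both versions into lists for a made-up navigation block (sample_html is an assumption).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with newlines and padding between the tags.
sample_html = '<div id="nav">\n  <a>Home</a>\n  <a> Blog </a>\n</div>'
soup = BeautifulSoup(sample_html, "html.parser")
nav = soup.find(id="nav")

raw = list(nav.strings)             # includes whitespace-only pieces
clean = list(nav.stripped_strings)  # trimmed, whitespace-only pieces dropped

print(raw)
print(clean)  # ['Home', 'Blog']
```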
To see the parent of an element:
print(navbarSupportedContent.parent)
To see all the parents of the element, meaning every ancestor going upward to the top of the document:
for item in navbarSupportedContent.parents:
    print(item)
To get the names of the parents (actually the tag names):
for item in navbarSupportedContent.parents:
    print(item.name)
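As a small illustration of walking upward, the sketch below lists the ancestor names for an element in a made-up document (sample_html is an assumption). The last name belongs to the BeautifulSoup object itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a div nested inside nav, body and html.
sample_html = "<html><body><nav><div id='menu'>x</div></nav></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")
menu = soup.find(id="menu")

# .parents yields the nearest ancestor first, then moves outward.
parent_names = [parent.name for parent in menu.parents]
print(parent_names)  # ['nav', 'body', 'html', '[document]']
```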
To get the next or previous sibling of an element. The nearest sibling is often just the whitespace between two tags, which is why the calls are chained twice here.
print(navbarSupportedContent.next_sibling)
print(navbarSupportedContent.next_sibling.next_sibling)
print(navbarSupportedContent.previous_sibling)
print(navbarSupportedContent.previous_sibling.previous_sibling)
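Chaining next_sibling twice is usually needed because the newline between two tags counts as a sibling too. A possible shortcut is find_next_sibling(), sketched here on a made-up list (sample_html is an assumption).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: list items separated by newlines.
sample_html = '<ul>\n<li id="a">A</li>\n<li id="b">B</li>\n</ul>'
soup = BeautifulSoup(sample_html, "html.parser")
first = soup.find(id="a")

# The immediate sibling is the newline text node, not the next <li>.
print(repr(first.next_sibling))  # '\n'

# find_next_sibling() skips text nodes and returns the next matching tag.
next_item = first.find_next_sibling("li")
print(next_item["id"])           # b

# It returns None when there is no such sibling.
previous_item = first.find_previous_sibling("li")
print(previous_item)             # None
```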
Select an element by its id. # is the CSS selector prefix for an id, the same syntax used with querySelector in JavaScript. select() returns a list of matching elements.
elem = soup.select('#loginModal')
print(elem)
To select elements by class name, put a dot (.) in front of the class name.
elem = soup.select('.modal-footer')
print(elem)
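Because select() always gives back a list, even for an id, it can help to compare it with select_one(), which returns a single element or None. The sketch below uses made-up markup (sample_html is an assumption; the id and class names are borrowed from the examples above).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one footer inside the modal, one outside it.
sample_html = """
<div id="loginModal"><div class="modal-footer">A</div></div>
<div class="modal-footer">B</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

by_id = soup.select("#loginModal")               # list with one element
footers = soup.select(".modal-footer")           # list with every match
first_footer = soup.select_one(".modal-footer")  # first match, or None

# Selectors can be combined: footers that sit inside #loginModal only.
nested = soup.select("#loginModal .modal-footer")

print(len(by_id), len(footers), len(nested))  # 1 2 1
print(first_footer.get_text())                # A
```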