Getting Data from website HTML

Learned From:

https://www.codewithharry.com/videos/python-web-scraping-tutorial-in-hindi

My Written Source Codes:

https://paste.ubuntu.com/p/dfpwWwJjsG/

View the code at this link, because Medium does not preserve indentation.

Set up a virtual environment and activate it.

Install the packages

Start by typing in the first script and running it. This is the hello world of web scraping.
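A minimal sketch of what such a first script might look like, assuming the requests package is one of the installed packages. The function name fetch_html and the URL in the usage comment are placeholders, not something from the original code.

```python
import requests


def fetch_html(url: str) -> bytes:
    """Download a page and return its raw HTML as bytes."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.content


# Example usage (placeholder URL, replace with the page you want to scrape):
# html_content = fetch_html("https://example.com")
# print(html_content)
```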

The output will be messy: raw HTML with no formatting.

Now we will parse the HTML source code to view it in a well-structured form.
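Here is a small sketch of the parsing step. It uses a made-up inline HTML string instead of a downloaded page so it runs offline, and it assumes Python's built-in "html.parser" backend (other parsers such as html5lib would also work).

```python
from bs4 import BeautifulSoup

# Made-up sample page, standing in for the downloaded HTML
html_content = "<html><head><title>Demo Page</title></head><body><p>Hello</p></body></html>"

# Build the parse tree, then render it with one tag per line and indentation
soup = BeautifulSoup(html_content, "html.parser")
pretty = soup.prettify()
print(pretty)
```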

The output will be parsed, that is, organized. This is the source code of that HTML page.

Let's view the title of the HTML page.

Let's view the data type. It is not a plain string; it is one of the data types provided by BS4.
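A short sketch of both steps, again on a made-up HTML string with the "html.parser" backend assumed. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html_content = "<html><head><title>Demo Page</title></head><body></body></html>"
soup = BeautifulSoup(html_content, "html.parser")

title = soup.title
print(title)               # the whole <title> tag, not just its text
print(type(title))         # a BS4 Tag object
print(type(title.string))  # a BS4 NavigableString object
```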

To get paragraphs from the HTML page: find() returns the first match, find_all() returns all matches.
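A quick sketch of the two calls side by side. The sample HTML is invented for this example.

```python
from bs4 import BeautifulSoup

html_content = """
<body>
  <p class="lead">First paragraph</p>
  <p>Second paragraph</p>
</body>
"""
soup = BeautifulSoup(html_content, "html.parser")

first_para = soup.find("p")     # a single Tag (or None if nothing matches)
all_paras = soup.find_all("p")  # a list-like collection of every match

print(first_para)
print(all_paras)
```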

Output will be like this:

To find anchor tags <a href=…></a>
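For instance, something along these lines should work (sample HTML made up for the example):

```python
from bs4 import BeautifulSoup

html_content = '<body><a href="/home">Home</a> <a href="/about">About</a></body>'
soup = BeautifulSoup(html_content, "html.parser")

# Every <a> tag on the page, in document order
anchors = soup.find_all("a")
print(anchors)
```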

Getting first paragraph

Getting the first element's class

Getting the first element's id
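A possible sketch covering the last three steps together, using an invented paragraph that has both an id and two classes:

```python
from bs4 import BeautifulSoup

html_content = '<body><p id="intro" class="lead big">Hello</p><p>Bye</p></body>'
soup = BeautifulSoup(html_content, "html.parser")

first_para = soup.find("p")

# class can hold several values, so BS4 returns it as a list
print(first_para["class"])

# .get() returns None instead of raising KeyError when the attribute is missing
print(first_para.get("id"))
```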

Find all the elements with the class name lead
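One way to do this is the class_ keyword of find_all() (the trailing underscore is needed because class is a reserved word in Python). The HTML here is a made-up sample.

```python
from bs4 import BeautifulSoup

html_content = '<body><p class="lead">A</p><p class="lead small">B</p><p>C</p></body>'
soup = BeautifulSoup(html_content, "html.parser")

# Matches any element whose class list contains "lead"
lead_elements = soup.find_all(class_="lead")
print(lead_elements)
```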

Getting the text of a paragraph tag (here, the first one)

Getting all the text of the full page (without tags, only text)
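Both cases can be sketched with get_text(), called either on one tag or on the whole soup. The sample HTML is invented.

```python
from bs4 import BeautifulSoup

html_content = "<body><p>Hello <b>world</b></p><p>Bye</p></body>"
soup = BeautifulSoup(html_content, "html.parser")

# Text of the first paragraph only (nested tags are flattened)
first_text = soup.find("p").get_text()
print(first_text)

# Text of the whole page, with every tag removed
page_text = soup.get_text()
print(page_text)
```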

Getting all links of this page

Getting all links of this page (BETTER)
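A sketch of both versions. What "better" means here is my assumption: skip anchors without a real target, turn relative links into absolute ones, and drop duplicates with a set. The base_url and the sample HTML are placeholders.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com"  # placeholder for the scraped site's address
html_content = """
<body>
  <a href="/about">About</a>
  <a href="/about">About again</a>
  <a href="#">Top</a>
  <a href="https://example.org/x">External</a>
  <a>No href</a>
</body>
"""
soup = BeautifulSoup(html_content, "html.parser")

# Simple version: every href exactly as written (None when href is missing)
raw_links = [a.get("href") for a in soup.find_all("a")]
print(raw_links)

# Assumed "better" version: unique, absolute links only
links = set()
for a in soup.find_all("a", href=True):  # href=True skips anchors without href
    href = a["href"]
    if href == "#":
        continue  # placeholder link, points nowhere
    links.add(urljoin(base_url, href))
print(links)
```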

Comment-related parsing
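HTML comments come back as BS4 Comment objects. A small sketch on an invented snippet:

```python
from bs4 import BeautifulSoup, Comment

html_content = "<p><!-- this is a comment --></p>"
soup = BeautifulSoup(html_content, "html.parser")

# The only child of <p> is a comment, so .string returns it
markup = soup.p.string
print(type(markup))  # a BS4 Comment object
print(markup)

# Collect every comment anywhere in the document
all_comments = soup.find_all(string=lambda text: isinstance(text, Comment))
```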

Find an element by id

Find the children of an element (which is searched by id)
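A sketch of both steps. The id "navbar" and the HTML are made up for this example.

```python
from bs4 import BeautifulSoup

html_content = '<div id="navbar"><a href="/">Home</a><span>Menu</span></div>'
soup = BeautifulSoup(html_content, "html.parser")

# Look the element up by its id attribute
navbar = soup.find(id="navbar")

# Walk over its direct children
child_names = [child.name for child in navbar.children]
print(child_names)
```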

Contents are much the same as children, but there are some differences.

Print content elements

Print children elements

Difference between .contents and .children

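.contents = a list of the element's direct children, held in memory

.children = an iterator over the same children (it can be turned into a list manually)

A for loop over .children is the natural choice when you only need to walk through the children once. Use .contents when you need indexing or the length. Note that BS4 already keeps the children in a list internally, so the memory saving from .children is small in practice.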
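The two attributes can be compared directly. A small sketch on an invented element:

```python
from bs4 import BeautifulSoup

html_content = '<div id="navbar"><a href="/">Home</a><span>Menu</span></div>'
soup = BeautifulSoup(html_content, "html.parser")
navbar = soup.find(id="navbar")

contents = navbar.contents  # a real list: supports len() and indexing
print(len(contents), contents[0])

children = navbar.children  # an iterator: consume it with a for loop
for child in children:
    print(child)

# The iterator can be saved into a list manually if needed
saved_children = list(navbar.children)
```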

To view the strings of an element (searched by id previously). This will show the strings without the tags.

For a cleaner view, use stripped_strings. It strips the surrounding whitespace and skips strings that are only whitespace.
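A sketch showing the difference between the two, on an invented element with line breaks and indentation inside it:

```python
from bs4 import BeautifulSoup

html_content = """
<div id="navbar">
  <a href="/">Home</a>
  <a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html_content, "html.parser")
navbar = soup.find(id="navbar")

# .strings yields every text node, including the whitespace between tags
all_strings = list(navbar.strings)
print(all_strings)

# .stripped_strings yields only the meaningful text, already trimmed
clean_strings = list(navbar.stripped_strings)
print(clean_strings)
```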

To see the parent of an element.

To see all the parents of the element, meaning every ancestor up the tree.

To get the parents' names (actually tag names)
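These three steps might look like this. The ids and the HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

html_content = '<html><body><div id="box"><p id="target">Hi</p></div></body></html>'
soup = BeautifulSoup(html_content, "html.parser")
target = soup.find(id="target")

# Direct parent only
print(target.parent)

# Every ancestor, from the nearest one up to the document root
parent_names = [parent.name for parent in target.parents]
print(parent_names)
```

The last name in the list is BS4's own name for the soup object itself, which sits above the html tag.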

To get the next or previous sibling of an element
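A sketch with an invented list. The sample HTML has no whitespace between the tags on purpose: when there are line breaks between tags, next_sibling and previous_sibling can return that whitespace string instead of a tag, and find_next_sibling() is one way around it.

```python
from bs4 import BeautifulSoup

html_content = '<ul><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>'
soup = BeautifulSoup(html_content, "html.parser")
middle = soup.find(id="b")

print(middle.next_sibling)      # the <li> after it
print(middle.previous_sibling)  # the <li> before it

# Skips over any text nodes and returns the next matching tag
next_li = middle.find_next_sibling("li")
```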

Select an element by its id. # is the id selector in CSS, the same syntax used with querySelector in JavaScript.

To select elements by class name. A dot (.) is the CSS class selector.
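Both selectors can be tried with select(), which takes a CSS selector and returns a list of matches. The id "main" and the HTML are invented for this sketch.

```python
from bs4 import BeautifulSoup

html_content = '<div id="main"><p class="lead">One</p><p class="lead">Two</p><p>Three</p></div>'
soup = BeautifulSoup(html_content, "html.parser")

by_id = soup.select("#main")     # elements with id="main"
by_class = soup.select(".lead")  # elements with class "lead"
print(by_id)
print(by_class)

# select_one() returns only the first match (or None)
first_lead = soup.select_one(".lead")
```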

For Documentation

CS Undergrad | BUET
