
Web Scraping

Some websites don't have lovely APIs for us to interface with.


If we want data from these pages, we have to use a technique called scraping. This means downloading the whole webpage and poking at it until we find the information we want.


You're going to use scraping to get the top ten restaurants near you.


Get started

👉 Go to a website like Yelp and search for the top 10 restaurants in your location. Copy the URL.

 


url = "https://www.yelp.co.uk/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA%2C+United+States"  


Import libraries
👉 Import your libraries. Beautiful Soup is a specialist library for extracting the contents of HTML and helping us parse them. Run the Repl once your imports are sorted because we want the Beautiful Soup library to be installed (it'll run quicker this way).

import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.co.uk/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA%2C+United+States"  

Webpage as text
👉 Use requests to get the webpage as text. Printing the html shows just how much information is packed into the page.

import requests
from bs4 import BeautifulSoup
url = "https://www.yelp.co.uk/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA%2C+United+States"
response = requests.get(url)
html = response.text
print(html)  
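
Before poking at the text, it's worth checking that the request actually worked. A tiny sketch (status_code is part of the requests response object):

print(response.status_code)  # 200 means the page was fetched successfully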

More Scraping
👉 Next, we pass the HTML to Beautiful Soup to make more sense of it.

html.parser will scan through the HTML recognizing tokens in the text and breaking it down into something more meaningful.  

import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.co.uk/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA%2C+United+States"

response = requests.get(url)
html = response.text

soup = BeautifulSoup(html, 'html.parser')  
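
Once parsed, the soup object lets us query the page like a tree instead of a wall of text. A quick sketch of the kind of thing it makes possible (the tag names here are just safe examples; every page has a <title>):

print(soup.title)       # the whole <title> tag
print(soup.title.text)  # just the text inside it
print(soup.find("a"))   # the first <a> (link) tag on the page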

Inspect
👉 Back on Yelp, right click the first (non-sponsored) restaurant on the list and inspect it.

Right click the highlighted element and choose 'Copy element' to grab the whole link tag, then paste it into your repl as a temporary measure (as a comment, so it doesn't break the Python).




import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.co.uk/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA%2C+United+States"

response = requests.get(url)
html = response.text

soup = BeautifulSoup(html, 'html.parser')

# Pasted from Yelp's inspector (kept as a comment so the Python still runs):
# <a href="/biz/marufuku-ramen-san-francisco-5?osq=Restaurants" class="css-1m051bw" target="_blank" name="Marufuku Ramen" rel="noopener">Marufuku Ramen</a>


Inspecting the link gives us clues about what we want Beautiful Soup to look for. In this case, I want it to look for <a> tags with the class css-1m051bw.

Store results
👉 I've created a new variable to store the result of the Beautiful Soup search.

find_all takes two arguments. The first is the tag name ("a"). The second is a dictionary that tells it which class to search for. This effectively says 'find me all the <a> tags with this class in them.'
I've printed the len of those results to see how many I get back.
import requests
from bs4 import BeautifulSoup
url = "https://www.yelp.co.uk/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA%2C+United+States"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
myLinks = soup.find_all("a", {"class":"css-1m051bw"})
print(len(myLinks))
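
If the length comes back as 0, the site may have served up a bot-check page instead of the real results. A common workaround, not part of the original lesson so treat it as an optional experiment, is to send a browser-style User-Agent header with the request:

# Optional: pretend to be a regular browser so the site returns the real page.
# The header value below is just an example string, not a requirement.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(url, headers=headers)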


Loop it
Now I'll use a loop to output all the links.  

import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.co.uk/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA%2C+United+States"

response = requests.get(url)
html = response.text

soup = BeautifulSoup(html, 'html.parser')

myLinks = soup.find_all("a", {"class":"css-1m051bw"})

print(len(myLinks))

for link in myLinks:
    print(link.text)  

👉 You'll see that the same tag and class are also used in the info about the location and category, so those links are included in the results.

I'm going to use a loop counter and only start printing at the third item, which leaves them off the output.
I'm also going to include the link to each restaurant in the output by using dictionary-style access on the tag: link["href"].

 



counter = 0
for link in myLinks:
  if counter > 1:  # skip the first two results (the location and category links)
    print(link.text)
    print(link["href"])
  counter += 1


Add an f-string
The links in each href are relative paths, not full URLs for the site I'm scraping, so I've formatted the print(link["href"]) as an f-string that adds the site's base address (which I found in the Yelp inspect code).

counter = 0
for link in myLinks:
  if counter > 1:
    print(link.text)
    print(f"""https://www.yelp.com{link["href"]}""")
  counter += 1
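
If you fancy tidying this up, here's a sketch of a more idiomatic version using enumerate, which tracks the counter for you (it assumes the same myLinks list from earlier):

base = "https://www.yelp.com"

for counter, link in enumerate(myLinks):
  if counter > 1:  # still skipping the location and category links
    print(link.text)
    print(f"{base}{link['href']}")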







