Guada Gonzalez: Introduction to Selenium with Twitter Data

Guada Gonzalez

Scraping means ‘to gather, collect’ and it’s a very important tool that we social scientists have to acquire innovative and valuable information that not everyone has access to.

In general, we scrape web pages using two methods. One is based on HTML code, which is simpler, and the other is based on Javascript, which is a bit more complex. For scraping HTML, there are very good tutorials like this, this, and this, using packages like Beautifulsoup in Python and rvest in R. In this case, we will use Selenium in Python to obtain the number of followers and followings that legislators have in Argentina.

It’s not necessary to know Javascript to use Selenium, but it is necessary to be able to identify, in this case, classes within the source codes of the websites. That is, to understand a little bit of HTML and CSS.

Disclaimer to expand the issue:

In HTML, elements have an ID and a class. An “ID” is ___ and a class is “__“. In this case, IDs are indicated with”#” in front and classes with a dot “.” in front.

Downloading the number of followers and followings of legislators. For this, we will first try with a single account and then incorporate it into a loop.

For this, the following packages are necessary:

import random
import BeautifulSoup
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import re
import json

Thirdly, we call the driver. What is this?_ When we run it automatically, we’ll see that an internet tab opens: [Image]

Then we specify the URL we want to scrape together with the get() function that will send us to the web. In this case, for example, we want to scrape my Twitter account.

driver_ = webdriver.Chrome()

url = 'https://twitter.com/guadag12'
driver_.get(url)
sleep(2)  # Adding sleep to ensure page loads. Consider using WebDriverWait for better practice.

WebDriverWait(driver_, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".css-1rynq56.r-bcqeeo.r-qvutc0.r-37j5jr.r-a023e6.r-rjixqe.r-16dba41.r-1loqt21"))
)

Once we run it, we need to see how it opens the new tab. ! Warning: It’s always good practice to give seconds of rest between visiting a website and scraping it, so we give the browser time to load and likewise, we don’t overload the website. Using the “sleep()” function implies being responsible with scraping use because ultimately, it’s a tool that forces the extraction of information in a context where we don’t necessarily ask the company for information. So let’s be responsible and respectful and put a sleep occasionally.

Then we search on the web for the id/class of the element we want to scrape. In this case, as we want the number of followers and followings, we use this:

We use the “find_elements()” function and ask it to bring us an element determined by a CSS feature such as class, and we specify the class as a string: [Image]

What it will automatically do is bring that data, and then we ask it to print it, and it will return something like this.

h4_elements = driver_.find_elements(By.CSS_SELECTOR, ".css-1rynq56.r-bcqeeo.r-qvutc0.r-37j5jr.r-a023e6.r-rjixqe.r-16dba41.r-1loqt21")
type_projects = [element.text.strip() for element in h4_elements]
print(type_projects)

parsed_numbers = [parse_number(num) for num in type_projects]

following = parsed_numbers[0]
followers = parsed_numbers[1]

user_data_new = {
    "User": url,
    "Following": following,
    "Followers": followers
}

Once we finish, we close the session:

driver_.quit()

LOOP with Twitter accounts of Argentine legislators:

What if we want to extract data from a large number of accounts? We need to Bring the dataset with the accounts of Argentine legislators Create an empty list to fill in later and Perform a for loop where in each iteration the information of each user is obtained

But before showing how the code would look like to do this, it’s important to know first that there are cases where legislators appear with “1.1k followers” or “1.3M followers” (millions). What we want to do is replace those “k” or “M” with the corresponding number of zeros. In this regard, this function is created to be used within the loop:

Now, this is how the final loop would look like:

And in this file, you can observe the final data:

For example, if we want to see which block has the highest number of followers on Twitter according to the number of block members, we see that:

This has been everything. Thank you for getting here and reading to the end,

Leave comments for any questions,

Guada

Introduction to Selenium with Twitter Data

Citation