7 bite-sized tips for reliable web automation and scraping selectors

If you’re like most developers, you’ve probably encountered Cascading Style Sheets (CSS) selectors for styling webpages. For example, the following CSS rule combines a paragraph element selector p with a class name selector .lead to set the font size:

p.lead {
  font-size: 32px;
}

We can mix and match CSS selectors to describe any subset of elements on a page. There are CSS selectors for HTML tag types, ids, classes, attributes, page structure, and even UX interactions.

Because of their expressiveness, CSS selectors are used everywhere in the web ecosystem:

  • Web Applications (Web API, JQuery): modifying the DOM, attaching event handlers, etc.
  • Automated Testing: finding elements to interact with, checking assertions, etc.
  • Web Scraping: selecting data to extract, finding links to traverse, etc.

Over at PixieBrix, selectors are an integral part of our web customization engine.

Unfortunately, if we don’t control a web page, writing reliable selectors can be an unpleasant experience. Some old-school WYSIWYG-built sites (I’m looking at you, Microsoft FrontPage) and modern Single Page Applications (SPAs) can be downright hostile to work with.

To help folks out, I’ve compiled a set of tips I find useful when writing selectors for enhancing, automating, and scraping 3rd-party sites.

Tip #0: Use JQuery extensions

When working in a browser context (normal or headless), use JQuery’s selector extensions. Selectors such as :eq, :header, and :contains make many selections significantly more straightforward.

A downside is that JQuery extensions are slower because they can’t leverage the browser’s native CSS evaluation. However, for automation and scraping, you probably won’t ever notice. If it does become a problem, there are simple techniques to optimize JQuery selectors.

In addition to CSS+JQuery, it’s also helpful to learn a bit of XPath, an XML/HTML traversal language that browsers support natively.

Tip #1: Avoid structure-based selectors

Browsers and some libraries are able to automatically generate unique CSS selectors for elements.

Automatic generators are great because they can take advantage of the structure of the Document Object Model (DOM). Automatic generators are also terrible because they sometimes rely too much on the structure of the DOM.

Recently, a tool gave me the following selector:

:nth-child(5) > :nth-child(2) > :nth-child(2) > :nth-child(1) > :nth-child(3)

Structural selectors like these are extremely sensitive to small changes on the page, including non-visible changes. (For example: an element with display:none could be inserted before an element.)

These selectors will silently return the wrong element, and we’ll wind up debugging a failed automated test, or ingesting bad extracted data.

Therefore, whenever possible, I avoid overly specific structural selectors in favor of ids, class names, attributes, and textual content.
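To make the failure mode concrete, here is a minimal sketch in Python, using only the standard library as a stand-in for a browser. The markup, the price/stock class names, and the helper names are all made up for illustration. Indexing into the child list plays the role of an :nth-child chain.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same product row before and after a site update
# that inserts a hidden element ahead of the price.
BEFORE = '<div><span class="price">$10</span><span class="stock">In stock</span></div>'
AFTER = (
    '<div><span style="display:none">promo</span>'
    '<span class="price">$10</span><span class="stock">In stock</span></div>'
)


def text_by_position(markup, index):
    """Emulate a structural selector like `div > :nth-child(n)` (0-based here)."""
    root = ET.fromstring(markup)
    return list(root)[index].text


def text_by_class(markup, class_name):
    """Emulate a class selector like `.price` (exact attribute match only)."""
    root = ET.fromstring(markup)
    return root.find(f".//*[@class='{class_name}']").text


for label, markup in (("before", BEFORE), ("after", AFTER)):
    print(label, text_by_position(markup, 0), text_by_class(markup, "price"))
```

After the update, the positional lookup returns the hidden element's text without raising any error, while the class lookup still returns the price.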

Tip #2: Avoid dynamically generated attributes

In modern Single Page Applications (SPAs), element ids and class names are commonly dynamically generated.

Dynamic class names are computed when the code is compiled/bundled. That’s because the class names in the HTML file must match the names in separate CSS stylesheets. Dynamic class names will change whenever the site is updated, unless they are computed deterministically (e.g., using a hash), in which case they’ll change whenever the style is changed.

For example, take the “Google Search” button on the Google homepage. It’s built with Google’s Closure Framework:

<input class="gNO89b" value="Google Search" name="btnK" type="submit">

At first glance, the name="btnK" also looks random, but it’s not. The “I’m Feeling Lucky” button on the page has the name btnI, suggesting they’re human-picked. As a rule of thumb, input name attributes can’t be dynamic because other parts of the application (either front-end or back-end) depend on them.

A common source of dynamic class names in applications is the use of CSS Modules. CSS Modules automatically encapsulate styles to avoid accidental styling collisions. Here’s an example header:

<h1 class="_styles__title_309571057">An example heading</h1>

While we can’t match the exact class name, it will follow a consistent naming pattern across updates of the site. Therefore, we can match on the pattern using CSS’s starts-with attribute selector:

h1[class^="_styles__title_"]
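As a rough illustration of why the prefix match survives a rebuild, the sketch below emulates the starts-with attribute selector in Python with the standard library. The second build's suffix and the helper name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical builds of the same page; only the generated suffix differs.
BUILD_A = '<body><h1 class="_styles__title_309571057">An example heading</h1></body>'
BUILD_B = '<body><h1 class="_styles__title_884120393">An example heading</h1></body>'


def select_by_class_prefix(markup, tag, prefix):
    """Emulate `tag[class^="prefix"]`: the attribute value must start with prefix."""
    root = ET.fromstring(markup)
    return [el for el in root.iter(tag) if el.get("class", "").startswith(prefix)]


for markup in (BUILD_A, BUILD_B):
    matches = select_by_class_prefix(markup, "h1", "_styles__title_")
    print([el.text for el in matches])
```

Note that ^= tests the whole attribute string, so it only matches when the generated class comes first in the class attribute. If other classes can precede it, the substring selector [class*="_styles__title_"] is a looser alternative.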

Dynamic ids, on the other hand, are managed by the SPA framework at runtime. Therefore, they’ll change for each run of the application. Take, for example, the profile header from LinkedIn, which uses the Ember.js framework:

<section id="ember1398" class="pv-top-card artdeco-card ember-view">
  <!-- more elements -->
</section>

Here, the ember1398 id isn’t random — the framework sequentially generates a new id for each element it renders. In practice, the id of an element will differ between page loads because elements don’t always load in the exact same order.
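When reviewing a candidate selector, it can help to flag ids and class names that look generated. The sketch below is a heuristic only: the two patterns and the helper names are my own assumptions, picked to cover the examples in this tip, and a real site will need its own list.

```python
import re

# Heuristic patterns, chosen for illustration only.
GENERATED_PATTERNS = [
    re.compile(r"^ember\d+$"),  # Ember-style sequential ids, e.g. ember1398
    re.compile(r"_\d{6,}$"),    # long numeric suffix, e.g. _styles__title_309571057
]


def looks_generated(value):
    """Return True if an id or class name matches one of the heuristic patterns."""
    return any(pattern.search(value) for pattern in GENERATED_PATTERNS)


def stable_classes(class_attr):
    """Keep only the class names that do not look generated."""
    return [name for name in class_attr.split() if not looks_generated(name)]


print(looks_generated("ember1398"))
print(stable_classes("pv-top-card artdeco-card ember-view"))
```

For the LinkedIn header above, this would steer us away from #ember1398 and toward a class-based selector such as section.pv-top-card.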

Tip #3: Search text with JQuery’s :contains selector

JQuery’s :contains selector selects elements that contain the given text anywhere within them. The selector is not available in plain CSS, but it is available as an XPath function.

For example, consider the following HTML:

<div>Lorem ipsum dolor sit amet, consectetur ...</div>

We could select this element with:

:contains("Lorem ipsum")

If the text is nested inside other tags, we need to add additional selectors to clarify which element we want. For example, the above selector would select the div, p, and b elements in this document:

<div>
 <p><b>Lorem ipsum</b> dolor sit amet, consectetur...</p>
</div>

To select just the div, we’d need to specify the tag in the selector:

div:contains("Lorem ipsum")

In some cases, the text might contain HTML entities, e.g., a non-breaking space:

<div>Lorem&nbsp;ipsum</div>

Trying to select “Lorem ipsum” won’t match here. In these cases, we provide the contains selector multiple times to match only elements that contain both words:

div:contains("Lorem"):contains("ipsum")
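The sketch below shows why the single-phrase search misses and the chained version hits. It emulates :contains in Python with the standard library rather than running JQuery, and the contains_all helper and the second div are invented for the example.

```python
import xml.etree.ElementTree as ET

# &#160; is the numeric form of &nbsp;, which a plain XML parser accepts.
MARKUP = "<body><div>Lorem&#160;ipsum</div><div>Lorem only</div></body>"


def contains_all(element, *needles):
    """Emulate chained :contains(): every needle must appear in the element's text."""
    text = "".join(element.itertext())
    return all(needle in text for needle in needles)


root = ET.fromstring(MARKUP)
divs = root.findall("div")

# "Lorem ipsum" with a regular space never appears, because the page uses U+00A0.
phrase_hits = [d for d in divs if contains_all(d, "Lorem ipsum")]
# Matching the two words separately sidesteps the separator entirely.
word_hits = [d for d in divs if contains_all(d, "Lorem", "ipsum")]
print(len(phrase_hits), len(word_hits))
```

The trade-off is that the chained form is looser: it also matches elements where the two words appear far apart or in the opposite order.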

Finally, a caveat/warning: when writing text selectors, be aware of which text is translated when the page is viewed in a different language.

Tip #4: Target selection with :has

Sometimes we need more precision in targeting a text search. In these cases we can nest a :contains selector inside a :has selector. Suppose we wanted to get the year a property was built from Apartments.com:

<div class="specList">
    <div><h3>Property Information</h3>
      <ul>
         <li><span class="bullet">•</span>Built in 2018</li>
         <li><span class="bullet">•</span>1016 Units/44 Stories</li>
      </ul>
    </div>
</div>

We can find the div with the Property Information header, and then grab the list item corresponding to the year the property was built:

.specList:has(h3:contains('Property Information')) ul > li:contains('Built')

The :has selector can also be used to find the containing element. The following selector matches a list containing an item with “Built”, rather than the individual list item:

ul:has(> li:contains("Built"))
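To show what the first selector does step by step, here is a rough Python emulation using only the standard library. It assumes a well-formed, trimmed copy of the listing markup, and the helper names are hypothetical. The last step, pulling the year out of the matched text, goes beyond the selector itself.

```python
import re
import xml.etree.ElementTree as ET

# A well-formed, trimmed copy of the listing markup (&#8226; is the bullet character).
MARKUP = """
<div class="specList">
  <div><h3>Property Information</h3>
    <ul>
      <li><span class="bullet">&#8226;</span>Built in 2018</li>
      <li><span class="bullet">&#8226;</span>1016 Units/44 Stories</li>
    </ul>
  </div>
</div>
""".strip()


def full_text(element):
    """All text inside an element, including the text of nested tags."""
    return "".join(element.itertext())


def built_items(root):
    """Emulate .specList:has(h3:contains('Property Information')) ul > li:contains('Built')."""
    hits = []
    for spec in root.iter():
        if spec.get("class") != "specList":
            continue
        # :has(h3:contains(...)): keep the block only if it holds the right header.
        if not any("Property Information" in full_text(h3) for h3 in spec.iter("h3")):
            continue
        # ul > li:contains('Built'): direct li children of any ul inside the block.
        for ul in spec.iter("ul"):
            hits.extend(li for li in ul.findall("li") if "Built" in full_text(li))
    return hits


root = ET.fromstring(MARKUP)
years = [re.search(r"\d{4}", full_text(li)).group() for li in built_items(root)]
print(years)
```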

Tip #5: Use ARIA attributes for more readable selectors

Accessible Rich Internet Applications (ARIA) attributes support the accessibility of a site, e.g., for providing context to screen readers. Using them for selectors also makes selectors more readable.

For example, the aria-label attribute labels elements that otherwise would depend on visual cues:

<button aria-label="Close" onclick="myDialog.close()">X</button>

These can be selected against using standard CSS attribute selectors, e.g.:

button[aria-label="Close"]

One thing to watch out for with the aria-label attribute is that it’s usually internationalized/translated alongside other UX text, so a selector that matches on it can break when the page is viewed in another language.

ARIA attributes also provide information about the role of elements on the page (e.g., the role attribute) and even the relationship between elements on the page (e.g., aria-labelledby).

Tip #6: Be on the lookout for automated testing attributes

Developers using automated testing frameworks, e.g., Cypress, add data attributes to make their lives easier when writing automated tests. Sometimes these attributes will find their way to the production site:

<div data-test-id="content"><!-- some content --></div>
<div data-cy="content"><!-- some content --></div>

Testing attributes can be selected just like any other attribute, and are often unique (just like an id):

[data-cy="content"]
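A quick way to check whether a page exposes any testing hooks is to scan it for the usual attribute names. The sketch below does this in Python with the standard library. The attribute list is an assumption (data-testid is the spelling Testing Library uses by default, and teams pick their own names), and find_test_hooks is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

MARKUP = """
<body>
  <div data-test-id="content">first</div>
  <div data-cy="content">second</div>
  <div class="plain">third</div>
</body>
""".strip()

# Assumed list of common testing attribute names; extend it for the site at hand.
TEST_ATTRIBUTES = ("data-test-id", "data-testid", "data-cy")


def find_test_hooks(root):
    """Return (attribute, value) for every testing attribute found, in document order."""
    hooks = []
    for element in root.iter():
        for name in TEST_ATTRIBUTES:
            if name in element.attrib:
                hooks.append((name, element.get(name)))
    return hooks


root = ET.fromstring(MARKUP)
print(find_test_hooks(root))

# The equivalent of the CSS selector [data-cy="content"]:
print([el.text for el in root.findall(".//*[@data-cy='content']")])
```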

Bonus Tip: handle flat key/value pairs with the adjacent sibling combinator

The adjacent sibling combinator (+) matches the element immediately following another element.

Suppose we have the following HTML, structured as multiple key-value pairs per row/list item:

<li class="row">
   <span class="k">Name:</span>
   <span class="v">Laura Smith</span>
   <span class="k">Year:</span>
   <span class="v">2013</span>
</li>

We can look up the value for the year by finding the “Year:” label and then selecting the next element, which will be the value:

span.k:contains('Year:') + span
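The same lookup can be sketched in Python with the standard library by walking the row's children in pairs. This is an emulation of the selector rather than JQuery itself, and value_for is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

MARKUP = """
<li class="row">
   <span class="k">Name:</span>
   <span class="v">Laura Smith</span>
   <span class="k">Year:</span>
   <span class="v">2013</span>
</li>
""".strip()


def value_for(row, label):
    """Emulate span.k:contains(label) + span: the span right after the label."""
    children = list(row)
    # Pair every child with its immediate next sibling.
    for key, following in zip(children, children[1:]):
        is_label = key.tag == "span" and key.get("class") == "k"
        if is_label and label in "".join(key.itertext()) and following.tag == "span":
            return following.text
    return None


row = ET.fromstring(MARKUP)
print(value_for(row, "Year:"))
print(value_for(row, "Name:"))
```

Because + only looks at the immediately following sibling, the approach depends on each value sitting directly after its label, with no wrapper or separator element in between.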

Want more?

If you want more tips like these, don’t forget to follow @pixiebrix on Twitter and join our community on Slack.

Photo by Christopher Gower / Unsplash