jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

Overview

jsoup: Java HTML Parser

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

  • scrape and parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe-list, to prevent XSS attacks
  • output tidy HTML

jsoup is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag-soup, jsoup will create a sensible parse tree.
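A minimal sketch of the capabilities listed above (parsing a string, selecting with CSS, cleaning against a safe-list), assuming jsoup 1.14.2 or later is on the classpath. The class name `OverviewSketch`, the sample HTML, and the choice of `Safelist.basic()` are illustrative assumptions, not part of the original text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

class OverviewSketch {
    // Parse an HTML string and return the text of the first element matching the CSS query.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element first = doc.selectFirst(cssQuery);
        return first == null ? "" : first.text();
    }

    // Remove anything the basic safe-list does not permit (for example <script>).
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String html = "<div id=news><p>Hello <b>jsoup</b></p></div>";
        System.out.println(firstText(html, "#news b"));
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

The safe-list you pick decides which tags and attributes survive cleaning, so a stricter or looser one than `Safelist.basic()` may suit your content better.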

See jsoup.org for downloads and the full API documentation.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the In the News section into a list of Elements:

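// Fetch and parse the page (this performs a live HTTP GET)
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title()); // log(...) is a print helper from the full sample, not part of jsoup
// Select the links inside bold text within the element whose id is "mp-itn"
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
  log("%s\n\t%s",
    headline.attr("title"), headline.absUrl("href"));
}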

Online sample, full source.

Open source

jsoup is an open source project distributed under the liberal MIT license. The source code is available at GitHub.

Getting started

  1. Download the latest jsoup jar (or add it to your Maven/Gradle build)
  2. Read the cookbook
  3. Enjoy!

Development and support

If you have any questions on how to use jsoup, or have ideas for future development, please get in touch via the mailing list.

If you find any issues, please file a bug after checking for duplicates.

The colophon talks about the history of and tools used to build jsoup.

Status

jsoup is in general release and is stable.

Issues
  • Debian looking to backport parser changes for CVE-2021-37714

    Hello,

    Thank you for your work on jsoup. Since CVE-2021-37714 is fixed in the latest release, I'd like to backport the fixes to the older releases in Debian. To do that, I need to know which commits are sufficient to backport in order to fix the CVE. On a quick look, it wasn't clear which ones they are. Could you please point me to them? TIA! \o/

    opened by utkarsh2102 3
  • merging HEAD into BODY while use white list

    Replace

    Document dirty = parseBodyFragment(bodyHtml, baseUri);

    with

    Document dirty = parse(bodyHtml, baseUri);

    because parseBodyFragment is what causes the head and body to blend together. Then consider whether the head and the body are each on the white list.

    opened by Ruefors 0
  • Support different whitespace characters

    #1550 Improve the parser. It can now detect three kinds of Unicode space characters (\u00A0, \u0020, and \u3000).

    opened by Paramecium0 1
  • Fix issue#1446

    Fix issue #1446. HTML class names are case-sensitive, while our CSS selectors are case-insensitive. I added a boolean flag so that users can choose whether class selection is case-sensitive. The option currently applies only to Class objects; I can add it for other objects if needed. I am a student taking a software engineering course, and fixing jsoup issues is part of my homework. My code may not be well written; thank you for your understanding, and any suggestions are welcome.

    opened by Ryderxxx 0
  • implement method replaceAll, instead of hiding non supported operations

    #1514 When users want to replace all tags with other tags, they may reach for replaceAll(). This does not work, because Elements inherits replaceAll() from ArrayList, which only swaps entries in the list. Either implementing replaceAll() or hiding the unsupported operations would solve this. This change implements replaceAll() using a for loop and a lambda.

    opened by Paramecium0 0
  • Change how empty attribute names are handled

    When parsing <div ="">, the parser no longer treats the whole of ="" as the attribute key; instead it treats "" as the key and "" as the value. The ="" is retained when the output is HTML, but not when it is XML.

    opened by cqn2219076254 1
  • Fix #1492:Attributes equals now is order sensitive

    Because Attributes is implemented with a map, the comparison can be done by calling the map's equals() from equals() in Attributes.java.

    opened by zoey-shan 0
  • Implement Elements.replaceAll

    Summary: I implemented the replaceAll mentioned in issue #1514 by overriding replaceAll in the Elements class. I also added a test to make sure the replacement happens as expected.

    opened by mct10 0
  • Implement static method Element.of(html)

    I implemented Element.of(html), mentioned in #1411, using the method element.html(). Using a <div> as the top element succeeds in most situations, but it cannot handle cases such as an <html> tag as input. I will be glad to work more on this and welcome your code reviews.

    opened by suarez12138 0
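The whitespace change described in "Support different whitespace characters" above can be illustrated with a small stand-alone sketch. This is not the pull request's code: the class `SpaceSketch` and both method names are hypothetical, and the `normalize` helper is an assumed use of the check, shown only to make the three characters concrete.

```java
class SpaceSketch {
    // True for the three space characters named in the pull request:
    // the ASCII space, the no-break space, and the ideographic space.
    static boolean isSpace(char c) {
        return c == '\u0020' || c == '\u00A0' || c == '\u3000';
    }

    // Collapse each run of those characters into a single ASCII space.
    static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        boolean lastWasSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isSpace(c)) {
                if (!lastWasSpace) {
                    out.append(' ');
                }
                lastWasSpace = true;
            } else {
                out.append(c);
                lastWasSpace = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("a\u00A0\u3000 b"));
    }
}
```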
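The replaceAll problem raised in issue #1514 above comes from the list and the tree being separate structures: the inherited ArrayList method updates the list only. The sketch below shows one way an override could keep both in sync. `Node` and `NodeList` are hypothetical stand-ins for jsoup's Element and Elements, written with the standard library only; they are not the code from either pull request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class ReplaceAllSketch {
    // Hypothetical stand-in for a DOM element: a tag name, a parent link, and children.
    static class Node {
        final String tag;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String tag) { this.tag = tag; }

        Node append(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        // Swap this node for another one in the parent's child list.
        void replaceWith(Node other) {
            if (parent == null || other == this) return;
            int i = parent.children.indexOf(this);
            parent.children.set(i, other);
            other.parent = parent;
            parent = null;
        }
    }

    // Hypothetical stand-in for Elements: replaceAll updates the tree as well as the list.
    static class NodeList extends ArrayList<Node> {
        @Override
        public void replaceAll(UnaryOperator<Node> operator) {
            for (int i = 0; i < size(); i++) {
                Node old = get(i);
                Node replacement = operator.apply(old);
                old.replaceWith(replacement); // keep the tree in sync with the list
                set(i, replacement);
            }
        }
    }

    // Build a div holding two b nodes, replace each with a strong node, return the child tags.
    static List<String> demo() {
        Node div = new Node("div");
        NodeList selected = new NodeList();
        for (int i = 0; i < 2; i++) {
            Node b = new Node("b");
            div.append(b);
            selected.add(b);
        }
        selected.replaceAll(n -> new Node("strong"));
        List<String> tags = new ArrayList<>();
        for (Node child : div.children) {
            tags.add(child.tag);
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Without the override, the same call would leave the parent's children unchanged even though the list itself holds the new nodes.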
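The approach described in the fix for #1492 above, delegating equals() to a backing map, works because java.util.Map equality compares entries and ignores insertion order. The sketch below assumes a map-backed attribute holder; `Attrs` is a hypothetical class for illustration and does not reproduce jsoup's Attributes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AttrEqualsSketch {
    // Hypothetical attribute collection backed by an insertion-ordered map.
    static class Attrs {
        private final Map<String, String> map = new LinkedHashMap<>();

        Attrs put(String key, String value) {
            map.put(key, value);
            return this;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Attrs)) return false;
            // Map.equals compares the key/value pairs, not the order they were added in.
            return map.equals(((Attrs) o).map);
        }

        @Override
        public int hashCode() {
            return map.hashCode();
        }
    }

    public static void main(String[] args) {
        Attrs first = new Attrs().put("id", "x").put("class", "y");
        Attrs second = new Attrs().put("class", "y").put("id", "x");
        System.out.println(first.equals(second));
    }
}
```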
Releases(jsoup-1.14.2)
  • jsoup-1.14.2(Aug 15, 2021)

    Caught by the fuzz! jsoup 1.14.2 is out now, and includes a set of parser bug fixes and improvements for handling rough HTML and XML, as identified by the Jazzer JVM fuzzer. This release also includes other fixes and improvements.

    See the release announcement for the full changelog.

    Source code(tar.gz)
    Source code(zip)
  • jsoup-1.14.1(Jul 9, 2021)

    jsoup 1.14.1 is out now, with simple request session management, increased parse robustness, and a ton of other improvements, speed-ups, and bug fixes.

    See the full announcement for all the details on what's changed.

    Source code(tar.gz)
    Source code(zip)
  • jsoup-1.13.1(Feb 29, 2020)

    jsoup 1.13.1

    See the release notes.

    <dependency>
      <!-- jsoup HTML parser library @ https://jsoup.org/ -->
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.13.1</version>
    </dependency>
    
    Source code(tar.gz)
    Source code(zip)
  • jsoup-1.12.2(Feb 8, 2020)

Owner
Jonathan Hedley
Hacker, author of jsoup, principal solution architect in computer vision at AWS. Opinions are my own.
Nokogiri (鋸) is a Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.

Nokogiri Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. It provides a sensible, easy-to-understand API for reading, writ

Sparkle Motion 5.7k Sep 9, 2021
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

ANTLR v4 Build status ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating

Antlr Project 10.6k Sep 13, 2021
A scalable web crawler framework for Java.

Readme in Chinese A scalable crawler framework. It covers the whole lifecycle of crawler: downloading, url management, content extraction and persiste

Yihua Huang 10k Sep 19, 2021
jQuery-like cross-driver interface in Java for Selenium WebDriver

seleniumQuery Feature-rich jQuery-like Java interface for Selenium WebDriver seleniumQuery is a feature-rich cross-driver Java library that brings a j

null 70 Jun 15, 2021
Automated driver management for Selenium WebDriver

WebDriverManager is a library which allows to automate the management of the drivers (e.g. chromedriver, geckodriver, etc.) required by Selenium WebDr

Boni García 1.7k Sep 13, 2021
Open Source Web Crawler for Java

crawler4j crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can setup a multi-thr

Yasser Ganjisaffar 4.2k Sep 10, 2021
This is public repository for Selenium Learners at TestLeaf

Selenium WebDriver Course for March 2021 Online Learners This is public repository for Selenium Learners at TestLeaf. Week1 - Core Java Basics How Jav

TestLeaf 74 Sep 18, 2021
Concise UI Tests with Java!

Selenide = UI Testing Framework powered by Selenium WebDriver What is Selenide? Selenide is a framework for writing easy-to-read and easy-to-maintain

Selenide 1.4k Sep 18, 2021
Elegant parsing in Java and Scala - lightweight, easy-to-use, powerful.

Please see https://repo1.maven.org/maven2/org/parboiled/ for download access to the artifacts https://github.com/sirthias/parboiled/wiki for all docum

Mathias 1.2k Sep 4, 2021
An implementation of darcy-web that uses Selenium WebDriver as the automation library backend.

darcy-webdriver An implementation of darcy-ui and darcy-web that uses Selenium WebDriver as the automation library backend. maven <dependency> <gr

darcy framework 20 Aug 22, 2020