2017-08-30

Python Crawler Libraries (2)

General

  • urllib – network library (stdlib)
  • requests – network library
  • grab – network library (pycurl based)
  • pycurl – network library (binding to libcurl)
  • urllib3 – Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
  • httplib2 – network library
  • RoboBrowser – A simple, Pythonic library for browsing the web without a standalone web browser.
  • MechanicalSoup – A Python library for automating interaction with websites.
  • mechanize – Stateful programmatic web browsing.
  • socket – low-level networking interface (stdlib)
  • Unirest for Python – Unirest is a set of lightweight HTTP libraries available in multiple languages
  • hyper – HTTP/2 Client for Python
  • PySocks – Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.

Asynchronous

  • treq – requests like API (twisted based)
  • aiohttp – http client/server for asyncio (PEP-3156)

Web-Scraping Frameworks

Full Featured Crawlers

  • grab – web-scraping framework (pycurl/multicurl based)
  • scrapy – web-scraping framework (twisted based). Supports Python 3 as of Scrapy 1.1.
  • pyspider – A powerful spider system.
  • cola – A distributed crawling framework.

Other

  • portia – Visual scraping for Scrapy.
  • restkit – HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.
  • demiurge – PyQuery-based scraping micro-framework.

HTML/XML Parsing

General

  • lxml – efficient HTML/XML processing library with XPath support, built on the C libraries libxml2 and libxslt.
  • cssselect – querying a DOM tree with CSS selectors
  • pyquery – querying a DOM tree with jQuery-like selectors
  • BeautifulSoup – slow HTML/XML processing library, written in pure Python
  • html5lib – builds a DOM from an HTML/XML document according to the WHATWG spec, which all modern browsers follow.
  • feedparser – parsing of RSS/ATOM feeds.
  • MarkupSafe – Implements an XML/HTML/XHTML markup-safe string for Python.
  • xmltodict – Makes working with XML feel like working with JSON.
  • xhtml2pdf – HTML/CSS to PDF converter.
  • untangle – Converts XML documents to Python objects for easy access.

Sanitizing

  • Bleach – cleaning of HTML (requires html5lib)
  • sanitize – Bringing sanity to world of messed-up data.

Text Processing

Libraries for parsing and manipulating plain texts.

General

  • difflib – (Python standard library) Helpers for computing deltas.
  • Levenshtein – Fast computation of Levenshtein distance and string similarity.
  • fuzzywuzzy – Fuzzy String Matching.
  • esmre – Regular expression accelerator.
  • ftfy – Makes Unicode text less broken and more consistent automagically.
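
As a small illustration of the fuzzy-matching idea behind several of the entries above, here is a minimal sketch using only the standard-library difflib. The helper names `similarity` and `closest_title` and the sample titles are hypothetical, chosen only for this example; the third-party libraries listed above expose their own, different APIs.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest_title(query, candidates):
    """Return the best fuzzy match for query, or None if nothing is close."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None


# Hypothetical page titles collected by a crawler.
titles = [
    "Python Crawler Libraries",
    "Python Parser Libraries",
    "Ruby Crawler Gems",
]

# A misspelled query still resolves to the intended title.
best = closest_title("Python Crawler Librarys", titles)
print(best)
```

The `cutoff` argument controls how tolerant the match is; 0.6 is difflib's default and is only a starting point for real data.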

Transliteration

  • unidecode – ASCII transliterations of Unicode text.

Character encoding

  • uniout – Print readable chars instead of the escaped string.
  • chardet – Python 2/3 compatible character encoding detector.
  • xpinyin – A library to translate Chinese hanzi (漢字) to pinyin (拼音).
  • pangu.py – Spacing texts for CJK and alphanumerics.

Slugify

  • awesome-slugify – A Python slugify library that can preserve unicode.
  • python-slugify – A Python slugify library that translates unicode to ASCII.
  • unicode-slugify – A slugifier that generates unicode slugs.
  • pytils – Simple tools for processing strings in Russian (including pytils.translit.slugify)

General Parser

  • PLY – Implementation of lex and yacc parsing tools for Python
  • pyparsing – A general purpose framework for generating parsers.

Human names

Phone Number

  • phonenumbers – Parsing, formatting, storing and validating international phone numbers.

User-agent string

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

General

  • tablib – A module for Tabular Datasets in XLS, CSV, JSON, YAML.
  • textract – Extract text from any document: Word, PowerPoint, PDFs, etc.
  • messytables – Tools for parsing messy tabular data
  • rows – A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT — more coming!)

Office

  • python-docx – Reads, queries and modifies Microsoft Word 2007/2008 docx files.
  • xlwt / xlrd – Writing and reading data and formatting information from Excel files.
  • XlsxWriter – A Python module for creating Excel .xlsx files.
  • xlwings – A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
  • openpyxl – A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
  • Marmir – Takes Python data structures and turns them into spreadsheets.

PDF

  • PDFMiner – A tool for extracting information from PDF documents.
  • PyPDF2 – A library capable of splitting, merging and transforming PDF pages.
  • ReportLab – Allowing Rapid creation of rich PDF documents.
  • pdftables – Extract tables from PDF files directly

Markdown

  • Python-Markdown – A Python implementation of John Gruber’s Markdown.
  • Mistune – A fast, full-featured pure-Python Markdown parser.
  • markdown2 – A fast and complete Python implementation of Markdown

YAML

  • PyYAML – YAML implementations for Python.

CSS

  • cssutils – A CSS library for Python.

ATOM/RSS

SQL

  • sqlparse – A non-validating SQL parser.

HTTP

  • http-parser – HTTP request/response parser for python in C

Microformats

  • opengraph – A Python module to parse the Open Graph Protocol tags

Portable Executable

  • pefile – A multi-platform module to parse and work with Portable Executable (aka PE) files.

PSD

Natural Language Processing

Libraries for working with human languages.

  • NLTK – A leading platform for building Python programs to work with human language data.
  • Pattern – A web mining module for Python. It has tools for natural language processing and machine learning, among others.
  • TextBlob – Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
  • jieba – Chinese word segmentation utilities.
  • SnowNLP – A library for processing Chinese text.
  • loso – Another Chinese segmentation library.
  • genius – A Chinese word segmentation library based on Conditional Random Fields.
  • langid.py – Stand-alone language identification system.
  • Korean – A library for Korean morphology.
  • pymorphy2 – Morphological analyzer (POS tagger + inflection engine) for Russian language.
  • PyPLN – A distributed pipeline for natural language processing, made in Python. The goal of the project is to create an easy way to use NLTK for processing big corpora, with a Web interface.

Browser automation and emulation

Browsers

  • selenium – automating real browsers (Chrome, Firefox, Opera, IE)
  • Ghost.py – wrapper of QtWebKit (requires PyQT)
  • Spynner – wrapper of QtWebKit (requires PyQT)
  • Splinter – universal API to browser emulators (selenium webdrivers, django client, zope)

Headless tools

  • xvfbwrapper – Python wrapper for running a display inside X virtual framebuffer (Xvfb)

Multiprocessing

  • threading – standard Python library to run threads. Effective for I/O-bound tasks, but gives no speedup for CPU-bound tasks because of the Python GIL.
  • multiprocessing – standard python library to run processes.
  • celery – An asynchronous task queue/job queue based on distributed message passing.
  • concurrent-futures – The concurrent.futures module provides a high-level interface for asynchronously executing callables.
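
To make the I/O-bound case concrete, here is a minimal sketch that runs several blocking "downloads" in a thread pool via concurrent.futures. The function `fetch_blocking` is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the example.com URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_blocking(url: str) -> str:
    """Hypothetical stand-in for a blocking HTTP request."""
    time.sleep(0.1)  # simulated network latency; the GIL is released while sleeping
    return f"body of {url}"


urls = [f"http://example.com/page/{i}" for i in range(5)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() returns results in the same order as the input URLs.
    pages = list(pool.map(fetch_blocking, urls))
elapsed = time.perf_counter() - start

print(len(pages), "pages fetched in", round(elapsed, 2), "seconds")
```

Because the threads spend their time waiting rather than computing, the five calls overlap. For CPU-bound work, replacing `ThreadPoolExecutor` with `ProcessPoolExecutor` from the same module is the usual alternative.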

Asynchronous

Libraries for asynchronous networking programming.

  • asyncio – (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
  • Twisted – An event-driven networking engine.
  • Tornado – A Web framework and asynchronous networking library.
  • pulsar – Event-driven concurrent framework for Python.
  • diesel – Greenlet-based event I/O Framework for Python.
  • gevent – A coroutine-based Python networking library that uses greenlet.
  • eventlet – Asynchronous framework with WSGI support.
  • Tomorrow – Magic decorator syntax for asynchronous code.
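
For comparison with the thread-based approach, here is a minimal asyncio sketch that runs several simulated requests concurrently on a single thread. The coroutine `fetch_async` is a hypothetical placeholder that sleeps instead of doing network I/O; a real crawler would call an asynchronous HTTP client at that point. Note that `asyncio.run` requires Python 3.7 or later.

```python
import asyncio


async def fetch_async(url: str, delay: float) -> str:
    """Hypothetical stand-in for an asynchronous HTTP request."""
    await asyncio.sleep(delay)  # yields control to the event loop while "waiting"
    return f"fetched {url}"


async def crawl(urls):
    """Run all simulated fetches concurrently and collect their results."""
    tasks = [fetch_async(u, 0.05) for u in urls]
    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*tasks)


results = asyncio.run(crawl(["http://example.com/a", "http://example.com/b"]))
print(results)
```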

Queue

  • celery – An asynchronous task queue/job queue based on distributed message passing.
  • huey – Little multi-threaded task queue.
  • mrq – Mr. Queue – A distributed worker task queue in Python using Redis & gevent.
  • RQ – lightweight task queue manager based on redis
  • simpleq – A simple, infinitely scalable, Amazon SQS based queue.
  • python-gearman – python API for Gearman

Cloud Computing

  • picloud – executing Python code in the cloud
  • dominoup.com – executing R, Python and MATLAB code in the cloud

Email

Libraries for parsing email.

  • flanker – An email address and MIME parsing library.
  • Talon – Mailgun library to extract message quotations and signatures.

URL and Network Address Manipulation

Libraries for parsing/modifying URLs and network addresses.

URL

  • furl – A small Python library that makes manipulating URLs simple.
  • purl – A simple, immutable URL class with a clean API for interrogation and manipulation.
  • urllib.parse – interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.” (stdlib)
  • tldextract – Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
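
The three operations described for urllib.parse (splitting a URL, recombining it, and resolving a relative URL against a base) can be sketched as follows. The example.com URLs are placeholders chosen for this illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

base = "http://example.com/books/index.html?page=2#top"

# Break the URL into scheme, network location, path, query and fragment.
parts = urlsplit(base)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Recombine the components, dropping the fragment.
rebuilt = urlunsplit(parts._replace(fragment=""))

# Resolve a relative link found on the page against the base URL.
absolute = urljoin(base, "../authors/list.html")

print(rebuilt)
print(absolute)
```

Resolving every extracted link with `urljoin` before queueing it is a common way to keep a crawler's URL queue free of relative paths.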

Network Address

  • netaddr – A Python library for representing and manipulating network addresses.

Web Content Extracting

Libraries for extracting web contents.

Text and Meta Data from HTML pages

  • newspaper – News extraction, article extraction and content curation in Python.
  • html2text – Convert HTML to Markdown-formatted text.
  • python-goose – HTML Content/Article Extractor.
  • lassie – Web Content Retrieval for Humans.
  • micawber – A small library for extracting rich content from URLs.
  • sumy – A module for automatic summarization of text documents and HTML pages.
  • Haul – An Extensible Image Crawler.
  • python-readability – Fast Python port of arc90’s readability tool.
  • scrapely – Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

Video

  • youtube-dl – A small command-line program to download videos from YouTube.
  • you-get – A YouTube/Youku/Niconico video downloader written in Python 3.

Wiki

  • WikiTeam – Tools for downloading and preserving wikis.

WebSocket

Libraries for working with WebSocket.

  • Crossbar – Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
  • AutobahnPython – WebSocket & WAMP for Python on Twisted and asyncio.
  • WebSocket-for-Python – WebSocket client and server library for Python 2 and 3 as well as PyPy.

DNS Resolving

  • dnsyo – Check your DNS against over 1500 global DNS servers.
  • pycares – interface to c-ares. c-ares is a C library that performs DNS requests and name resolutions asynchronously

Computer Vision

  • OpenCV – Open Source Computer Vision Library.
  • SimpleCV – Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
  • mahotas – fast computer vision algorithms (all implemented in C++) operating over numpy arrays.

Proxy Server

  • shadowsocks – A fast tunnel proxy that helps you bypass firewalls (TCP & UDP support, User management API, TCP Fast Open, Workers and graceful restart, Destination IP blacklist)
  • tproxy – a simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routing logic in Python

Misc

  • user_agent – this module is for generating random, valid web navigator’s configs & User-Agent HTTP headers.

Other python lists

Author: Null