Python Web Scraping: Selenium Notes


TODO

Read up on XPath and get comfortable with it; it is closely tied to Selenium.

Selenium

Turning off image loading in Selenium makes pages load much faster.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

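# Disable image loading in Selenium; pages load much faster.
option = Options()
option.page_load_strategy = "none"  # get() returns without waiting for the page to finish loading
prefs = {"profile.managed_default_content_settings.images": 2}  # 2 = block images
option.add_experimental_option("prefs", prefs)  # apply the no-image preference

driver = webdriver.Chrome(options=option)  # chrome_options= is deprecated and gone in recent Selenium 4 releases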

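Handling iframes with BeautifulSoup

One solution is to read the iframe's src attribute, then request and parse that URL separately.
The other is to switch into the iframe with Selenium: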

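from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

driver.get(url)
iframe = driver.find_elements(By.TAG_NAME, "iframe")[1]  # the second iframe on the page
driver.switch_to.frame(iframe)  # the most important step
soup = BeautifulSoup(driver.page_source, "html.parser")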
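The first approach mentioned above (read each iframe's src, then fetch that URL on its own) can be sketched roughly as follows. This is only an illustration: it uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without extra installs, and the names IframeSrcCollector, iframe_urls, and the sample page are made up for this note.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_html, page_url):
    """Return absolute URLs for all iframes found in page_html."""
    collector = IframeSrcCollector()
    collector.feed(page_html)
    # src is often relative, so resolve it against the page URL
    return [urljoin(page_url, src) for src in collector.sources]


# Hypothetical page used only for illustration
sample_html = (
    '<html><body>'
    '<iframe src="/embed/a.html"></iframe>'
    '<iframe src="https://example.org/b"></iframe>'
    '</body></html>'
)
urls = iframe_urls(sample_html, "https://example.com/post/1")
print(urls)
```

In a real session, page_html would be driver.page_source and page_url would be driver.current_url; each returned URL can then be loaded and parsed on its own.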

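Common mistakes, misconceptions, and pitfalls

driver.execute_script(JS) is the call that actually runs JavaScript.
Note that it is execute_script, not execute.

Page waits. These matter a lot.

Explicit wait. It seems more cumbersome to write, and I rarely use it.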

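from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)  # wait up to 10 seconds
element = wait.until(EC.element_to_be_clickable((By.ID, 'someid')))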
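To make the explicit wait less mysterious: conceptually, it keeps calling a condition until the condition returns something truthy or the timeout runs out. The sketch below is a simplified stand-in written for these notes, not Selenium's actual implementation; wait_until and ready are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in condition: becomes truthy on the third call
calls = {"n": 0}


def ready():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else None


found = wait_until(ready, timeout=2.0, poll=0.01)
```

With a real driver, the condition would be something like the EC.element_to_be_clickable(...) object shown above, which WebDriverWait calls with the driver as its argument.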

Implicit wait. This is the one I recommend: set it once and it applies to every later element lookup.

driver.implicitly_wait(10) # seconds

Locating elements

Add this line before locating elements; it is safer.

bot.implicitly_wait(10)  # this line is key

Methods for finding elements

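# Selenium 4.3 removed the find_element_by_* helpers; use find_element(By.<STRATEGY>, value).
from selenium.webdriver.common.by import By

driver.find_element(By.ID, value)
driver.find_element(By.NAME, value)               # the name attribute of the tag
driver.find_element(By.XPATH, value)
driver.find_element(By.LINK_TEXT, value)          # e.g. 'Sign In'
driver.find_element(By.PARTIAL_LINK_TEXT, value)
driver.find_element(By.TAG_NAME, value)
driver.find_element(By.CLASS_NAME, value)
driver.find_element(By.CSS_SELECTOR, value)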

Basic setup and imports

import os
import random
import json
import pickle
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import pyautogui as pt
import pyperclip

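Switching frames

When you run into an iframe, it is best to switch into it; see https://blog.csdn.net/huilan_same/article/details/52200586

driver.switch_to.frame(0)  # locate the frame by index; the first one is 0

Clicking elements. For elements that a normal click cannot reach, use the method below.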

def real_click(self, driver, ele):
    actions = ActionChains(driver)
    actions.move_to_element(ele)
    actions.click(ele)
    actions.perform()
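If the ActionChains click still fails (for example because an overlay covers the element), a commonly used fallback is to click through JavaScript. The sketch below is an assumption-laden illustration: js_click is a hypothetical helper name, and FakeDriver exists only so the example runs without a browser. A JavaScript click bypasses the checks a real user click goes through, so treat it as a last resort.

```python
def js_click(driver, ele):
    """Scroll ele into view, then click it through JavaScript."""
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", ele)
    driver.execute_script("arguments[0].click();", ele)


class FakeDriver:
    """Stand-in that records scripts so the sketch runs without a browser."""

    def __init__(self):
        self.calls = []

    def execute_script(self, script, *args):
        self.calls.append((script, args))


fake = FakeDriver()
js_click(fake, "button-element")
```

With a real driver, ele would be the WebElement returned by find_element, and execute_script passes it to the page as arguments[0].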

Running JS and scrolling the page

# Scroll down first, then scroll back to the top.
# To jump to the bottom instead: window.scrollTo(0, document.body.scrollHeight);

js = "var q=document.documentElement.scrollTop=500"
bot.execute_script(js)

js2 = "document.body.scrollTop=document.documentElement.scrollTop=0;"
bot.execute_script(js2)
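For pages that load more content as you scroll, one possible pattern is to keep scrolling until the page height stops growing. The following is a minimal sketch under that assumption; scroll_to_bottom and FakePage are hypothetical names, and FakePage only simulates a page so the example runs without a browser.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=20):
    """Scroll down until document.body.scrollHeight stops growing.

    Returns the final page height.
    """
    height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == height:
            break
        height = new_height
    return height


class FakePage:
    """Stand-in driver: the page grows by 1000 px per scroll, up to 3000 px."""

    def __init__(self):
        self.height = 1000

    def execute_script(self, script, *args):
        if script.startswith("return"):
            return self.height
        self.height = min(self.height + 1000, 3000)


final_height = scroll_to_bottom(FakePage(), pause=0)
```

The max_rounds cap is there so that an endlessly growing feed cannot trap the loop.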

Filling in forms. I still need to read more about this.

from selenium.webdriver.common.by import By

element = driver.find_element(By.XPATH, "//select[@name='name']")
choices = element.find_elements(By.TAG_NAME, "option")
for c in choices:
    print("Value is: %s" % c.get_attribute("value"))
    c.click()
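The loop above clicks every option in turn; usually only one specific option is wanted. A rough sketch of that is shown here, with choose_option, FakeOption, and FakeSelect all being hypothetical names (the fakes only let the example run without a browser). Selenium also ships a Select class in selenium.webdriver.support.ui with select_by_value, which is probably the better choice in real code.

```python
def choose_option(select_ele, value):
    """Click the <option> whose value attribute equals value.

    Returns True if a matching option was found, otherwise False.
    """
    # "tag name" is the string behind By.TAG_NAME
    for option in select_ele.find_elements("tag name", "option"):
        if option.get_attribute("value") == value:
            option.click()
            return True
    return False


class FakeOption:
    """Stand-in for an <option> WebElement."""

    def __init__(self, value):
        self.value = value
        self.clicked = False

    def get_attribute(self, name):
        return self.value if name == "value" else None

    def click(self):
        self.clicked = True


class FakeSelect:
    """Stand-in for a <select> WebElement."""

    def __init__(self, values):
        self.options = [FakeOption(v) for v in values]

    def find_elements(self, by, value):
        return self.options


select = FakeSelect(["a", "b", "c"])
picked = choose_option(select, "b")
```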

Wrapping some frequently used methods

@staticmethod
def save_html(bot):             # save the page HTML
    filename = 'ret.html'
    data = bot.page_source
    with open(filename, 'w', encoding='utf-8') as f:  # explicit encoding avoids errors on non-ASCII pages
        f.write(data)
    print("HTML saved!")

@staticmethod
def real_click(driver, ele):    # click an element
    actions = ActionChains(driver)
    actions.move_to_element(ele)
    actions.click(ele)
    actions.perform()

@staticmethod
def send_word(ele, word):       # type text into an input box, then press Enter
    ele.clear()
    ele.send_keys(word)
    ele.send_keys(Keys.RETURN)

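Interesting and useful things from the source code

Driver

driver.current_url  # a property, so no parentheses
driver.page_source  # the HTML source; save it for local debugging to cut down on network requests
driver.save_screenshot('foo.png')
driver.get_log('driver')
driver.title  # the page title; well suited as a file name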
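Page titles often contain characters that are not allowed in file names, so the title needs a little cleaning before it is used that way. Here is one possible sketch; safe_filename and its defaults are assumptions made for this note, not anything Selenium provides.

```python
import re


def safe_filename(title, ext=".html", max_len=100):
    """Turn a page title into a file name that is valid on common systems."""
    # Replace characters that Windows, macOS, or Linux reject in file names
    name = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title)
    # Collapse whitespace runs and trim leading/trailing spaces and dots
    name = re.sub(r"\s+", " ", name).strip(" .")
    return (name[:max_len] or "untitled") + ext


name = safe_filename('Selenium: notes / part 1?')
print(name)
```

In practice this would be called as safe_filename(driver.title), with the result passed to open(..., 'w', encoding='utf-8').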

WebElement

ele.id  # available directly; this is Selenium's internal element ID, not the HTML id attribute
ele.get_attribute("class")  # used very often

Article link: https://freeymw.com/article/28879.html
