Python Web Scraping Basics: Common Data-Cleaning Techniques

String operations in Python

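1. Splitting a string on any number of delimiters

re.split()

re.split() splits on a regular-expression pattern, so a single call can handle delimiters of varying length or several different delimiters. For one fixed delimiter such as "##", str.split("##") returns the same result. The trailing empty string appears because the input ends with a delimiter.

 import re
 s = "sre##size##hello##490##"
 print(re.split(r"#+", s))
 # Output: ['sre', 'size', 'hello', '490', '']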
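
When scraped text mixes several separators, a character class in re.split() can handle them in one pass. The following is a minimal sketch; the sample string `raw` and the helper name `split_fields` are illustrative assumptions, not part of the original example.

```python
import re

def split_fields(text):
    """Split on runs of '#', ',', ';', '|' or whitespace and drop empty pieces."""
    parts = re.split(r"[#,;|\s]+", text)
    # A leading or trailing delimiter produces empty strings, so filter them out.
    return [p for p in parts if p]

raw = "sre##size, hello;490 | end"  # hypothetical scraped value
print(split_fields(raw))
# Output: ['sre', 'size', 'hello', '490', 'end']
```

Filtering the empty pieces is a design choice: keep them if the position of each field matters, for example when parsing fixed-column records.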

2. Joining strings

str.join()

 s = "sre##size##hello##490##"
 print("".join(s.split("##")))
 # Output: sresizehello490

 print("+".join(s.split("##")))
 # Output: sre+size+hello+490+
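
A common cleaning step built on split() and join() is collapsing irregular whitespace. The sketch below assumes a hypothetical helper `normalize_spaces` and made-up sample values; it also shows that join() only accepts strings, so other types need converting first.

```python
def normalize_spaces(text):
    """Collapse any run of whitespace into a single space and trim both ends."""
    # split() with no argument splits on whitespace runs and drops empty pieces.
    return " ".join(text.split())

messy = "  Python \n\t web   scraping  "  # hypothetical scraped value
print(normalize_spaces(messy))
# Output: Python web scraping

numbers = [1, 2, 3]
# join() raises TypeError on non-string items, so convert each one first.
print(",".join(str(n) for n in numbers))
# Output: 1,2,3
```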

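3. Replacing several characters at once

str.translate(), str.replace()

str.translate() maps single characters according to a table built with str.maketrans(); str.replace() replaces one substring per call.

 s = "sre##size##hello##490##"
 intab = "sl"   # characters to replace
 outtab = "SL"  # replacement characters, matched by position
 trantab = str.maketrans(intab, outtab)
 print(s.translate(trantab))
 # Output: Sre##Size##heLLo##490##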
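
str.maketrans() also accepts a third argument listing characters to delete, which is handy for stripping currency symbols or thousands separators. This is a sketch under assumed inputs: the `price` and `title` strings are invented for illustration.

```python
price = "￥1,299.00 元"  # hypothetical scraped price

# The third argument lists characters to delete outright.
drop_table = str.maketrans("", "", "￥,元 ")
cleaned = price.translate(drop_table)
print(cleaned)
# Output: 1299.00
print(float(cleaned))
# Output: 1299.0

# str.replace() works on substrings; chain calls for several replacements.
title = "hello##world&nbsp;!"  # hypothetical scraped title
print(title.replace("##", " ").replace("&nbsp;", ""))
# Output: hello world!
```

translate() makes one pass over the string regardless of how many characters are mapped, whereas each replace() call scans the string again, so translate() tends to be the better fit for many single-character substitutions.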

4. Matching text at the start or end of a string

 names = ["my.py", "you.txt", "the.xml", "that.wd", "her.py"]
 for name in names:
     if name.endswith(".py"):
         print(name)
 # Output: my.py
 #         her.py
 for name in names:
     if name.startswith("you"):
         print(name)
 # Output: you.txt
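
Both startswith() and endswith() accept a tuple of candidates, which avoids chaining several `or` conditions. A short sketch follows; the `urls` list is an invented example.

```python
files = ["my.py", "you.txt", "the.xml", "that.wd", "her.py"]

# Pass a tuple (not a list) to test several suffixes at once.
wanted = [f for f in files if f.endswith((".py", ".xml"))]
print(wanted)
# Output: ['my.py', 'the.xml', 'her.py']

urls = [  # hypothetical links collected from a page
    "https://example.com/a",
    "ftp://example.com/b",
    "http://example.com/c",
    "/relative/d",
]
absolute = [u for u in urls if u.startswith(("http://", "https://"))]
print(absolute)
# Output: ['https://example.com/a', 'http://example.com/c']
```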

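5. Removing unwanted characters

str.strip()

The argument to strip() is a set of characters to remove from both ends of the string, not a substring.

 s = "sresizehello490\n##"
 print(s.strip("\n##"))
 # Output: sresizehello490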
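
Because strip(), lstrip() and rstrip() treat their argument as a character set, using them to remove a fixed suffix can eat more than intended. The sketch below illustrates the pitfall with made-up values and one possible workaround based on endswith() and slicing.

```python
s = "##hello##world##"
print(s.strip("#"))    # both ends
# Output: hello##world
print(s.lstrip("#"))   # left end only
# Output: hello##world##
print(s.rstrip("#"))   # right end only
# Output: ##hello##world

name = "report.txt"  # hypothetical file name
# Pitfall: every trailing '.', 't' or 'x' is removed, including the 't' of "report".
print(name.rstrip(".txt"))
# Output: repor

# Removing an exact suffix: check for it, then slice it off.
suffix = ".txt"
stem = name[:-len(suffix)] if name.endswith(suffix) else name
print(stem)
# Output: report
```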

6. String formatting with f-strings

 name = 'wy'
 age = 21
 text = f"name:{name},age:{age}"
 print(text)
 # Output: name:wy,age:21
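
f-strings also take a format specification after a colon, which helps when cleaned values need a uniform layout. The values below (`price`, `ratio`, `page`) are hypothetical and only illustrate a few common specifiers.

```python
price = 1299.5   # hypothetical cleaned price
ratio = 0.256    # hypothetical ratio
page = 7         # hypothetical page number

price_text = f"{price:.2f}"   # fixed-point with two decimals
ratio_text = f"{ratio:.1%}"   # percentage with one decimal
page_text = f"{page:03d}"     # zero-padded to width 3

print(price_text)
# Output: 1299.50
print(ratio_text)
# Output: 25.6%
print(page_text)
# Output: 007
```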

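Encoding and decoding

ASCII: 128 characters (7-bit) covering English letters, digits, punctuation and control codes. The figure of 256 refers to 8-bit extensions such as Latin-1.

GBK: a one- or two-byte encoding for Chinese and English text. Two bytes can address at most 65,536 values, but GBK defines only a subset of them (over 20,000 characters).

UTF-8: a variable-length encoding of Unicode (1 to 4 bytes per character), and the most widely used.

Decode a page with the encoding the site itself declares (its "charset").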
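
To make the point about matching encodings concrete, the sketch below encodes the same text as UTF-8 and GBK, shows what decoding with the wrong codec does, and reads a charset from a Content-Type header value. The helper `charset_from_content_type` and the sample header are assumptions for illustration, not part of any library.

```python
import re

text = "爬虫"
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")
print(len(utf8_bytes), len(gbk_bytes))
# Output: 6 4

# Decoding with the codec that produced the bytes round-trips cleanly.
print(gbk_bytes.decode("gbk"))
# Output: 爬虫

# Decoding with the wrong codec yields garbled text (or an error without errors="replace").
garbled = gbk_bytes.decode("utf-8", errors="replace")
print(garbled == text)
# Output: False

def charset_from_content_type(header, default="utf-8"):
    """Hypothetical helper: pull the charset out of a Content-Type header value."""
    match = re.search(r"charset=[\"']?([\w-]+)", header, flags=re.IGNORECASE)
    return match.group(1).lower() if match else default

header = "text/html; charset=GBK"  # hypothetical response header value
encoding = charset_from_content_type(header)
print(encoding)
# Output: gbk
print(gbk_bytes.decode(encoding))
# Output: 爬虫
```

Falling back to UTF-8 when no charset is declared is only a default chosen for this sketch; some pages declare the encoding in a `<meta>` tag instead of the header.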
