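pandas offers several convenient ways to iterate over rows, such as iterrows(), itertuples(), and apply.
polars has very powerful column operations, which are documented in detail on its official site. Because polars is built on Arrow's columnar storage, row operations are inefficient, and the official documentation advises against processing data row by row. Even so, some scenarios still call for row iteration.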
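To make the cost of row access on columnar storage concrete, here is a minimal std-only sketch. `RowBar` and `ColumnBars` are hypothetical names invented for illustration and are not polars types; the sketch only assumes that each column sits in its own contiguous buffer, which is the general idea behind Arrow's layout.

```rust
// Hypothetical types for illustration only; not part of polars.

/// Row-oriented layout: each record keeps all of its fields together.
#[derive(Debug, Clone, PartialEq)]
struct RowBar {
    close: f64,
    volume: f64,
}

/// Column-oriented layout: each field lives in its own contiguous buffer.
struct ColumnBars {
    close: Vec<f64>,
    volume: Vec<f64>,
}

impl ColumnBars {
    /// A column operation scans a single contiguous buffer.
    fn mean_close(&self) -> f64 {
        self.close.iter().sum::<f64>() / self.close.len() as f64
    }

    /// Rebuilding one row has to visit every column buffer.
    fn row(&self, i: usize) -> RowBar {
        RowBar {
            close: self.close[i],
            volume: self.volume[i],
        }
    }
}

fn main() {
    // Made-up sample values.
    let cols = ColumnBars {
        close: vec![10.0, 11.0, 12.0],
        volume: vec![100.0, 200.0, 300.0],
    };
    println!("mean close = {}", cols.mean_close());
    println!("row 1 = {:?}", cols.row(1));
}
```

With nine columns, assembling a single row means nine separate lookups, which is why row-wise processing tends to lose to column-wise processing on this kind of layout.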

So how do you iterate over rows in polars? This post tries an approach that does not use apply.

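Scenario: polars reads a CSV file of historical stock prices containing basic market data. How can the loaded data be iterated row by row quickly? This need is common when backtesting quote-driven strategies.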

I. Initial design

1. Overall plan

1) csv => DataFrame
2) DataFrame => into_struct, which yields a StructChunked
3) StructChunked => iterate over the rows to build the bars
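The three stages can be mimicked without polars to show how the data changes shape. This is only a std-only sketch: `read_columns` and `into_rows` are hypothetical helpers, not polars APIs, the sample prices are made up, and the parser assumes a well-formed rectangular CSV with no quoted commas.

```rust
// Std-only stand-in for the three stages; none of these names are polars APIs.

/// Stage 1: parse CSV text into columns (column-oriented, like a DataFrame).
fn read_columns(csv: &str) -> (Vec<String>, Vec<Vec<String>>) {
    let mut lines = csv.lines();
    let header: Vec<String> = lines
        .next()
        .unwrap_or("")
        .split(',')
        .map(|s| s.trim().to_string())
        .collect();
    let mut columns = vec![Vec::new(); header.len()];
    for line in lines.filter(|l| !l.trim().is_empty()) {
        for (col, cell) in columns.iter_mut().zip(line.split(',')) {
            col.push(cell.trim().to_string());
        }
    }
    (header, columns)
}

/// Stage 2: regroup the columns into one record per row (the "struct" view).
/// The rows only borrow from the columns; no cell is copied.
fn into_rows(columns: &[Vec<String>]) -> Vec<Vec<&str>> {
    let n_rows = columns.first().map_or(0, |c| c.len());
    (0..n_rows)
        .map(|i| columns.iter().map(|c| c[i].as_str()).collect())
        .collect()
}

fn main() {
    let csv = "code,date,close\n000001,2020-01-02,16.87\n000001,2020-01-03,17.18\n";
    let (header, columns) = read_columns(csv);
    let rows = into_rows(&columns);
    // Stage 3: walk the rows one by one.
    for row in &rows {
        println!("{:?} -> {:?}", header, row);
    }
}
```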

2. The Bar type
There are two possible designs for the Bar type:

(1) A Bar that owns its values

#[allow(dead_code)]
struct Bar{
    code:String,
    date:String,
    open:f32,
    high:f32,
    close:f32,
    low:f32,
    volume:f32,
    amount:f32,
    is_fq:bool,
}

(2) A Bar that borrows its strings

#[allow(dead_code)]
struct Bar2<'a>{
    code:&'a str,
    date:&'a str,
    open:f32,
    high:f32,
    close:f32,
    low:f32,
    volume:f32,
    amount:f32,
    is_fq:bool,
}
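The practical difference between the two designs is who owns the string data. The sketch below uses simplified two-field structs, `Owned` and `Borrowed`, which are illustrative names rather than the article's `Bar` and `Bar2`, and checks buffer addresses to show which variant copies.

```rust
// Simplified two-field versions of the two designs; names are illustrative.
struct Owned {
    code: String,
    close: f32,
}

struct Borrowed<'a> {
    code: &'a str,
    close: f32,
}

fn to_owned_bar(code: &str, close: f32) -> Owned {
    // Copies the bytes of `code` into a fresh heap allocation.
    Owned { code: code.to_string(), close }
}

fn to_borrowed_bar(code: &str, close: f32) -> Borrowed<'_> {
    // Stores only a pointer and a length; no allocation happens here.
    Borrowed { code, close }
}

fn main() {
    let source = String::from("000001");
    let owned = to_owned_bar(&source, 16.87);
    let borrowed = to_borrowed_bar(&source, 16.87);

    // The borrowed bar points into `source`; the owned bar has its own buffer.
    println!("borrowed shares buffer: {}", borrowed.code.as_ptr() == source.as_ptr());
    println!("owned shares buffer:    {}", owned.code.as_ptr() == source.as_ptr());
    println!("close values: {} {}", owned.close, borrowed.close);

    // `owned` may outlive `source`; `borrowed` may not, and the compiler enforces it.
    drop(borrowed);
    drop(source);
    println!("owned still valid: {}", owned.code);
}
```

The trade-off is that a borrowed bar is only usable while the data it points into is alive, so the source has to be kept around for as long as the bars are in use.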

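II. Cargo.toml

Note that polars is strict about feature flags: every feature the code relies on has to be enabled explicitly, otherwise it will not compile. The polars documentation often does not spell this out, which makes it an easy trap to fall into.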

[package]
name = "my_duckdb"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
polars = { version = "*", features = ["lazy","dtype-struct"] }

Note that "dtype-struct" must be included in features, because the code relies on the struct type.

III. main.rs

Based on the design above, the complete code is as follows:

use polars::prelude::*;
use std::time::Instant;

#[allow(dead_code)]
struct Bar{
    code:String,
    date:String,
    open:f32,
    high:f32,
    close:f32,
    low:f32,
    volume:f32,
    amount:f32,
    is_fq:bool,
}
#[allow(dead_code)]
struct Bar2<'a>{
    code:&'a str,
    date:&'a str,
    open:f32,
    high:f32,
    close:f32,
    low:f32,
    volume:f32,
    amount:f32,
    is_fq:bool,
}
fn main() {
    let time0 = Instant::now();
    // test2.csv: about 640,000 rows
    let csv = "test2.csv"; 
    let df = polars_lazy_read_csv(csv);
    println!("read raw csv cost time : {:?} seconds",time0.elapsed().as_secs_f32());
    let time1 = Instant::now();
    let rows = df.into_struct("bars");
    println!("dataframe => structs cost time : {:?} seconds",time1.elapsed().as_secs_f32());
    let time2 = Instant::now();
    let bars = get_vec_bars(&rows);
    println!("dataframe => bars cost time : {:?} seconds",time2.elapsed().as_secs_f32());
    let time3 = Instant::now();
    let bar2s = get_vec_bar2s(&rows);
    println!("dataframe => bar2s cost time : {:?} seconds",time3.elapsed().as_secs_f32());
    println!("bars length :{:?}",bars.len());
    println!("bar2s length:{:?}",bar2s.len());
}

// Converts one row (a slice of AnyValue) into an owned Bar.
fn get_bar(row: &[AnyValue]) -> Bar {
    // Columns 0 and 1 are strings; any other type falls back to "".
    let mut new_code = "";
    if let &AnyValue::Utf8(value) = row.get(0).unwrap() {
        new_code = value;
    }
    let mut new_date = "";
    if let &AnyValue::Utf8(v) = row.get(1).unwrap() {
        new_date = v;
    }
    // Columns 2 to 7 are numeric and are extracted as f32.
    let open = row[2].extract::<f32>().unwrap();
    let high = row[3].extract::<f32>().unwrap();
    let close = row[4].extract::<f32>().unwrap();
    let low = row[5].extract::<f32>().unwrap();
    let volume = row[6].extract::<f32>().unwrap();
    let amount = row[7].extract::<f32>().unwrap();
    // Column 8 is a boolean flag; any other type falls back to false.
    let mut is_fq = false;
    if let &AnyValue::Boolean(b) = row.get(8).unwrap() {
        is_fq = b;
    }
    // Copying code and date into Strings costs two heap allocations per row.
    Bar {
        code: String::from(new_code),
        date: String::from(new_date),
        open,
        high,
        close,
        low,
        volume,
        amount,
        is_fq,
    }
}

// Converts one row into a Bar2 that borrows its strings from the row data.
fn get_bar2<'a>(row: &'a [AnyValue]) -> Bar2<'a> {
    // Columns 0 and 1 are strings; any other type falls back to "".
    let mut new_code = "";
    if let &AnyValue::Utf8(value) = row.get(0).unwrap() {
        new_code = value;
    }
    let mut new_date = "";
    if let &AnyValue::Utf8(v) = row.get(1).unwrap() {
        new_date = v;
    }
    // Columns 2 to 7 are numeric and are extracted as f32.
    let open = row[2].extract::<f32>().unwrap();
    let high = row[3].extract::<f32>().unwrap();
    let close = row[4].extract::<f32>().unwrap();
    let low = row[5].extract::<f32>().unwrap();
    let volume = row[6].extract::<f32>().unwrap();
    let amount = row[7].extract::<f32>().unwrap();
    // Column 8 is a boolean flag; any other type falls back to false.
    let mut is_fq = false;
    if let &AnyValue::Boolean(b) = row.get(8).unwrap() {
        is_fq = b;
    }
    // No string copy here: code and date stay as borrowed slices.
    Bar2 {
        code: new_code,
        date: new_date,
        open,
        high,
        close,
        low,
        volume,
        amount,
        is_fq,
    }
}
// Iterates over the struct rows and collects owned bars.
fn get_vec_bars(data: &StructChunked) -> Vec<Bar> {
    let mut bars = Vec::new();
    for row in data {
        bars.push(get_bar(row));
    }
    bars
}

// Iterates over the struct rows and collects bars that borrow from `data`.
fn get_vec_bar2s(data: &StructChunked) -> Vec<Bar2<'_>> {
    let mut bars = Vec::new();
    for row in data {
        bars.push(get_bar2(row));
    }
    bars
}
// Reads the CSV lazily, collects it into a DataFrame, and prints its shape and the elapsed time.
fn polars_lazy_read_csv(filepath: &str) -> DataFrame {
    let polars_lazy_csv_time = Instant::now();
    let p = LazyCsvReader::new(filepath)
        .has_header(true)
        .finish()
        .unwrap();
    let df = p.collect().expect("failed to collect the LazyFrame into a DataFrame");
    println!("polars lazy csv shape (rows, columns): {:?}", df.shape());
    println!("polars lazy csv read time: {:?} seconds", polars_lazy_csv_time.elapsed().as_secs_f32());
    df
}
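The conversion functions above substitute "" or false when a value has an unexpected type, which can hide a schema mismatch. One possible alternative is to return a Result that names the offending column. The sketch below is std-only: `Value`, `MiniBar`, and `try_bar` are hypothetical names, and `Value` merely stands in for polars' `AnyValue` so that the example compiles without the crate.

```rust
// `Value` is a hypothetical stand-in for polars' `AnyValue`.
#[derive(Debug, Clone, PartialEq)]
enum Value<'a> {
    Str(&'a str),
    Float(f64),
    Bool(bool),
}

// A reduced three-field bar, enough to show the pattern.
#[derive(Debug, PartialEq)]
struct MiniBar<'a> {
    code: &'a str,
    close: f32,
    is_fq: bool,
}

/// Converts one row and reports the offending column instead of silently
/// substituting a default when a value is missing or has the wrong type.
fn try_bar<'a>(row: &[Value<'a>]) -> Result<MiniBar<'a>, String> {
    let code = match row.get(0) {
        Some(Value::Str(s)) => *s,
        other => return Err(format!("column 0 (code): expected string, got {:?}", other)),
    };
    let close = match row.get(1) {
        Some(Value::Float(f)) => *f as f32,
        other => return Err(format!("column 1 (close): expected float, got {:?}", other)),
    };
    let is_fq = match row.get(2) {
        Some(Value::Bool(b)) => *b,
        other => return Err(format!("column 2 (is_fq): expected bool, got {:?}", other)),
    };
    Ok(MiniBar { code, close, is_fq })
}

fn main() {
    // Made-up sample rows: one well-formed, one with a wrong type in column 0.
    let good = [Value::Str("000001"), Value::Float(16.5), Value::Bool(false)];
    let bad = [Value::Float(1.0), Value::Float(16.5), Value::Bool(false)];
    println!("{:?}", try_bar(&good));
    println!("{:?}", try_bar(&bad));
}
```

Whether the extra checks are worth it in a hot backtest loop is a judgment call; they are most useful while the CSV schema is still changing.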

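IV. Output and comparison
The test iterates over a CSV file with about 640,000 rows and 9 columns and converts it into a Vec<Bar>.
1. The output is as follows: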

polars lazy csv shape (rows, columns): (640710, 9)
polars lazy csv read time: 0.058484446 seconds
read raw csv cost time : 0.058487203 seconds
dataframe => structs cost time : 2.8842e-5 seconds
dataframe => bars cost time : 0.131985 seconds
dataframe => bar2s cost time : 0.10357016 seconds
bars length :640710
bar2s length:640710

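Overall, the DataFrame-to-struct step is cheap (about 29 microseconds); most of the time goes into converting the StructChunked into bars.

2. Owned Bar vs. borrowed Bar

The output shows that the borrowed Bar is faster in this run: about 0.104 s versus 0.132 s, roughly a 20% reduction, because it avoids the heap allocations needed to copy the strings.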
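The allocation cost can be isolated with a small timing harness that contains no polars code at all. This is a rough sketch on synthetic data, not a reproduction of the measurement above; `collect_owned` and `collect_borrowed` are hypothetical helpers, and the timings will vary by machine and build profile.

```rust
use std::time::Instant;

/// Builds owned copies: one heap allocation per row.
fn collect_owned(codes: &[String]) -> Vec<String> {
    codes.iter().cloned().collect()
}

/// Builds borrowed views: only the output Vec itself is allocated.
fn collect_borrowed(codes: &[String]) -> Vec<&str> {
    codes.iter().map(|c| c.as_str()).collect()
}

fn main() {
    // Synthetic data; 640_710 matches the row count of the test file.
    let codes: Vec<String> = (0..640_710).map(|i| format!("{:06}", i % 5000)).collect();

    let t = Instant::now();
    let owned = collect_owned(&codes);
    println!("owned:    {:?}", t.elapsed());

    let t = Instant::now();
    let borrowed = collect_borrowed(&codes);
    println!("borrowed: {:?}", t.elapsed());

    // Use the results so the optimizer cannot discard the work.
    println!("{} {}", owned.len(), borrowed.len());
}
```

Run it with cargo run --release; debug builds distort this kind of comparison.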

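V. Other notes

So far I have not found a pandas-style row iteration API in polars; I will keep tracking this.
In addition, converting the DataFrame into bars is not very efficient, and I hope to find a faster alternative.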
