The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Hello, I used local memory variables in my kernels but got many compilation warnings like this when compiling them with aocl for an Intel FPGA Arria 10. When the kernels are compiled as the task type (single work-item), they do not produce correct results. However, if I use global memory variables instead of the local memory ones, the results are always correct. When the kernels are compiled as the NDRange type, the results are always correct, regardless of which type of variable (local memory vs. global memory) is used. So I was wondering whether such an aggressive optimization can affect the correctness of the computation. (I checked my code again and again; logically, the store statements that cause this warning should not be removed, otherwise the emulation would definitely give wrong results.) Has anyone else encountered this warning, or does anyone know whether it has any impact on the program semantics? Thanks!

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Great reading material! I'll read the optimized code of rodinia_fpga and the arXiv paper carefully; they will improve my understanding of the optimizations. Also, thank you for sharing your experience with functional verification and how you emulated the kernels. Good practices can help avoid a lot of unnecessary trouble. I'll update the status of my work here once I make significant progress. Finally, thank you again!

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

With respect to functional verification, what I do is construct my host code in a way that both run-time and offline compilation are supported, the latter for FPGAs and the former for other devices, and I use AMD's OpenCL SDK for the other devices.
In this case, as long as the run-time OpenCL driver is installed, the same host code can then be used to execute the same kernel on any type of CPU, GPU or FPGA. You can take a look at the host code/makefiles of the optimized benchmarks in the following repository as example of achieving this: https://github.com/fpga-opencl-benchmarks/rodinia_fpga I emulated all of those kernels on CPUs/GPUs using the same host and kernel codes. What I would tell you is that if an NDRange kernel with sufficiently large local and global size performs correctly on a GPU, it should also perform correctly on an FPGA (unless there is a bug in the FPGA compiler). A CPU should also work fine even if the whole kernel runs on one core, since there will still be multiple threads (work-items) running on that core that could be issued out of order and this is usually enough to show concurrency issues but a GPU would likely be more trustworthy in this case. With respect to, let's say HDL vs. OpenCL, many old-school HDL programmers tend to think that OpenCL or HLS tools in general are insufficient and it is possible to achieve better results using HDL. This is indeed true in some cases like latency-sensitive or low-power applications where clock-by-clock control over the code is required, or applications that are limited by logic resources, but I would not say this is the case for high-throughput applications where limitation is Memory/PCI-E bandwidth or DSP count since these limitations are independent of the programming language. With respect to the particular case of unpipelinable nested loops, HDL or OpenCL would not make a difference. If you have a regular outer loop with an irregular inner loop, the outer loop cannot be pipelined; it doesn't matter how you "describe" the code. There are two ways to approach such loops on FPGAs: 1- Use NDRange and let the run-time work-item scheduler do its best in maximizing pipeline efficiency and minimizing the average loop II . 
2- Collapse the nested loop, as long as it is not too irregular, and get an II of one at the cost of a noticeable Fmax hit. By "collapse" I mean a manual collapse, not the compiler's "coalesce" pragma. Take a look at Section 3.2.4.3 in this document: https://arxiv.org/abs/1810.09773 Even though the provided example involves collapsing a regular nested loop, this optimization also sometimes applies to irregular nested loops. In such a case, the condition inside the collapsed loop that is used to increment the variable of the original outer loop will have more than one statement (which complicates the critical path and reduces the Fmax). Indeed, the possibility also exists to implement parts of your application in HDL and use it as an HDL library in an OpenCL kernel, but you are going to run into complications if your HDL library does not have a fixed latency, and I highly doubt you would be able to achieve much better results in the end. Finally, with respect to NDRange vs. Single Work-item, I recommend reading Section 3.1 (and particularly 3.1.4) of the document I posted above.

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

@HRZ Thank you for such specific code guidance and suggestions! I'll look through my code again with your methods 1 and 2 in mind.

Speaking of concurrency, my project actually has three different implementation versions for now: "CPU Serial", "OpenCL CPU" and the one I showed here, "OpenCL FPGA". The "CPU Serial" version is used to generate the correct results for comparison with the other two; the "OpenCL CPU" version is used to detect potential concurrency bugs or problems.

I forgot to mention in my last reply that another important reason why I removed all barriers from my code is that only doing so makes all work-items in the "OpenCL CPU" version run on DIFFERENT CPU cores. On the one hand, Intel's OpenCL runtime driver for CPUs views a core as a "Compute Unit (CU)" and puts work-items in the same workgroup onto one core.
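HRZ's second method above (the manual collapse) can be sketched in plain C. This is an illustrative stand-in for OpenCL kernel code, with made-up sizes and a simple sum as the loop body; the point is the index-update condition that replaces the original outer-loop increment:

```c
#include <assert.h>

/* Illustrative sketch of manually collapsing a regular 2D nested loop
   into a single loop, so an FPGA compiler sees one loop to pipeline.
   ROWS/COLS and the summing body are made up for this example. */
#define ROWS 4
#define COLS 5

int sum_nested(const int a[ROWS][COLS]) {
    int s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

int sum_collapsed(const int a[ROWS][COLS]) {
    int s = 0;
    int i = 0, j = 0;
    for (int k = 0; k < ROWS * COLS; k++) {
        s += a[i][j];
        /* This condition replaces the outer-loop increment; for an
           irregular inner loop it becomes data-dependent and longer,
           which is the Fmax hit HRZ mentions. */
        j++;
        if (j == COLS) { j = 0; i++; }
    }
    return s;
}
```

Both functions visit the same iterations in the same order; the collapsed form trades the unpipelinable outer loop for extra index-update logic.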
On the other hand, as we all know, OpenCL barriers only apply within one workgroup. As a result, my earlier "OpenCL CPU" versions (one single big kernel, one group) with many barriers could only run on a single core, and no concurrent execution happened! But if I put the work-items into different groups, with each group containing only one item, the results became incorrect because the barriers could no longer play their role. Therefore, I decided to break that big single kernel into multiple ones and remove all barriers at the same time. It then worked stably and always gave the correct results on my own computers. (Note that Intel's OpenCL CPU runtime uses Intel TBB as its underlying mechanism to create work-items (threads).) (I also think putting a whole group onto one core is a limitation of Intel's OpenCL CPU runtime itself, not of OpenCL itself. I have not tried other CPU vendors' runtimes, like AMD's.)

So this is why I did not pay much attention to concurrency issues. I have not implemented an OpenCL GPU version yet. Considering that both CPUs and GPUs use an "instruction-decoding-based" execution style, I feel that there should be no concurrency issues on a GPU either if my CPU version works correctly. I may be wrong, of course. If you think this reasoning is not correct, I'll consider also implementing it on a GPU; it should not be difficult. Unlike CPUs and GPUs, an FPGA implements its concurrency with a direct "circuit-based" execution style, so maybe there is still a chance for concurrency issues to appear on the FPGA. In other words, there may be no "concurrency-test portability" among CPU, GPU and FPGA.

I totally agree with you about which applications to put on an FPGA. Not all applications are suitable for FPGA acceleration. But let me first clarify some concepts before further discussion: "OpenCL-based FPGA" vs. "the FPGA itself".
One faculty member in our group thinks the problem is that the OpenCL compiler cannot generate a highly efficient circuit, not a problem of the FPGA itself, and that using a low-level RTL implementation may solve it.

Take my case for example: the "(de)compression" kernel is actually one big top-level loop which contains many small loops, many of which are IRREGULAR. By "irregular", I mean these loops do not have a fixed number of iterations. As you said, they have "variable loop exit conditions" that are determined by run-time data. So the top-level loop will never be pipelined by the current compiler. (It also seems difficult to break this big top-level loop into multiple parts according to the code's semantics.) Do you think it is possible to handle such irregular cases, including the other cases you mentioned, with an RTL implementation that does not involve an OpenCL compiler? (In my opinion, issues caused by the program's logic or semantics are intrinsic and hard to solve by changing the implementation.)

Actually, I'm now considering two solutions: 1) Implement these irregular loops in an RTL language and then call them through an OpenCL library (the Intel compiler supports this). But I'm not very sure whether this is feasible (especially at the statement level, e.g., calling an API that is an RTL implementation of an irregular loop somewhere inside a kernel). I'm also concerned that the top-level loop might still not be pipelined even if it is feasible. 2) Implement the whole (de)compression function in RTL. Undoubtedly this would take much more time.

Though Intel's compiler supports both the task and NDRange types, it seems that Intel focuses more on the former and implicitly encourages people to use it, since more aggressive optimizations can be performed on it (according to the "Programming Guide" and "Best Practices" manuals). In my case, a SIMD-style optimization instead of a pipelined one should be more appropriate, because multiple work-items access non-overlapping data.
But even for the NDRange type, Intel still uses a pipeline to implement it across work-items. Only their "kernel vectorization" is truly SIMD-style, but it only targets the kernel level (not the statement level). Another optimization, "multiple compute units", is obviously not feasible in my case: a single pair of "decomp" and "compress" kernels already occupies ~50% of the resources. From this perspective, the "instruction-decoding-based" execution style is not limited by your code's logic complexity and size; that's why thousands of simple cores can be put into a GPU to do massive SIMD-style processing. So I think the pipeline is the main advantage of the FPGA, which is what both CPUs and GPUs lack. (Note that the microarchitectural pipeline inside a CPU is specifically for instruction-level parallelism, not the function-level pipeline we are talking about on the FPGA.) It seems impractical to achieve performance comparable to a GPU by replicating function units (especially complex ones).

PS: Quartus 18.1 and 19.1 provide many example OpenCL designs (in 18.1/hld/examples_aoc and 19.1/hld/examples_aoc, respectively). I looked through them quickly and noticed that most of their loops are regular.

Finally, I totally understand that you are busy with your projects and don't have much time. That's fine, and I still appreciate your kindness and your advice!

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

@hiratz

This is the best way I can simplify the problem of local memory and barriers:

1- Is there any instance in your code where work-item "i" writes to point X in a local buffer and work-item "j" reads from that point? (Based on your last reply, it seems the answer is no.) Yes: You need a local memory barrier after every such write operation. --> END No: You do not need to use local memory. --> GOTO 2

2- Is there any instance in your code where work-item "i" reads point Y from global memory multiple times?
(based on your last reply it seems the answer is yes) Yes: Then create a private variable (rather than local) to store the point and reuse it. The size of this variable will depend on how many such points you need to store on-chip per work-item. You do not need to account for the work-group size or SIMD factor in this case. The compiler will automatically create one such buffer for as many work-items as your SIMD factor. --> END No: There is no point in optimization using local/private memory in this case. --> END Getting correct results with one work-item but incorrect results with more than one sounds very much like a concurrency issue to me. If there is no overlapping between the regions that the work-items access in global memory, then the problem must come from local memory or incorrect synchronization. Of course there are also cases where the compiler generates incorrect logic due to some compiler bug, but such cases are quite rare. Considering the fact that you are not using channels (or any other FPGA-specific constructs), have you tried running your modified code on a GPU? I personally would not use Intel's emulator unless I am using such constructs. Running on a GPU is not only much faster, but also allows debugging concurrency issues. Finally, I wouldn't mind academic collaborations but right now I am occupied by my own projects and since I have little knowledge of the code you are trying to port and, to be honest, it looks quite big, I am not sure if I could allocate enough time for such collaboration. One advice I have for you is to first make sure porting this code for FPGAs is worth the time you are spending on it. If it ends up being too slow since the algorithm is not suitable for FPGA acceleration, it would be difficult to justify the time and effort spent on it. 
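Point 2 of the decision list above (caching a repeatedly read global memory location in a private variable) can be sketched as follows, in plain C standing in for OpenCL kernel code; the function names, buffer, and scaling loop are made up for illustration:

```c
#include <assert.h>

/* Naive version: global_buf[idx] is re-read on every loop iteration,
   which on an FPGA means repeated accesses to external memory. */
int sum_scaled_naive(const int *global_buf, int idx, int n) {
    int s = 0;
    for (int k = 1; k <= n; k++)
        s += global_buf[idx] * k;
    return s;
}

/* Cached version: read the value once into a private variable and
   reuse it. Per HRZ's point, the compiler replicates such private
   storage per work-item (up to the SIMD factor) automatically. */
int sum_scaled_cached(const int *global_buf, int idx, int n) {
    int cached = global_buf[idx];   /* single read of global memory */
    int s = 0;
    for (int k = 1; k <= n; k++)
        s += cached * k;            /* all reuse hits private storage */
    return s;
}
```

In a real OpenCL kernel, `cached` would simply be a non-`__local`, non-`__global` variable in the kernel body.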
If the code is memory-intensive/memory-bound, if it largely involves random or indirect memory accesses, or if it cannot be properly pipelined due to loop-carried dependencies or variable loop exit conditions, I wouldn't say the code is a good candidate for FPGA acceleration.

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Done.

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Yup, thanks.

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Do you mean opening a new thread for the "-fast-compile" problem above? If so, I will delete it and open a new question for it.

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Hi, can you open a new thread to address this separately? Thanks

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Moved.

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Intel's HARP environment is documented at: https://wiki.intel-research.net/FPGA.html To apply for access, you can fill out the form at: https://registration.intel-research.net/register Once you get an account, you can send emails to iam@intel-research.net for help requests. (Actually, I already asked about the installation of BSP 1.2 in Intel's HARP community and they replied that they will probably update it soon.)

Update about my progress: I successfully compiled the code for both the NDRange type and the task type. The former took 1:42 hours (1 hour, 42 mins) and the latter took 1:29 hours. Please note: for the task type compilation, __attribute__((task)) is no longer supported in 19.1. You have to replace it with __attribute__((max_global_work_dim(0))).
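For reference, the attribute change reads as follows in a kernel header. This is an illustrative OpenCL kernel-signature fragment, not compilable on its own and not taken from the actual code; the kernel name and argument are made up:

```c
// Accepted by 17.1.1 but no longer recognized by 19.1:
__attribute__((task))
__kernel void decomp(__global double * restrict xy_buffer) { /* ... */ }

// 19.1 equivalent for a single work-item kernel:
__attribute__((max_global_work_dim(0)))
__kernel void decomp(__global double * restrict xy_buffer) { /* ... */ }
```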
In addition, though 17.1.1 allows code like "get_global_id()" or "get_global_size()" to stay in the kernel when compiling a task type version, 19.1 does not. If you leave them in anyway, you will get an obviously abnormal resource utilization report: Logic utilization (423226%), ALUTs (502536%), Dedicated logic registers (82%), Memory blocks (65%), DSP blocks (2%). So you may want to remove them and replace the variable "gid" with 0 and the variable "gsize" with 1. However, the resulting binary (aocx file) does not work correctly for either emulation or hardware execution. The emulation shows the binary has some problems:

Using AOCX: decom_comp.aocx
Error: Malformed program interface definition found in binary: Error: FAILED to read auto-discovery string at byte 44. Full auto-discovery string value is 19 emulatorhash0000000000000000000000000000 EmulatorDevice 0 0 0 0 3 decomp 0 0 0 0 0 0 0 1 0 10 2 1 8 2 1 8 2 1 8 2 1 8 0 0 4 0 0 4 2 1 8 2 1 8 0 0 4 0 0 4 0 0 0 0 0 1 2147483647 0 1 compress 0 0 0 0 0 0 0 1 0 13 2 1 8 2 1 8 0 0 4 0 0 4 0 0 4 2 1 8 2 1 8 0 0 4 0 0 4 2 1 8 2 1 8 0 0 4 0 0 4 0 0 0 0 0 1 2147483647 0 1 merge_streams 0 0 0 0 0 0 0 1 0 8 2 1 8 2 1 8 2 1 8 2 1 8 2 1 8 0 0 4 0 0 4 0 0 4 0 0 0 0 0 1 2147483647 0 1
Error: Invalid binary
ERROR: CL_INVALID_BINARY Location: ../common-fpga/src/AOCLUtils/opencl.cpp:392 Failed to create program with binary

Launching the hardware version shows similar errors. Some additional messages are:

acl_hal_mmd.c:1393:assert failure: Failed to initialize kernel interface
acl_hal_mmd.c:1393: l_try_device: Assertion `0' failed.

So I think there may be some mismatch across Quartus 17.1.1, OpenCL RTE 19.1 and the underlying BSP 1.1. For now, it is better for me to stick with 17.1.1.

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Hi HRZ, @HRZ

Thank you so much for spending time looking at our code and writing so much feedback!!
Actually, I have multiple similar versions of this code. In the one shown here, I used local memory for two key struct variables frequently accessed by many stream-related functions: __local zfp_stream zfp[MAX_SEG]; __local bitstream stream[MAX_SEG]; (in the kernels "decomp" and "compress", respectively). You may notice the global pointer arguments "__global zfp_stream * restrict zfp2, __global bitstream * restrict stream2", which are not used here; they belong to another implementation where zfp and stream are put in global memory. Some members of stream, like "buffer", "bits" and "i" (the current read/write position), are accessed in many called functions. Removing some of the assignment statements to them (the ones causing the warnings) makes the emulation results incorrect. Though emulation cannot reproduce the concurrency, it can tell us whether a function is logically correct (please correct me if I'm wrong). Other buffers, like xy_buffer and xy_bs1 in the kernel "decomp", may be too big to be put into local memory (e.g., for a 2048 x 2048 double matrix, xy_buffer occupies 2k x 2k x 8 bytes = 32 MB).

About the barriers: as you can see from the code, our framework can be constructed as three stages: decompression -> processing -> compression. The processing could be any kind of computation (e.g., processing an image, transposing a matrix, etc.; I did not show its code here). Synchronization is needed between consecutive stages. I once used barriers to synchronize between stages in one of my earlier versions, in which only one kernel was used. Later on I found that inefficient, so I broke the single big kernel into four (three of them are shown in the code here). The synchronization between them is controlled by OpenCL events on the host side, so this becomes a barrier-free design.
If you look at the main compression loop (in codec_2d_public.h) (input: data to be compressed, "xy_buffer"; output: bitstream buffer "begin"):

for (int b = start_b; b < start_b + nblock; b++) {
    zfp_encode_block_double_2(begin, stream, zfp, xy_buffer + b * BLOCK_ITEMS);
}

you can see what I want to do: split an xy plane (like an image) into multiple regions, where one region contains nblock 4x4 blocks. Each work-item compresses just one region. The above loop should be executed by all work-items in parallel, but they access different regions of a big chunk of global memory. zfp and stream contain some control data, like the current bitstream read/write position, etc. Therefore, there is actually no data sharing among the work-items, and no conflict or overlap between them. (One potential synchronization across work-items happens between compression and merge_streams, but it can also be done on the host side.) The decompression loop is similar to the compression one (its input: bitstream buffer xy_bs1; output: xy_buffer).

Unfortunately, my code's NDRange version is not stable. For small matrix sizes (like 64 x 64), it works well; but for large ones (like 256 x 256 or 512 x 512), only using one work-item gives correct results; using more than one gave me wrong results most of the time. I am still not able to find the root cause of this phenomenon. For the task version (using __attribute__((task))), there is also a weird but interesting bug: if I put zfp or stream in local memory (as shown in the code) or private memory (defined as: zfp_stream zfp; bitstream stream), the results are not correct; but if I put them into global memory by defining them through the global pointers, the results are always correct. Still, I don't know what exactly happens behind this (though logically I cannot see anything wrong). I once suspected that something was wrong with the alignment of zfp or stream.
But even when I changed the alignment size in their definitions (codec_2d.h) (e.g., to 256), the problems persisted.

"You seem to be under the impression that you can convert an NDRange kernel to Single Work-item just by adding "__attribute__((task))" to the kernel header. This is indeed not the case..."

You are totally correct! I did not realize this until last night, when I tried the latest 19.1. With 17.1.1, I can simply use __attribute__((task)) even though "get_global_id()" or "get_global_size()" still exist in the code. The reports generated by the initial compilation show that the code is indeed compiled as a single work-item type and most loops are pipelined where possible (but the real underlying implementation may not follow the correct logic even though 17.1.1 compiled it successfully; I have no idea whether the bug I mentioned above is related to this). However, in 19.1 __attribute__((task)) is no longer supported and cannot be recognized by the compiler. I have to use "__attribute__((max_global_work_dim(0)))" instead. In this case, if I still leave "get_global_id()" or "get_global_size()" in my code, I get an obviously incorrect report: Logic utilization (423226%), ALUTs (502536%), Dedicated logic registers (82%), Memory blocks (65%), DSP blocks (2%). After I removed all "get_global_id()" and "get_global_size()" calls and replaced all "gid" with 0, the report looks normal. Please note: __attribute__((reqd_work_group_size(1, 1, 1))) does not make 19.1 identify the code as a single work-item type (it is still viewed as an NDRange type).

"Remember that just because the code works fine in the emulator it does not mean it is actually correct."

I've been stuck on this kind of problem for more than one month. For all my implementation versions, the emulations are always correct, but the hardware executions are not necessarily so. The compression software zfp has not yet provided an FPGA implementation (their GPU version was published just recently).
Is it possible that they already tried an FPGA but found it inefficient? I think I need to contact the authors.

As for the resource utilization numbers across different Quartus versions: yes, they are from the first stage of compilation (it takes 1 ~ 2 minutes). As for the numbers from 18.1.1, I tried the compilation several times and 18.1.1 always gives similar numbers. With 17.1.1, I always get warnings like "Compiler Warning: Auto-unrolled loop at file_path: 40 (line number)" if I do not use "#pragma unroll N". Those are exactly the auto-unrolled loops you mentioned. But with 19.1, they are gone. So you are right, this feature has probably been removed (or disabled) in 19.1.

Finally, would you consider a possible cooperation with us, if you have the interest and time? Currently I am the only programmer in this project and I don't have much experience. If you would like to join, we would consider you a contributor to our project and add your name to the paper we plan to submit in the future :) Thank you again!

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

I had a look at your code; there are some fundamental issues in it which raise doubt about its correctness:

-You are supposedly using local memory, but there is not even one local memory barrier in the whole code (or at least I can't see any). This can only mean one of two things: 1- The code is incorrect. 2- Every local memory read is only done by the same work-item that initially wrote to that location and there is no data sharing between the work-items through local memory (or else barriers would have been required). This essentially means there is no point in using local memory for this code, and the compiler's decision to remove the local memory operations is correct.

-You seem to be under the impression that you can convert an NDRange kernel to Single Work-item just by adding "__attribute__((task))" to the kernel header.
This is indeed not the case (unless the compiler nowadays has the capability to automatically convert NDRange kernels to Single Work-item but I doubt it). You need to replace all NDRange-specific functions like "get_global_id", "get_local_id", etc. with appropriate loops so that the Single Work-item version performs the same number of operations as the NDRange one with loops iterations rather than work-items. I am surprised the compiler actually compiles the code like this! Remember that just because the code works fine in the emulator it does not mean it is actually correct. The emulator does not fully replicate hardware execution environment and specifically, it cannot correctly emulate concurrency issues and race conditions that might exist in your code. Porting code optimized for CPUs/GPUs to FPGAs is extremely inefficient. You will essentially have to "de-optimize" the code first and then re-optimize it for FPGAs. Moreover, considering the size of your code and all the functions involved in it, debugging this code will not be very different from looking for a needle in a haystack. I would recommend starting from an unoptimized implementation instead (if you can get such implementation from the code authors); that would save a lot of time (and headache). >For 17.1.1: they are 58%, 33%, 28%, 59%, 4%, respectively >For 18.0, they are: 57%, 31%, 28%, 62% and 4%, respectively. >For 19.1, they are 52%, 29%, 24%, 57%, 3%, respectively. >But for 18.1, they are: 85%, 54%, 36%, 70% and 14% , respectively. Are these post-place-and-route utilization numbers or resource estimation numbers from the first stage of compilation? The latter is highly unreliable and you should not make any conclusions based on that. Since the compiler is auto-unrolling some of the loops, and the auto-unroll logic might change from one compiler version to another, this could also change resource utilization drastically. 
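The conversion described above, replacing get_global_id() with an explicit loop, can be sketched like this in plain C (the squaring body and the global size are made up for illustration; in a real kernel the loop body would be the original NDRange kernel body):

```c
#include <assert.h>

#define GLOBAL_SIZE 8  /* illustrative stand-in for the NDRange global size */

/* NDRange style: the runtime invokes this once per work-item, with gid
   coming from get_global_id(0). */
void ndrange_body(int gid, int *out) {
    out[gid] = gid * gid;
}

/* Single Work-item style: the kernel itself loops over what used to be
   the global ID space, so one "thread" performs all the work and the
   compiler can pipeline the loop. */
void single_work_item_kernel(int *out) {
    for (int gid = 0; gid < GLOBAL_SIZE; gid++)  /* replaces get_global_id(0) */
        out[gid] = gid * gid;
}
```

The key point is that the same number of operations is performed, just as loop iterations rather than as work-items.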
In fact, auto-unrolling loops seems to have been completely disabled in v19.1, which means your code will definitely behave differently with this version compared to previous ones. Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" Good to hear that, btw, how do you access the Intel Harp machine? I will look into requesting upgrade it to 1.2 version Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" I've been running it for more than 15 minutes and the errors have not yet appeared, which seems a good sign! Previously the errors happened within 5 minutes after I launched it. I will let you know once the compilation is finished successfully. For ndrange (#define attr_setup attr_max_wg), the compilation usually needs 4 ~ 5 hours under 17.1.1; for task (single work item) (#define attr_setup attr_task), the compilation usually is much faster and needs 2 hours under 17.1.1. Here is my configuration script: sys_quartus_dir=/export/quartus_pro my_quartus_dir=$HOME export QSYS_ROOTDIR=${sys_quartus_dir}/17.1.1/qsys/bin export QUARTUS_ROOTDIR=${sys_quartus_dir}/17.1.1/quartus/bin export QUARTUS_ROOTDIR_OVERRIDE=${sys_quartus_dir}/17.1.1/quartus export PATH="${sys_quartus_dir}/17.1.1/quartus/bin/:$PATH" export PATH="${sys_quartus_dir}/17.1.1/qsys/bin:$PATH" # PAC_A10 BSP version 1.1 export AOCL_BOARD_PACKAGE_ROOT=/export/fpga/release/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp # export PATH="$my_quartus_dir/intelFPGA_pro/19.1/hld/board/a10_ref/:$PATH" # set OpenCL version 19.1 export ALTERAOCLSDKROOT=$my_quartus_dir/intelFPGA_pro/19.1/hld export INTELFPGAOCLSDKROOT=$my_quartus_dir/intelFPGA_pro/19.1/hld export PATH=/homes/hiratz/intelFPGA_pro/19.1/hld/bin:$PATH #run the OpenCL Setup script in 19.1 source $my_quartus_dir/intelFPGA_pro/19.1/hld/init_opencl.sh /export/fpga/bin/qsub-fpga Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" 
Then try it on 1.1 first.

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Thanks for your suggestions! Unfortunately, there is no BSP 1.2 on the Intel HARP machine, and I was not able to install it there because the installation needs sudo permission. So I cannot try "AOCL_BOARD_PACKAGE_ROOT set to 1.2".

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Actually, what we suggest is:

INTELFPGAOCLROOT set to Q19.1
QUARTUS_ROOT set to Q17.1.1
AOCL_BOARD_PACKAGE_ROOT set to 1.2

Let me know if you get any errors. I tested it on my side and I am getting an error. This PAC should work with the OpenCL 19.1 RTE and with 17.1.1 Quartus. Thanks

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

@HRZ

KTan9 has posted it again below. (Please understand this is still part of my current work :), so I will just leave it up for a few days.) Let's hope KTan9 (as the Intel expert) can shed some light on this problem. Here is some interesting data across different Quartus versions. For my complete kernel files, the resource utilization breakdowns (Logic utilization, ALUTs, Dedicated logic registers, Memory blocks, DSP blocks) across the various versions are as follows:

For 17.1.1: 58%, 33%, 28%, 59% and 4%, respectively
For 18.0: 57%, 31%, 28%, 62% and 4%, respectively
For 19.1: 52%, 29%, 24%, 57% and 3%, respectively
But for 18.1: 85%, 54%, 36%, 70% and 14%, respectively

(It looks like the latest 19.1 has the best resource utilization, at least for my design!) I asked about this in the HARP community and was told that a change that large suggests the compiler made a significantly different topological decision. These observations are for your reference.

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Here you go.
Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

@hiratz

Can you post your design files again? The file you uploaded before is not available anymore. I don't think your problem will go away by using a newer version of Quartus anyway; the optimizations done by the compiler do not change drastically from one version to another.

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Thanks for the information. I just downloaded the Intel® Acceleration Stack Version 1.2 (a10_gx_pac_ias_1_2_pv_rte_installer.tar.gz) ( https://www.intel.com/content/www/us/en/programmable/solutions/acceleration-hub/downloads.html ), but I was not able to install it on the vLab machine because doing so needs sudo permission, which I don't have. I believe currently only a10_gx_pac_ias_1_1_pv is installed. Sorry about that. If you have sudo permission, you can try it instead.

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Is dcp 1.2 a newer BSP for the Intel PAC A10? Unfortunately, I do not know where it is on the vLab machine. I do find a directory "17.1.1_pac1.2" in /export/fpga/tools/quartus_pro/, but all files in it are actually symbolic links to /export/fpga/tools/quartus_pro/17.1.1. Or can I download it from somewhere? Thanks

Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807309901.html#mwh1391807297091

Intel FPGA SDK for OpenCL Pro Edition and BSP Backwards Compatibility: To use an older BSP with the Intel® FPGA SDK for OpenCL™, you must have a version of Intel® Quartus® Prime Pro Edition with the same version number as your BSP. For example, to use a Version 18.1 BSP with Intel® FPGA SDK for OpenCL™ Pro Edition Version 19.1, you need Intel® Quartus® Prime Pro Edition Version 18.1.
Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" Hi, Can you try acl 19.1 with dcp 1.2 (acds 17.1.1)? Let me know if you run into problems. Thanks, Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" I see. Thank you! I look forward to hearing good news from you. Best Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" Hi, You cannot use Q19.1 for pac_a10. I use a different board for testing purposes only. I will have to get back to you on how to remove the warning, if possible. You may have to stick with Q17.1.1. Thanks Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" Hi KTan9 @KennyT_Intel and HRZ @HRZ, I'm sorry I have to bother you again. I installed Quartus 19.1 in my home directory successfully, but some BSP-related errors happened (I will show them later). According to this post ("Using A10 PAC BSP with OpenCL SDK 18.1", https://forums.intel.com/s/question/0D50P00004894Hn/using-a10-pac-bsp-with-opencl-sdk-181?language=en_US ), one can compile an OpenCL kernel using the latest Intel SDK with an older BSP version. Though @FJumaah shows a detailed configuration procedure, I could not make it work by directly following it (at least on Intel vLab's pac_a10, not a generic Arria 10). And then, I tried the second configuration method: I copied the configuration directory "/export/fpga/bin" to my home "/fpga/bin" and made the following changes: 1) In ./fpga/bin/sh/fpga_classes, change "fpga_quartus_version[fpga-pac-a10]="17.1.1"" to "19.1" (or "18.1", etc.)
2) In all set-*-env files in ./fpga/bin, change "SCRIPT_DIR="/export/fpga/bin"" to "$HOME/fpga/bin" 3) In setup-synth-env, for line 52 - line 57, change "/export/fpga/tools/quartus_pro" to "$HOME/intelFPGA_pro" (For 18.1 and 17.1.1, step 3) is not applied because they are in /export/fpga/tools/quartus_pro.) Then I run "source $HOME/fpga/bin/setup-fpga-env fpga-pac-a10 qsub-fpga" as before. I tested the above modified script with 17.1.1, 18.1 and 19.1. Only 17.1.1 works well. Both 18.1 and 19.1 show BSP-related errors. The configured environment variables and partial errors are shown here respectively (I will attach the file "quartus_sh_compile.log" as well):
INTELFPGAOCLSDKROOT is set to /homes/hiratz/intelFPGA_pro/19.1/hld. Using that.
Will use $QUARTUS_ROOTDIR_OVERRIDE= /homes/hiratz/intelFPGA_pro/19.1/quartus to find Quartus
AOCL_BOARD_PACKAGE_ROOT is set to /export/fpga/release/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp. Using that.
Adding /homes/hiratz/intelFPGA_pro/19.1/hld/bin to PATH
Adding /homes/hiratz/intelFPGA_pro/19.1/hld/host/linux64/lib to LD_LIBRARY_PATH
Adding /export/fpga/release/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp/linux64/lib to LD_LIBRARY_PATH
Configured FPGA environment for fpga-pac-a10:
Quartus: /homes/hiratz/intelFPGA_pro/19.1/quartus
Platform: /export/fpga/release/a10_gx_pac_ias_1_1_pv
OPAE: /export/fpga/opae/install/opae-install-20190112
Starting interactive job on queue fpga-pac-a10
qsub: waiting for job 132804.iam-pbs to start
qsub: job 132804.iam-pbs ready
You can see the BSP I'm using is /export/fpga/release/a10_gx_pac_ias_1_1_pv, which is the default BSP specifically for pac_a10. (Note that the generic A10 BSP is in /homes/hiratz/intelFPGA_pro/19.1/hld/board/a10_ref/.) The SDK I'm using is /homes/hiratz/intelFPGA_pro/19.1/quartus. Both "ALTERAOCLSDKROOT" and "INTELFPGAOCLSDKROOT" are set to "/homes/hiratz/intelFPGA_pro/19.1/hld" The BSP errors are as follows: aoc: Linking with IP library ...
aoc: Checking if memory usage is larger than 100%... aoc: Memory usage is not above 100.
Compiler Warning: addpipe in board_spec.xml is set to 1 which is no longer supported
Compiler Warning: global memory pipeline stage is now implemented in BSP instead
aoc: First stage compilation completed successfully. Compiling for FPGA. This process may take a long time, please be patient.
Error (16045): Instance "ccip_std_afu|bsp_logic_inst|board_inst" instantiates undefined entity "board" File: /homes/hiratz/ndr-test/decom_comp/build/bsp_logic.sv Line: 133
Error (16185): Can't elaborate user hierarchy "ccip_std_afu|bsp_logic_inst|board_inst" File: /homes/hiratz/ndr-test/decom_comp/build/bsp_logic.sv Line: 133
Error (16185): Can't elaborate user hierarchy "ccip_std_afu|bsp_logic_inst" File: /homes/hiratz/ndr-test/decom_comp/build/BBB_cci_mpf/hw/rtl/cci-mpf-if/cci_mpf_if.vh Line: 38
Error (16185): Can't elaborate user hierarchy "ccip_std_afu" File: /homes/hiratz/ndr-test/decom_comp/build/platform/green_bs.sv Line: 183
Error (16186): Can't elaborate top-level user hierarchy
Error: Flow failed: Error: Quartus Prime Synthesis was unsuccessful. 6 errors, 413 warnings
Error (23035): Tcl error: ERROR: Error(s) found while running an executable. See report file(s) for error message(s). Message log indicates which executable was run last.
Error (23031): Evaluation of Tcl script a10_partial_reconfig/flow.tcl unsuccessful
Error: Quartus Prime Shell was unsuccessful. 12 errors, 413 warnings
Error: Compiler Error, not able to generate hardware
For 18.1, it caused similar BSP errors to the above. Note: The above BSP errors only happened for the complete compilation to hardware. The quick initial compilation for a report (aoc -report -v -rtl -I./device device/decom_comp.cl -board=pac_a10) works well.
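The three script edits described in the post boil down to in-place text substitutions. Here is a sketch of the same kind of edit, demonstrated on a throwaway copy so the sed pattern is visible (the real files live under /export/fpga/bin on vLab, so the paths here are stand-ins):

```shell
# Demo on a scratch copy; on vLab you would first do:
#   cp -r /export/fpga/bin "$HOME/fpga/bin"
mkdir -p /tmp/fpga-demo/sh
echo 'fpga_quartus_version[fpga-pac-a10]="17.1.1"' > /tmp/fpga-demo/sh/fpga_classes
# 1) point the pac-a10 class at a different Quartus version
sed -i 's/\(fpga_quartus_version\[fpga-pac-a10\]="\)17\.1\.1"/\119.1"/' /tmp/fpga-demo/sh/fpga_classes
cat /tmp/fpga-demo/sh/fpga_classes
# steps 2) and 3) are analogous substitutions on SCRIPT_DIR and the
# quartus_pro path in the set-*-env and setup-synth-env files
```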
KTan9 @KennyT_Intel: My 19.1 compilation showed neither "Aggressive compiler optimization: removing unnecessary storage to local memory" nor "Aggressive compiler optimization: pushing out local memory contents". So I guess there must be something wrong with my configuration, but I just cannot figure it out after trying again and again. Would you please show me the correct configuration procedure you used for pac_a10 with Quartus 19.1 or 18.1? If you need more information, please let me know. I really appreciate your help! Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" Thanks, HRZ! I will look at the setup scripts on vLab and see if I can configure it with a different compiler version. Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" http://fpgasoftware.intel.com/19.1/?edition=pro Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" Thanks! If a newer version can solve this problem, that will be great. I'll appreciate it if you can guide me to get access to Q19.1. Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" The message is actually in the decom.log files. I tried it on Q17.1 and it shows up there, but with Q19.1 this message is no longer there. I will get back to you on this. Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" The attached zip file (Screen_Output.zip) contains two screenshot pictures. You can compare them with yours. Currently Intel vLab's machine is the only platform I can use. Q17.1.1 is their default version for the pac_a10 board. I'm not sure if I can choose a different version as a customer. (You can see the computing node "vsl111" assigned to me this time in the attached screenshot.)
Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" I am using the Q19.1 version, which was released yesterday. Maybe you can put your screenshot here so that I can compare it from my side. Also, can you try Q19.1? Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" No, I did not get the error "Aggressive compiler optimization: pushing out local memory contents". What I got was the warning "Aggressive aocl optimization: removing unnecessary storage to local memory". What's your aocl version? My version is "aocl 17.1.1.273 (Intel(R) FPGA SDK for OpenCL(TM), Version 17.1.1 Build 273)". I did not see any related report file; these warnings were shown on the screen. Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" I was able to compile your design. Did you get an error something like "Aggressive compiler optimization: pushing out local memory contents"? I tried to locate this in the report but was not able to find it. Where do you see this warning? Can you point me to the correct log files? Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" I know this problem. In the Programming Guide, the attribute max_work_group_size has three parameters, like "__attribute__((max_work_group_size(X, Y, Z)))". However, this made my compilation fail. Later on, I noticed that this post ( https://forums.intel.com/s/question/0D50P00003yyRoGSAU/error-when-specifying-max-work-group-size?language=en_US ) says it should be used with only one parameter. So I used it like "__attribute__((max_work_group_size(256)))" and it compiled successfully on the Intel vLab machine. So you can try "__attribute__((max_work_group_size(256, 1, 1)))" instead. BTW, I tested my files before I uploaded them and they compiled successfully. Let me know if you have other issues when you compile them. Thanks!
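For reference, the two attribute forms discussed above look like this in a kernel file (kernel names and bodies are placeholders, not taken from the attached files); which form is accepted appears to depend on the SDK installation:

```c
// Form from the Programming Guide (three arguments):
__attribute__((max_work_group_size(256, 1, 1)))
__kernel void demo_a(__global int *restrict out) { out[get_global_id(0)] = 1; }

// Single-argument form that compiled on the Intel vLab machine with 17.1.1:
__attribute__((max_work_group_size(256)))
__kernel void demo_b(__global int *restrict out) { out[get_global_id(0)] = 1; }
```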
Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" I am getting the errors below using your .cl files; is there anything that needs to change in your code?
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:49:1: error: 'max_work_group_size' attribute requires exactly 3 arguments attr_setup ^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:46:20: note: expanded from macro 'attr_setup' #define attr_setup attr_max_wg ^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:40:36: note: expanded from macro 'attr_max_wg' #define attr_max_wg __attribute__((max_work_group_size(256))) ^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:83:1: error: 'max_work_group_size' attribute requires exactly 3 arguments attr_setup ^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:46:20: note: expanded from macro 'attr_setup' #define attr_setup attr_max_wg ^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:40:36: note: expanded from macro 'attr_max_wg' #define attr_max_wg __attribute__((max_work_group_size(256))) ^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:111:1: error: 'max_work_group_size' attribute requires exactly 3 arguments attr_setup ^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:46:20: note: expanded from macro 'attr_setup' #define attr_setup attr_max_wg ^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:40:36: note: expanded from macro 'attr_max_wg' #define attr_max_wg __attribute__((max_work_group_size(256)))
Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" Hi KTan9, I really appreciate your help and time! I just attached all (.h and .cl) files (as a .zip file, Hiratz_device.zip) that are needed by the compilation. The main .cl file is decom_comp.cl, which includes the other eight .h files.
I'm implementing the zfp floating-point compression algorithm ( https://computation.llnl.gov/projects/floating-point-compression ) (version 0.5.4). The kernel "decomp" is for decompression, and the kernel "compress" is responsible for compression. Naturally, zfp's block-level decompression/compression is very suitable for a GPU's SIMD processing. For example, given an N x N matrix, it can be split into N_SEG segments or sections that are then processed by multiple work items at the same time. However, I'm still hoping that I can achieve the same acceleration effect on an Intel FPGA with all possible optimizations. Unfortunately, there are still some problems with the current version I attached here. For its NDRange version, besides the warnings I mentioned, num_simd_work_items(n) also does not work if I put it before the kernel function. If I increase the number of work items and the matrix dimension size (i.e., N), the kernel's execution is no longer stable and sometimes produces wrong results (the emulation always shows correct results). For its task (single work item) version, the current version cannot give correct results. If I replace "zfp[MAX_SEG]" and "stream[MAX_SEG]" with global pointers in the kernel arguments (using global memory), the kernel works well but it is quite slow. The two local memory arrays, "zfp[MAX_SEG]" and "stream[MAX_SEG]", are the culprits that cause the warnings I mentioned before. But according to the program semantics, the code lines that cause these warnings should not be removed (otherwise the emulation run would give wrong results). The two most complex nested loops are in "decode_ints_uint64(...)" in decode_2d.h and in "encode_ints_uint64" in encode_2d.h, respectively. Both look similar: each one is a 3-level nested loop, and the inner 2-level nested loop is so irregular that I cannot find any effective method to rewrite or optimize it.
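As a rough illustration of the segment partitioning described above (the helper name, the N/N_SEG parameters, and the row-stripe split are assumptions for illustration, not taken from the attached kernels): zfp's 2D codec works on independent 4x4 blocks, so each work item can be given a disjoint stripe of block rows.

```c
#include <assert.h>

/* Hypothetical host-side helper (not from the attached kernels): compute
 * the row range [row0, row1) of one stripe when an N x N matrix is split
 * into n_seg segments aligned to 4-row boundaries, since zfp's 2D codec
 * operates on independent 4x4 blocks. */
void segment_rows(int N, int n_seg, int seg, int *row0, int *row1) {
    int block_rows = (N + 3) / 4;                    /* rows of 4x4 blocks */
    int per_seg = (block_rows + n_seg - 1) / n_seg;  /* block rows per segment */
    *row0 = seg * per_seg * 4;
    *row1 = (seg + 1) * per_seg * 4;
    if (*row0 > N) *row0 = N;                        /* clamp trailing segments */
    if (*row1 > N) *row1 = N;
}
```

Inside an NDRange kernel, each work item would use get_global_id(0) as seg and do the same index math; the point is only that the stripes touch disjoint rows, so each segment can be encoded or decoded independently.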
In addition, I also tried to use a channel to transfer lseg_size[] from the kernel "compress" to the kernel "merge_stream" but failed. The reason is related to the kernel "compress". Overall, it seems that most of the optimizations mentioned in the Best Practices Guide are hard to apply to this application... The compilation command lines I am using are as follows: 1) for a quick resource utilization report: aoc -v -c -I./device device/decom_comp.cl -o decom_comp.aocx -g -board=pac_a10 2) complete compilation: aoc -v -I./device device/decom_comp.cl -o decom_comp.aocx -g -board=pac_a10 Either 1) or 2) will show the warnings I mentioned before if you run them on the Intel vLab (Harp) machine, but eventually the compilation will succeed. I've been stuck on the above problems for over half a month. I appreciate your help so much! Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" Why don't you attach the *.cl files here so that we can start commenting on them? Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" Thanks for your reply. Actually my kernels are quite complex (I'm porting a data compression/decompression algorithm to the FPGA from its C version) and contain a lot of complex loops (some have a fixed number of iterations known at compilation time, some do not). If I compile them with "__attribute__((task))", report.html shows that some loops are pipelined but others are not. If I compile them as NDRange, no loops are pipelined. What do you mean by "missing variable"? My four kernels compile successfully for both emulation and real hardware. Only the compilation for hardware shows this warning, and only if I use certain key local memory variables. (If I use global or private ones, there are no such warnings.) The emulated version always gives correct results.
Replies: Re: The weird aggressive aocl optimization "removing unnecessary storage to local memory" Hi, 1. Make sure there are no missing variables in your .cl code. 2. If there are no loops and the code is simple, use NDRange. 3. If there are loops, especially ones that can be pipelined, use a single work-item kernel (EnqueueTask). Let me know if it helps? - 2019-03-29
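A minimal sketch of the two styles in points 2 and 3 above (placeholder kernels, not the poster's code): an NDRange kernel exposes parallelism across work items, while a single work-item kernel exposes its loop to the compiler's pipeliner.

```c
// 2) NDRange style: no loop; one work item per element
__kernel void scale_ndrange(__global float *restrict d, const float f) {
    size_t i = get_global_id(0);
    d[i] *= f;
}

// 3) Single work-item style (launched with clEnqueueTask):
//    the compiler tries to pipeline the loop (ideally with II = 1)
__kernel void scale_task(__global float *restrict d, const float f, const int n) {
    for (int i = 0; i < n; i++)
        d[i] *= f;
}
```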
